“The line between failure and success is so fine. . . that we are often on the line and do not know it.”
― Elbert Hubbard
Contents
1. Introduction
The DataFrame is the most commonly used data structures in pandas. As such, it is very important to learn various specifics about working with the DataFrame. After learning various methods of creating a DataFrame, let us now delve into some methods for working with it.
2. Getting Started
Import these libraries: pandas, matplotlib for plotting and numpy.
import pandas as pd import matplotlib.pyplot as plt import numpy as np import random
If you are working with a Jupyter (or iPython) notebook and want to show graphs inline, use this definition.
%matplotlib inline
Let us now load some CSV data into our DataFrame for working with it. The data we have loaded is the World Happiness Report 2016.
x = pd.read_csv('2016.csv')
3. DataFrame Details
3.1. Index
The attribute index shows the row index labels.
x = pd.read_csv('2016.csv') print x.index # prints RangeIndex(start=0, stop=157, step=1)
The index is a RangeIndex if the labels are contiguous integers.
3.2. Columns
Get the columns using the attribute columns.
print x.columns # prints Index([u'Country', u'Region', u'Happiness Rank', u'Happiness Score', u'Lower Confidence Interval', u'Upper Confidence Interval', u'Economy (GDP per Capita)', u'Family', u'Health (Life Expectancy)', u'Freedom', u'Trust (Government Corruption)', u'Generosity', u'Dystopia Residual'], dtype='object')
3.3. Values
The raw values array can be extracted using values.
print x.values # prints [['Denmark' 'Western Europe' 1 ..., 0.44453000000000004 0.36171 2.73939] ['Switzerland' 'Western Europe' 2 ..., 0.41203 0.28083 2.69463] ['Iceland' 'Western Europe' 3 ..., 0.14975 0.47678000000000004 2.83137] ..., ['Togo' 'Sub-Saharan Africa' 155 ..., 0.11587 0.17517 2.1354] ['Syria' 'Middle East and Northern Africa' 156 ..., 0.17232999999999998 0.48396999999999996 0.81789] ['Burundi' 'Sub-Saharan Africa' 157 ..., 0.09419 0.2029 2.1040400000000004]]
3.4. Shape
Get a tuple of the number of rows and columns of the DataFrame using the shape attribute.
x.shape # prints (157, 13)
3.5. Size
Use the count() method to retrieve a count of (non-NaN) elements in each column. This method ignores any NaN elements in the column.
print x.count() # prints Country 157 Region 157 Happiness Rank 157 Happiness Score 157 Lower Confidence Interval 157 Upper Confidence Interval 157 Economy (GDP per Capita) 157 Family 157 Health (Life Expectancy) 157 Freedom 157 Trust (Government Corruption) 157 Generosity 157 Dystopia Residual 157 dtype: int64
And the size attribute returns the total number of elements (including NaNs) in the DataFrame. This means the value (nrows * ncols).
print x.size # prints 2041
3.6. Statistics
Get detailed statistics of the DataFrame using the method describe(). Returns various details such as mean, min, max, etc for each column.
print x.describe() # prints Happiness Rank Happiness Score Lower Confidence Interval \ count 157.000000 157.000000 157.000000 mean 78.980892 5.382185 5.282395 std 45.466030 1.141674 1.148043 min 1.000000 2.905000 2.732000 25% 40.000000 4.404000 4.327000 50% 79.000000 5.314000 5.237000 75% 118.000000 6.269000 6.154000 max 157.000000 7.526000 7.460000 Upper Confidence Interval Economy (GDP per Capita) Family \ count 157.000000 157.000000 157.000000 mean 5.481975 0.953880 0.793621 std 1.136493 0.412595 0.266706 ...
4. Head and Tail
The head() method retrieves the first 5 rows from the DataFrame.
x = pd.read_csv('big-data/Salaries.csv') print x.head() # prints yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333 4 1985 ATL NL ceronri01 625000
And the tail() method retrieves the last 5 rows.
print x.tail() # prints yearID teamID lgID playerID salary 26423 2016 WSN NL strasst01 10400000 26424 2016 WSN NL taylomi02 524000 26425 2016 WSN NL treinbl01 524900 26426 2016 WSN NL werthja01 21733615 26427 2016 WSN NL zimmery01 14000000
The cumulative methods return a DataFrame with the appropriate cumulative function applied to the rows. Some of the operations are not valid for non-numeric columns.
5. Cumulative Sum
cumsum() – Cumulative sum – value of each row is replaced by sum of all prior rows including this row. String value rows use concatenation as shown below.
y = pd.DataFrame({'one': pd.Series(range(5)), 'two': pd.Series(range(5, 10)), 'three': pd.Series(list('abcde'))}) print 'head =>\n', y.head(), '\n' print 'cumsum =>\n', y.cumsum(), '\n' # prints head => one three two 0 0 a 5 1 1 b 6 2 2 c 7 3 3 d 8 4 4 e 9 cumsum => one three two 0 0 a 5 1 1 ab 11 2 3 abc 18 3 6 abcd 26 4 10 abcde 35
6. Cumulative Product
cumprod() – Cumulative product. Row value is replaced by product of all prior rows. This method is not applicable to non-numeric rows. If there are non-numeric rows in the DataFrame, you will need to extract a subset of the DataFrame as shown.
print 'cumprod =>\n', y[['one', 'two']].cumprod(), '\n' # prints cumprod => one two 0 0 5 1 0 30 2 0 210 3 0 1680 4 0 15120
7. Cumulative Maximum
cummax() – Cumulative max – value of the row is replaced by the maximum value of all prior rows till now. In the example below, for demonstrating this method, we use this method on reversed rows of the original DataFrame.
print 'rev =>\n', y.iloc[::-1], '\n', print 'cummax =>\n', y.iloc[::-1].cummax(), '\n' # prints rev => one three two 4 4 e 9 3 3 d 8 2 2 c 7 1 1 b 6 0 0 a 5 cummax => one three two 4 4 e 9 3 4 e 9 2 4 e 9 1 4 e 9 0 4 e 9
8. Cumulative Minimum
cummin() – Similar to cummax(), except computes the minimum of values till this row.
print 'cummin =>\n', y.cummin(), '\n' # prints cummin => one three two 0 0 a 5 1 0 a 5 2 0 a 5 3 0 a 5 4 0 a 5
9. Index of Min and Max Values
Use the methods idxmin() and idxmax() to obtain the index label of the rows containing minimum and maximum values. Applicable only to numeric columns, so non-numeric columns need to be filtered out.
y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')), 'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')), 'three': pd.Series(list('ABCDE'), index=list('abcde'))}) print y, '\n' print 'idxmax =>\n', y[['one', 'two']].idxmax(), '\n' print 'idxmin =>\n', y[['one', 'two']].idxmin(), '\n' # prints one three two a 48 A 25 b 38 B 13 c 62 C 91 d 79 D 32 e 2 E 42 idxmax => one d two c dtype: object idxmin => one e two b dtype: object
10. Value Counts
The method value_counts() returns the number of times each value is repeated in the column. Note: this is not a DataFrame method; rather it is applicable on a column (which is a Series object).
x = pd.read_csv('big-data/Salaries.csv') print 'top 10 =>\n', x.head(10), '\n' print 'value_counts =>\n', x['yearID'].value_counts().head(10) # prints top 10 => yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333 4 1985 ATL NL ceronri01 625000 5 1985 ATL NL chambch01 800000 6 1985 ATL NL dedmoje01 150000 7 1985 ATL NL forstte01 483333 8 1985 ATL NL garbege01 772000 9 1985 ATL NL harpete01 250000 value_counts => 1999 1006 1998 998 1995 986 1996 931 1997 925 1993 923 1994 884 1990 867 2001 860 2008 856 Name: yearID, dtype: int64
Summary
We covered a few aspects of the DataFrame in this article. Ways of learning various details of the DataFrame including size, shape, statistics, etc. were presented.