Pandas Tutorial – DataFrame Basics

Learn the basics of working with a DataFrame in this pandas tutorial.

pandas DataFrame Basics

“The line between failure and success is so fine. . . that we are often on the line and do not know it.”
― Elbert Hubbard

1. Introduction

The DataFrame is the most commonly used data structures in pandas. As such, it is very important to learn various specifics about working with the DataFrame. After learning various methods of creating a DataFrame, let us now delve into some methods for working with it.

2. Getting Started

Import these libraries: pandas, matplotlib for plotting and numpy.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random

If you are working with a Jupyter (or iPython) notebook and want to show graphs inline, use this definition.

%matplotlib inline

Let us now load some CSV data into our DataFrame for working with it. The data we have loaded is the World Happiness Report 2016.

x = pd.read_csv('2016.csv')

3. DataFrame Details

3.1. Index

The attribute index shows the row index labels.

x = pd.read_csv('2016.csv')
print x.index

# prints
RangeIndex(start=0, stop=157, step=1)

The index is a RangeIndex if the labels are contiguous integers.

3.2. Columns

Get the columns using the attribute columns.

print x.columns

# prints
Index([u'Country', u'Region', u'Happiness Rank', u'Happiness Score',
       u'Lower Confidence Interval', u'Upper Confidence Interval',
       u'Economy (GDP per Capita)', u'Family', u'Health (Life Expectancy)',
       u'Freedom', u'Trust (Government Corruption)', u'Generosity',
       u'Dystopia Residual'],
      dtype='object')

3.3. Values

The raw values array can be extracted using values.

print x.values

# prints
[['Denmark' 'Western Europe' 1 ..., 0.44453000000000004 0.36171 2.73939]
 ['Switzerland' 'Western Europe' 2 ..., 0.41203 0.28083 2.69463]
 ['Iceland' 'Western Europe' 3 ..., 0.14975 0.47678000000000004 2.83137]
 ..., 
 ['Togo' 'Sub-Saharan Africa' 155 ..., 0.11587 0.17517 2.1354]
 ['Syria' 'Middle East and Northern Africa' 156 ..., 0.17232999999999998
  0.48396999999999996 0.81789]
 ['Burundi' 'Sub-Saharan Africa' 157 ..., 0.09419 0.2029 2.1040400000000004]]

3.4. Shape

Get a tuple of the number of rows and columns of the DataFrame using the shape attribute.

x.shape

# prints
(157, 13)

3.5. Size

Use the count() method to retrieve a count of (non-NaN) elements in each column. This method ignores any NaN elements in the column.

print x.count()

# prints
Country                          157
Region                           157
Happiness Rank                   157
Happiness Score                  157
Lower Confidence Interval        157
Upper Confidence Interval        157
Economy (GDP per Capita)         157
Family                           157
Health (Life Expectancy)         157
Freedom                          157
Trust (Government Corruption)    157
Generosity                       157
Dystopia Residual                157
dtype: int64

And the size attribute returns the total number of elements (including NaNs) in the DataFrame. This means the value (nrows * ncols).

print x.size

# prints
2041

3.6. Statistics

Get detailed statistics of the DataFrame using the method describe(). Returns various details such as mean, min, max, etc for each column.

print x.describe()

# prints
       Happiness Rank  Happiness Score  Lower Confidence Interval  \
count      157.000000       157.000000                 157.000000   
mean        78.980892         5.382185                   5.282395   
std         45.466030         1.141674                   1.148043   
min          1.000000         2.905000                   2.732000   
25%         40.000000         4.404000                   4.327000   
50%         79.000000         5.314000                   5.237000   
75%        118.000000         6.269000                   6.154000   
max        157.000000         7.526000                   7.460000   

       Upper Confidence Interval  Economy (GDP per Capita)      Family  \
count                 157.000000                157.000000  157.000000   
mean                    5.481975                  0.953880    0.793621   
std                     1.136493                  0.412595    0.266706   
...

4. Head and Tail

The head() method retrieves the first 5 rows from the DataFrame.

x = pd.read_csv('big-data/Salaries.csv')
print x.head()

# prints
   yearID teamID lgID   playerID  salary
0    1985    ATL   NL  barkele01  870000
1    1985    ATL   NL  bedrost01  550000
2    1985    ATL   NL  benedbr01  545000
3    1985    ATL   NL   campri01  633333
4    1985    ATL   NL  ceronri01  625000

And the tail() method retrieves the last 5 rows.

print x.tail()

# prints
       yearID teamID lgID   playerID    salary
26423    2016    WSN   NL  strasst01  10400000
26424    2016    WSN   NL  taylomi02    524000
26425    2016    WSN   NL  treinbl01    524900
26426    2016    WSN   NL  werthja01  21733615
26427    2016    WSN   NL  zimmery01  14000000

The cumulative methods return a DataFrame with the appropriate cumulative function applied to the rows. Some of the operations are not valid for non-numeric columns.

5. Cumulative Sum

cumsum() – Cumulative sum – value of each row is replaced by sum of all prior rows including this row. String value rows use concatenation as shown below.

y = pd.DataFrame({'one': pd.Series(range(5)),
                 'two': pd.Series(range(5, 10)),
                 'three': pd.Series(list('abcde'))})
print 'head =>\n', y.head(), '\n'
print 'cumsum =>\n', y.cumsum(), '\n'

# prints
head =>
   one three  two
0    0     a    5
1    1     b    6
2    2     c    7
3    3     d    8
4    4     e    9 

cumsum =>
  one  three two
0   0      a   5
1   1     ab  11
2   3    abc  18
3   6   abcd  26
4  10  abcde  35

6. Cumulative Product

cumprod() – Cumulative product. Row value is replaced by product of all prior rows. This method is not applicable to non-numeric rows. If there are non-numeric rows in the DataFrame, you will need to extract a subset of the DataFrame as shown.

print 'cumprod =>\n', y[['one', 'two']].cumprod(), '\n'

# prints
cumprod =>
   one    two
0    0      5
1    0     30
2    0    210
3    0   1680
4    0  15120

7. Cumulative Maximum

cummax() – Cumulative max – value of the row is replaced by the maximum value of all prior rows till now. In the example below, for demonstrating this method, we use this method on reversed rows of the original DataFrame.

print 'rev =>\n', y.iloc[::-1], '\n',
print 'cummax =>\n', y.iloc[::-1].cummax(), '\n'

# prints
rev =>
   one three  two
4    4     e    9
3    3     d    8
2    2     c    7
1    1     b    6
0    0     a    5 
cummax =>
  one three two
4   4     e   9
3   4     e   9
2   4     e   9
1   4     e   9
0   4     e   9

8. Cumulative Minimum

cummin() – Similar to cummax(), except computes the minimum of values till this row.

print 'cummin =>\n', y.cummin(), '\n'

# prints
cummin =>
  one three two
0   0     a   5
1   0     a   5
2   0     a   5
3   0     a   5
4   0     a   5

9. Index of Min and Max Values

Use the methods idxmin() and idxmax() to obtain the index label of the rows containing minimum and maximum values. Applicable only to numeric columns, so non-numeric columns need to be filtered out.

y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
                 'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
                 'three': pd.Series(list('ABCDE'), index=list('abcde'))})
print y, '\n'
print 'idxmax =>\n', y[['one', 'two']].idxmax(), '\n'
print 'idxmin =>\n', y[['one', 'two']].idxmin(), '\n'

# prints
   one three  two
a   48     A   25
b   38     B   13
c   62     C   91
d   79     D   32
e    2     E   42 

idxmax =>
one    d
two    c
dtype: object 

idxmin =>
one    e
two    b
dtype: object

10. Value Counts

The method value_counts() returns the number of times each value is repeated in the column. Note: this is not a DataFrame method; rather it is applicable on a column (which is a Series object).

x = pd.read_csv('big-data/Salaries.csv')
print 'top 10 =>\n', x.head(10), '\n'
print 'value_counts =>\n', x['yearID'].value_counts().head(10)

# prints
top 10 =>
   yearID teamID lgID   playerID  salary
0    1985    ATL   NL  barkele01  870000
1    1985    ATL   NL  bedrost01  550000
2    1985    ATL   NL  benedbr01  545000
3    1985    ATL   NL   campri01  633333
4    1985    ATL   NL  ceronri01  625000
5    1985    ATL   NL  chambch01  800000
6    1985    ATL   NL  dedmoje01  150000
7    1985    ATL   NL  forstte01  483333
8    1985    ATL   NL  garbege01  772000
9    1985    ATL   NL  harpete01  250000 

value_counts =>
1999    1006
1998     998
1995     986
1996     931
1997     925
1993     923
1994     884
1990     867
2001     860
2008     856
Name: yearID, dtype: int64

Summary

We covered a few aspects of the DataFrame in this article. Ways of learning various details of the DataFrame including size, shape, statistics, etc. were presented.