Pandas Tutorial – Grouping Examples

Demonstrates grouping of data in pandas DataFrame and compares with SQL.

“Don’t waste your time with explanations: people only hear what they want to hear.”
― Paulo Coelho

1. Introduction

Let us learn about the “grouping-by” operation in pandas. While similar to the SQL “group by”, the pandas version is much more powerful since you can use user-defined functions at various points including splitting, applying and combining results.

2. What is Grouping By?

The “grouping-by” is a tool which is used to aggregate and summarize groups within a dataset. It helps in identifying patterns within data. It is a very important operation not only in pandas but in data analysis in general. SQL databases provide a similar “GROUP BY” clause which performs a similar functionality.

3. Getting Started

We need the following imports of python modules.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

To prepare and view charts, we need the following in a Jupyter notebook.

%matplotlib inline

In some of the examples in the following sections, we compare the operation with an SQL database. The SQL database we are using is sqlite into which we load data from the same CSV file.

4. A Single Column

Here is a simple example using a single column. We load data into a DataFrame and create a GroupBy object using the groupingBy() method. The group sum is then printed using the method sum(). The sum represents total salary for each year (which is the grouping column).

x = pd.read_csv('big-data/Salaries.csv')
print x.head(), '\n'
g = x.groupby('yearID')
print 'Group sum =>\n', g.sum().head()

# prints
   yearID teamID lgID   playerID  salary
0    1985    ATL   NL  barkele01  870000
1    1985    ATL   NL  bedrost01  550000
2    1985    ATL   NL  benedbr01  545000
3    1985    ATL   NL   campri01  633333
4    1985    ATL   NL  ceronri01  625000 

Group sum =>
           salary
yearID           
1985    261964696
1986    307854518
1987    272575375
1988    300452424
1989    359995711

Here is the same operation in SQL:

select yearID, sum(salary) from salary group by yearID;

yearID      sum(salary)
----------  -----------
1985        261964696
1986        307854518
1987        272575375
1988        300452424
1989        359995711
1990        443881193
1991        613048418
...

Pretty simple, right? Since the data appears to have more groups, let us try with multiple groups.

5. Multiple Columns

We now group the data using multiple columns and run the sum() operation. The sum is now the total salary for each (year, team, league) group.

print x.head(), '\n'
g = x.groupby(['yearID', 'teamID', 'lgID'])
print 'Group sum =>\n', g.sum().head()

# prints
   yearID teamID lgID   playerID  salary
0    1985    ATL   NL  barkele01  870000
1    1985    ATL   NL  bedrost01  550000
2    1985    ATL   NL  benedbr01  545000
3    1985    ATL   NL   campri01  633333
4    1985    ATL   NL  ceronri01  625000 

Group sum =>
                      salary
yearID teamID lgID          
1985   ATL    NL    14807000
       BAL    AL    11560712
       BOS    AL    10897560
       CAL    AL    14427894
       CHA    AL     9846178

With SQL:

select yearID, teamID, lgID, sum(salary)
from salary
group by yearID, teamID, lgID
limit 5;

yearID      teamID      lgID        sum(salary)
----------  ----------  ----------  -----------
1985        ATL         NL          14807000
1985        BAL         AL          11560712
1985        BOS         AL          10897560
1985        CAL         AL          14427894
1985        CHA         AL          9846178

6. Select a Single Group

The method get_group() accepts a tuple of values and returns a DataFrame of the group members.

print x.head(), '\n'
g = x.groupby(['yearID', 'teamID', 'lgID'])
print 'Group sum =>\n', g.sum().head(), '\n'
print 'get_group =>\n', g.get_group((1985, 'ATL', 'NL')), '\n'

# prints
   yearID teamID lgID   playerID  salary
0    1985    ATL   NL  barkele01  870000
1    1985    ATL   NL  bedrost01  550000
2    1985    ATL   NL  benedbr01  545000
3    1985    ATL   NL   campri01  633333
4    1985    ATL   NL  ceronri01  625000 

Group sum =>
                      salary
yearID teamID lgID          
1985   ATL    NL    14807000
       BAL    AL    11560712
       BOS    AL    10897560
       CAL    AL    14427894
       CHA    AL     9846178 

get_group =>
     playerID   salary
0   barkele01   870000
1   bedrost01   550000
2   benedbr01   545000
3    campri01   633333
4   ceronri01   625000
...

Which is practically the same as querying the original DataFrame for the values:

print x[(x.yearID == 1985) & (x.teamID == 'ATL') & (x.lgID == 'NL')]

# prints
    yearID teamID lgID   playerID   salary
0     1985    ATL   NL  barkele01   870000
1     1985    ATL   NL  bedrost01   550000
2     1985    ATL   NL  benedbr01   545000
3     1985    ATL   NL   campri01   633333
4     1985    ATL   NL  ceronri01   625000
...

Which in SQL is a straightforward query:

select * from salary where yearID = '1985' and teamID = 'ATL' and lgID = 'NL';

yearID      teamID      lgID        playerID    salary
----------  ----------  ----------  ----------  ----------
1985        ATL         NL          barkele01   870000
1985        ATL         NL          bedrost01   550000
1985        ATL         NL          benedbr01   545000
1985        ATL         NL          campri01    633333
1985        ATL         NL          ceronri01   625000
1985        ATL         NL          chambch01   800000
...

7. Iterating Over Groups

Unlike with SQL, you can iterate over the groups once the GroupingBy object is computed. See example below for how to compute the group sum.

for name, grp in x.groupby(['yearID', 'teamID', 'lgID']):
    print name, '\n', grp

# prints
(1985, 'ATL', 'NL') 
    yearID teamID lgID   playerID   salary
0     1985    ATL   NL  barkele01   870000
1     1985    ATL   NL  bedrost01   550000
2     1985    ATL   NL  benedbr01   545000
3     1985    ATL   NL   campri01   633333
4     1985    ATL   NL  ceronri01   625000
5     1985    ATL   NL  chambch01   800000
...
(1985, 'BAL', 'AL') 
    yearID teamID lgID   playerID   salary
22    1985    BAL   AL  boddimi01   625000
23    1985    BAL   AL  dauerri01   480000
24    1985    BAL   AL  davisst02   437500
25    1985    BAL   AL  dempsri01   512500
...

7.1. How to Compute a Group Aggregate

One example of using the group iteration is to compute a group aggregate. For example, here is how you can compute a sum of a field over the group.

Note that most common aggregate functions are already provided as methods of the group object, including the sum().  This example merely illustrates one way to apply an aggregate in case you want to do something special with the groups.

n = 0
for name, grp in g:
    sal = 0
    for row in grp.itertuples():
        sal += row.salary
    print name, '=>', sal
    n += 1
    if n == 10: break

# equivalent
print g.sum().head(10)

# prints
(1985, 'ATL', 'NL') => 14807000
(1985, 'BAL', 'AL') => 11560712
(1985, 'BOS', 'AL') => 10897560
(1985, 'CAL', 'AL') => 14427894
(1985, 'CHA', 'AL') => 9846178
(1985, 'CHN', 'NL') => 12702917
(1985, 'CIN', 'NL') => 8359917
(1985, 'CLE', 'AL') => 6551666
(1985, 'DET', 'AL') => 10348143
(1985, 'HOU', 'NL') => 9993051

                      salary
yearID teamID lgID          
1985   ATL    NL    14807000
       BAL    AL    11560712
       BOS    AL    10897560
       CAL    AL    14427894
       CHA    AL     9846178
       CHN    NL    12702917
       CIN    NL     8359917
       CLE    AL     6551666
       DET    AL    10348143
       HOU    NL     9993051

8. Aggregate Functions

The GroupBy object supports several aggregate functions. Some of the common ones are listed below.

8.1. sum() – Sum groups

print 'data =>\n', y, '\n'
print 'sum =>\n', g.sum(), '\n'

# prints
data =>
    yearID teamID lgID   playerID   salary
0     1985    ATL   NL  barkele01   870000
1     1985    ATL   NL  bedrost01   550000
2     1985    ATL   NL  benedbr01   545000
3     1985    ATL   NL   campri01   633333
4     1985    ATL   NL  ceronri01   625000
5     1985    ATL   NL  chambch01   800000
6     1985    ATL   NL  dedmoje01   150000
7     1985    ATL   NL  forstte01   483333
8     1985    ATL   NL  garbege01   772000
9     1985    ATL   NL  harpete01   250000
10    1985    ATL   NL  hornebo01  1500000
11    1985    ATL   NL  hubbagl01   455000
12    1985    ATL   NL  mahleri01   407500
13    1985    ATL   NL  mcmurcr01   275000
14    1985    ATL   NL  mumphje01   775000
15    1985    ATL   NL  murphda05  1625000
16    1985    ATL   NL  oberkke01   616667
17    1985    ATL   NL  perezpa01   450000
18    1985    ATL   NL  perryge01   120000
19    1985    ATL   NL  ramirra01   750000
20    1985    ATL   NL  suttebr01  1354167
21    1985    ATL   NL  washicl01   800000
22    1985    BAL   AL  boddimi01   625000
23    1985    BAL   AL  dauerri01   480000
24    1985    BAL   AL  davisst02   437500 

sum =>
                      salary
yearID teamID lgID          
1985   ATL    NL    14807000
       BAL    AL     1542500

8.2. mean() – Mean of the groups

print 'mean =>\n', g.mean(), '\n'

# prints
mean =>
                    salary
yearID teamID lgID        
1985   ATL    NL    673045
       BAL    AL    514166

8.3. median() – Median of the groups

print 'median =>\n', g.median(), '\n'

# prints
median =>
                    salary
yearID teamID lgID        
1985   ATL    NL    620833
       BAL    AL    480000

8.4. std() – Standard Deviation of groups

print 'std =>\n', g.std(), '\n'

# prints
std =>
                           salary
yearID teamID lgID               
1985   ATL    NL    397035.091114
       BAL    AL     98308.612710

8.5. sem() – Standard Error of Mean of the groups

print 'sem =>\n', g.sem(), '\n'

# prints
sem =>
                    playerID        salary
yearID teamID lgID                        
1985   ATL    NL         NaN  84648.165844
       BAL    AL         NaN  56758.504012

8.6. size() – Group sizes

print 'size =>\n', g.size(), '\n'

# prints
size =>
yearID  teamID  lgID
1985    ATL     NL      22
        BAL     AL       3

8.7. var() – Variance of the groups

print 'var =>\n', g.var(), '\n'

# prints
var =>
                          salary
yearID teamID lgID              
1985   ATL    NL    157636863576
       BAL    AL      9664583333

Summary

In this article, we learnt about grouping data using pandas DataFrame. Compared with SQL, this is a more powerful functionality since you can use user-defined functions, and also have control over each step of the grouping. Aggregates can be computed using readily available functions such as sum(), mean, std(), etc.