“Don’t waste your time with explanations: people only hear what they want to hear.”
― Paulo Coelho
Contents
1. Introduction
Let us learn about the “grouping-by” operation in pandas. While similar to the SQL “group by”, the pandas version is much more powerful since you can use user-defined functions at various points including splitting, applying and combining results.
2. What is Grouping By?
The “grouping-by” is a tool which is used to aggregate and summarize groups within a dataset. It helps in identifying patterns within data. It is a very important operation not only in pandas but in data analysis in general. SQL databases provide a similar “GROUP BY” clause which performs a similar functionality.
3. Getting Started
We need the following imports of python modules.
import pandas as pd import matplotlib.pyplot as plt import numpy as np
To prepare and view charts, we need the following in a Jupyter notebook.
%matplotlib inline
In some of the examples in the following sections, we compare the operation with an SQL database. The SQL database we are using is sqlite into which we load data from the same CSV file.
4. A Single Column
Here is a simple example using a single column. We load data into a DataFrame and create a GroupBy object using the groupingBy() method. The group sum is then printed using the method sum(). The sum represents total salary for each year (which is the grouping column).
x = pd.read_csv('big-data/Salaries.csv')
print x.head(), '\n'
g = x.groupby('yearID')
print 'Group sum =>\n', g.sum().head()
# prints
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
Group sum =>
salary
yearID
1985 261964696
1986 307854518
1987 272575375
1988 300452424
1989 359995711
Here is the same operation in SQL:
select yearID, sum(salary) from salary group by yearID; yearID sum(salary) ---------- ----------- 1985 261964696 1986 307854518 1987 272575375 1988 300452424 1989 359995711 1990 443881193 1991 613048418 ...
Pretty simple, right? Since the data appears to have more groups, let us try with multiple groups.
5. Multiple Columns
We now group the data using multiple columns and run the sum() operation. The sum is now the total salary for each (year, team, league) group.
print x.head(), '\n'
g = x.groupby(['yearID', 'teamID', 'lgID'])
print 'Group sum =>\n', g.sum().head()
# prints
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
Group sum =>
salary
yearID teamID lgID
1985 ATL NL 14807000
BAL AL 11560712
BOS AL 10897560
CAL AL 14427894
CHA AL 9846178
With SQL:
select yearID, teamID, lgID, sum(salary) from salary group by yearID, teamID, lgID limit 5; yearID teamID lgID sum(salary) ---------- ---------- ---------- ----------- 1985 ATL NL 14807000 1985 BAL AL 11560712 1985 BOS AL 10897560 1985 CAL AL 14427894 1985 CHA AL 9846178
6. Select a Single Group
The method get_group() accepts a tuple of values and returns a DataFrame of the group members.
print x.head(), '\n'
g = x.groupby(['yearID', 'teamID', 'lgID'])
print 'Group sum =>\n', g.sum().head(), '\n'
print 'get_group =>\n', g.get_group((1985, 'ATL', 'NL')), '\n'
# prints
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
Group sum =>
salary
yearID teamID lgID
1985 ATL NL 14807000
BAL AL 11560712
BOS AL 10897560
CAL AL 14427894
CHA AL 9846178
get_group =>
playerID salary
0 barkele01 870000
1 bedrost01 550000
2 benedbr01 545000
3 campri01 633333
4 ceronri01 625000
...
Which is practically the same as querying the original DataFrame for the values:
print x[(x.yearID == 1985) & (x.teamID == 'ATL') & (x.lgID == 'NL')]
# prints
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
...
Which in SQL is a straightforward query:
select * from salary where yearID = '1985' and teamID = 'ATL' and lgID = 'NL'; yearID teamID lgID playerID salary ---------- ---------- ---------- ---------- ---------- 1985 ATL NL barkele01 870000 1985 ATL NL bedrost01 550000 1985 ATL NL benedbr01 545000 1985 ATL NL campri01 633333 1985 ATL NL ceronri01 625000 1985 ATL NL chambch01 800000 ...
7. Iterating Over Groups
Unlike with SQL, you can iterate over the groups once the GroupingBy object is computed. See example below for how to compute the group sum.
for name, grp in x.groupby(['yearID', 'teamID', 'lgID']):
print name, '\n', grp
# prints
(1985, 'ATL', 'NL')
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
5 1985 ATL NL chambch01 800000
...
(1985, 'BAL', 'AL')
yearID teamID lgID playerID salary
22 1985 BAL AL boddimi01 625000
23 1985 BAL AL dauerri01 480000
24 1985 BAL AL davisst02 437500
25 1985 BAL AL dempsri01 512500
...
7.1. How to Compute a Group Aggregate
One example of using the group iteration is to compute a group aggregate. For example, here is how you can compute a sum of a field over the group.
Note that most common aggregate functions are already provided as methods of the group object, including the sum(). This example merely illustrates one way to apply an aggregate in case you want to do something special with the groups.
n = 0
for name, grp in g:
sal = 0
for row in grp.itertuples():
sal += row.salary
print name, '=>', sal
n += 1
if n == 10: break
# equivalent
print g.sum().head(10)
# prints
(1985, 'ATL', 'NL') => 14807000
(1985, 'BAL', 'AL') => 11560712
(1985, 'BOS', 'AL') => 10897560
(1985, 'CAL', 'AL') => 14427894
(1985, 'CHA', 'AL') => 9846178
(1985, 'CHN', 'NL') => 12702917
(1985, 'CIN', 'NL') => 8359917
(1985, 'CLE', 'AL') => 6551666
(1985, 'DET', 'AL') => 10348143
(1985, 'HOU', 'NL') => 9993051
salary
yearID teamID lgID
1985 ATL NL 14807000
BAL AL 11560712
BOS AL 10897560
CAL AL 14427894
CHA AL 9846178
CHN NL 12702917
CIN NL 8359917
CLE AL 6551666
DET AL 10348143
HOU NL 9993051
8. Aggregate Functions
The GroupBy object supports several aggregate functions. Some of the common ones are listed below.
8.1. sum() – Sum groups
print 'data =>\n', y, '\n'
print 'sum =>\n', g.sum(), '\n'
# prints
data =>
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
5 1985 ATL NL chambch01 800000
6 1985 ATL NL dedmoje01 150000
7 1985 ATL NL forstte01 483333
8 1985 ATL NL garbege01 772000
9 1985 ATL NL harpete01 250000
10 1985 ATL NL hornebo01 1500000
11 1985 ATL NL hubbagl01 455000
12 1985 ATL NL mahleri01 407500
13 1985 ATL NL mcmurcr01 275000
14 1985 ATL NL mumphje01 775000
15 1985 ATL NL murphda05 1625000
16 1985 ATL NL oberkke01 616667
17 1985 ATL NL perezpa01 450000
18 1985 ATL NL perryge01 120000
19 1985 ATL NL ramirra01 750000
20 1985 ATL NL suttebr01 1354167
21 1985 ATL NL washicl01 800000
22 1985 BAL AL boddimi01 625000
23 1985 BAL AL dauerri01 480000
24 1985 BAL AL davisst02 437500
sum =>
salary
yearID teamID lgID
1985 ATL NL 14807000
BAL AL 1542500
8.2. mean() – Mean of the groups
print 'mean =>\n', g.mean(), '\n'
# prints
mean =>
salary
yearID teamID lgID
1985 ATL NL 673045
BAL AL 514166
8.3. median() – Median of the groups
print 'median =>\n', g.median(), '\n'
# prints
median =>
salary
yearID teamID lgID
1985 ATL NL 620833
BAL AL 480000
8.4. std() – Standard Deviation of groups
print 'std =>\n', g.std(), '\n'
# prints
std =>
salary
yearID teamID lgID
1985 ATL NL 397035.091114
BAL AL 98308.612710
8.5. sem() – Standard Error of Mean of the groups
print 'sem =>\n', g.sem(), '\n'
# prints
sem =>
playerID salary
yearID teamID lgID
1985 ATL NL NaN 84648.165844
BAL AL NaN 56758.504012
8.6. size() – Group sizes
print 'size =>\n', g.size(), '\n'
# prints
size =>
yearID teamID lgID
1985 ATL NL 22
BAL AL 3
8.7. var() – Variance of the groups
print 'var =>\n', g.var(), '\n'
# prints
var =>
salary
yearID teamID lgID
1985 ATL NL 157636863576
BAL AL 9664583333
Summary
In this article, we learnt about grouping data using pandas DataFrame. Compared with SQL, this is a more powerful functionality since you can use user-defined functions, and also have control over each step of the grouping. Aggregates can be computed using readily available functions such as sum(), mean, std(), etc.
Taking one of your groups for instance…
(1985, ‘ATL’, ‘NL’)
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
5 1985 ATL NL chambch01 800000
In this case how would one go about using pandas to put all players on a team in a list so the output would be:
yearID teamID lgID playerID
1985 ATL NL [barkele01, bedrost01, benedbr01, campri01, ceronri01, chambch01]