“Don’t waste your time with explanations: people only hear what they want to hear.”
― Paulo Coelho
Contents
1. Introduction
Let us learn about the “grouping-by” operation in pandas. While similar to the SQL “group by”, the pandas version is much more powerful since you can use user-defined functions at various points including splitting, applying and combining results.
2. What is Grouping By?
The “grouping-by” is a tool which is used to aggregate and summarize groups within a dataset. It helps in identifying patterns within data. It is a very important operation not only in pandas but in data analysis in general. SQL databases provide a similar “GROUP BY” clause which performs a similar functionality.
3. Getting Started
We need the following imports of python modules.
import pandas as pd import matplotlib.pyplot as plt import numpy as np
To prepare and view charts, we need the following in a Jupyter notebook.
%matplotlib inline
In some of the examples in the following sections, we compare the operation with an SQL database. The SQL database we are using is sqlite into which we load data from the same CSV file.
4. A Single Column
Here is a simple example using a single column. We load data into a DataFrame and create a GroupBy object using the groupingBy() method. The group sum is then printed using the method sum(). The sum represents total salary for each year (which is the grouping column).
x = pd.read_csv('big-data/Salaries.csv') print x.head(), '\n' g = x.groupby('yearID') print 'Group sum =>\n', g.sum().head() # prints yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333 4 1985 ATL NL ceronri01 625000 Group sum => salary yearID 1985 261964696 1986 307854518 1987 272575375 1988 300452424 1989 359995711
Here is the same operation in SQL:
select yearID, sum(salary) from salary group by yearID; yearID sum(salary) ---------- ----------- 1985 261964696 1986 307854518 1987 272575375 1988 300452424 1989 359995711 1990 443881193 1991 613048418 ...
Pretty simple, right? Since the data appears to have more groups, let us try with multiple groups.
5. Multiple Columns
We now group the data using multiple columns and run the sum() operation. The sum is now the total salary for each (year, team, league) group.
print x.head(), '\n' g = x.groupby(['yearID', 'teamID', 'lgID']) print 'Group sum =>\n', g.sum().head() # prints yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333 4 1985 ATL NL ceronri01 625000 Group sum => salary yearID teamID lgID 1985 ATL NL 14807000 BAL AL 11560712 BOS AL 10897560 CAL AL 14427894 CHA AL 9846178
With SQL:
select yearID, teamID, lgID, sum(salary) from salary group by yearID, teamID, lgID limit 5; yearID teamID lgID sum(salary) ---------- ---------- ---------- ----------- 1985 ATL NL 14807000 1985 BAL AL 11560712 1985 BOS AL 10897560 1985 CAL AL 14427894 1985 CHA AL 9846178
6. Select a Single Group
The method get_group() accepts a tuple of values and returns a DataFrame of the group members.
print x.head(), '\n' g = x.groupby(['yearID', 'teamID', 'lgID']) print 'Group sum =>\n', g.sum().head(), '\n' print 'get_group =>\n', g.get_group((1985, 'ATL', 'NL')), '\n' # prints yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333 4 1985 ATL NL ceronri01 625000 Group sum => salary yearID teamID lgID 1985 ATL NL 14807000 BAL AL 11560712 BOS AL 10897560 CAL AL 14427894 CHA AL 9846178 get_group => playerID salary 0 barkele01 870000 1 bedrost01 550000 2 benedbr01 545000 3 campri01 633333 4 ceronri01 625000 ...
Which is practically the same as querying the original DataFrame for the values:
print x[(x.yearID == 1985) & (x.teamID == 'ATL') & (x.lgID == 'NL')] # prints yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333 4 1985 ATL NL ceronri01 625000 ...
Which in SQL is a straightforward query:
select * from salary where yearID = '1985' and teamID = 'ATL' and lgID = 'NL'; yearID teamID lgID playerID salary ---------- ---------- ---------- ---------- ---------- 1985 ATL NL barkele01 870000 1985 ATL NL bedrost01 550000 1985 ATL NL benedbr01 545000 1985 ATL NL campri01 633333 1985 ATL NL ceronri01 625000 1985 ATL NL chambch01 800000 ...
7. Iterating Over Groups
Unlike with SQL, you can iterate over the groups once the GroupingBy object is computed. See example below for how to compute the group sum.
for name, grp in x.groupby(['yearID', 'teamID', 'lgID']): print name, '\n', grp # prints (1985, 'ATL', 'NL') yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333 4 1985 ATL NL ceronri01 625000 5 1985 ATL NL chambch01 800000 ... (1985, 'BAL', 'AL') yearID teamID lgID playerID salary 22 1985 BAL AL boddimi01 625000 23 1985 BAL AL dauerri01 480000 24 1985 BAL AL davisst02 437500 25 1985 BAL AL dempsri01 512500 ...
7.1. How to Compute a Group Aggregate
One example of using the group iteration is to compute a group aggregate. For example, here is how you can compute a sum of a field over the group.
Note that most common aggregate functions are already provided as methods of the group object, including the sum(). This example merely illustrates one way to apply an aggregate in case you want to do something special with the groups.
n = 0 for name, grp in g: sal = 0 for row in grp.itertuples(): sal += row.salary print name, '=>', sal n += 1 if n == 10: break # equivalent print g.sum().head(10) # prints (1985, 'ATL', 'NL') => 14807000 (1985, 'BAL', 'AL') => 11560712 (1985, 'BOS', 'AL') => 10897560 (1985, 'CAL', 'AL') => 14427894 (1985, 'CHA', 'AL') => 9846178 (1985, 'CHN', 'NL') => 12702917 (1985, 'CIN', 'NL') => 8359917 (1985, 'CLE', 'AL') => 6551666 (1985, 'DET', 'AL') => 10348143 (1985, 'HOU', 'NL') => 9993051 salary yearID teamID lgID 1985 ATL NL 14807000 BAL AL 11560712 BOS AL 10897560 CAL AL 14427894 CHA AL 9846178 CHN NL 12702917 CIN NL 8359917 CLE AL 6551666 DET AL 10348143 HOU NL 9993051
8. Aggregate Functions
The GroupBy object supports several aggregate functions. Some of the common ones are listed below.
8.1. sum() – Sum groups
print 'data =>\n', y, '\n' print 'sum =>\n', g.sum(), '\n' # prints data => yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333 4 1985 ATL NL ceronri01 625000 5 1985 ATL NL chambch01 800000 6 1985 ATL NL dedmoje01 150000 7 1985 ATL NL forstte01 483333 8 1985 ATL NL garbege01 772000 9 1985 ATL NL harpete01 250000 10 1985 ATL NL hornebo01 1500000 11 1985 ATL NL hubbagl01 455000 12 1985 ATL NL mahleri01 407500 13 1985 ATL NL mcmurcr01 275000 14 1985 ATL NL mumphje01 775000 15 1985 ATL NL murphda05 1625000 16 1985 ATL NL oberkke01 616667 17 1985 ATL NL perezpa01 450000 18 1985 ATL NL perryge01 120000 19 1985 ATL NL ramirra01 750000 20 1985 ATL NL suttebr01 1354167 21 1985 ATL NL washicl01 800000 22 1985 BAL AL boddimi01 625000 23 1985 BAL AL dauerri01 480000 24 1985 BAL AL davisst02 437500 sum => salary yearID teamID lgID 1985 ATL NL 14807000 BAL AL 1542500
8.2. mean() – Mean of the groups
print 'mean =>\n', g.mean(), '\n' # prints mean => salary yearID teamID lgID 1985 ATL NL 673045 BAL AL 514166
8.3. median() – Median of the groups
print 'median =>\n', g.median(), '\n' # prints median => salary yearID teamID lgID 1985 ATL NL 620833 BAL AL 480000
8.4. std() – Standard Deviation of groups
print 'std =>\n', g.std(), '\n' # prints std => salary yearID teamID lgID 1985 ATL NL 397035.091114 BAL AL 98308.612710
8.5. sem() – Standard Error of Mean of the groups
print 'sem =>\n', g.sem(), '\n' # prints sem => playerID salary yearID teamID lgID 1985 ATL NL NaN 84648.165844 BAL AL NaN 56758.504012
8.6. size() – Group sizes
print 'size =>\n', g.size(), '\n' # prints size => yearID teamID lgID 1985 ATL NL 22 BAL AL 3
8.7. var() – Variance of the groups
print 'var =>\n', g.var(), '\n' # prints var => salary yearID teamID lgID 1985 ATL NL 157636863576 BAL AL 9664583333
Summary
In this article, we learnt about grouping data using pandas DataFrame. Compared with SQL, this is a more powerful functionality since you can use user-defined functions, and also have control over each step of the grouping. Aggregates can be computed using readily available functions such as sum(), mean, std(), etc.
Taking one of your groups for instance…
(1985, ‘ATL’, ‘NL’)
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
5 1985 ATL NL chambch01 800000
In this case how would one go about using pandas to put all players on a team in a list so the output would be:
yearID teamID lgID playerID
1985 ATL NL [barkele01, bedrost01, benedbr01, campri01, ceronri01, chambch01]