“Always keep your words soft and sweet, just in case you have to eat them.”
― Andy Rooney
Contents
1. Introduction
In this article, we present SQL-like ways of selecting data from a pandas DataFrame. The SELECT clause is very familiar to database programmers for accessing data within an SQL database. The DataFrame provides similar functionality when working with datasets, but is far more powerful since it supports using predicate functions with a simple syntax.
You may want to review these articles for an introduction to the pandas DataFrame:
2. Boolean Operators
These are the boolean operators. We draw special attention to them here since they are different from normal python boolean operators and are applicable only with DataFrame and Series selection.
These operators must be grouped using parentheses.
Operator | Meaning |
& | And |
| | Or |
~ | Not |
3. Simple Expression
The following example shows an example of a simple selection using a boolean expression.
x = pd.read_csv('big-data/Salaries.csv') print 'head =>\n', x.head(), '\n' a = x['salary'] print 'simple expr =>\n', a[a < 60000] # prints head => yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333 4 1985 ATL NL ceronri01 625000 simple expr => 6179 0 6194 10900 6659 50000 9679 50000 12007 0 Name: salary, dtype: int64
The above showed an example of boolean selection with a Series (by selecting a column “salary
“). Here is how you can do the same with a DataFrame.
print 'data frame =>\n', x[x.salary < 60000] # prints data frame => yearID teamID lgID playerID salary 6179 1993 NYA AL jamesdi01 0 6194 1993 NYA AL silveda01 10900 6659 1994 CHA AL carych01 50000 9679 1997 FLO NL penaal01 50000 12007 1999 PIT NL martija02 0
This is similar to the following SQL statement:
select * from salary where salary < 60000; yearID teamID lgID playerID salary ---------- ---------- ---------- ---------- ---------- 1993 NYA AL jamesdi01 0 1993 NYA AL silveda01 10900 1994 CHA AL carych01 50000 1997 FLO NL penaal01 50000 1999 PIT NL martija02 0
4. Using Boolean Series
Boolean expression selection works as follows: Evaluating a boolean condition results in a Series of boolean values which is used to select rows with True values. Check out this example.
print 'boolean series =>\n', (x.salary < 60000).head(), '\n' print 'value counts =>\n', (x.salary < 60000).value_counts(), '\n' # prints boolean series => 0 False 1 False 2 False 3 False 4 False Name: salary, dtype: bool value counts => False 26423 True 5 Name: salary, dtype: int64
5. Compound Expressions
Combine boolean conditions by enclosing each expression in parentheses and using one of the operators listed in the above table.
print x[(x.salary < 60000) & (x.teamID == 'NYA')] # prints yearID teamID lgID playerID salary 6179 1993 NYA AL jamesdi01 0 6194 1993 NYA AL silveda01 10900
Compare with SQL:
select * from salary where salary < 60000 and teamID == "NYA"; yearID teamID lgID playerID salary ---------- ---------- ---------- ---------- ---------- 1993 NYA AL jamesdi01 0 1993 NYA AL silveda01 10900
6. Using a Function or Lambda
Another method is to use a function to select the required data.
print x[lambda a : a.salary < 60000] # prints yearID teamID lgID playerID salary 6179 1993 NYA AL jamesdi01 0 6194 1993 NYA AL silveda01 10900 6659 1994 CHA AL carych01 50000 9679 1997 FLO NL penaal01 50000 12007 1999 PIT NL martija02 0
This is something you cannot do directly and quickly in SQL (when complex conditions are evaluated in the lambda). Maybe stored procedures, but that is whole another beast from SQL.
Again, combining boolean expressions in normal pythonic way wont work. It results in an error. (BTW, the error message could have been more meaningful).
print x[lambda a : a.salary > 500000 and a.salary < 600000] # Results in: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Even in a lambda, you need to combine boolean expressions using the operators listed above.
print x[lambda a : (a.salary > 60000) & (a.salary < 62500)] # prints yearID teamID lgID playerID salary 888 1986 LAN NL powelde01 62000 971 1986 MON NL sebrabo01 61500
7. Boolean Negation
Negate a boolean expression using ~
.
print x[~((x.salary < 50000) & (x.teamID == 'NYA'))] # prints yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333 ...
8. Membership Check
Want to check if elements of a Series contains these values? Use a membership check using isin().
8.1. Series Example
print 'a sample =>\n', x['salary'].head(10), '\n' print 'isin =>\n', x['salary'].isin([550000, 772000, 250000, 900]).head(10), '\n' # prints a sample => 0 870000 1 550000 2 545000 3 633333 4 625000 5 800000 6 150000 7 483333 8 772000 9 250000 Name: salary, dtype: int64 isin => 0 False 1 True 2 False 3 False 4 False 5 False 6 False 7 False 8 True 9 True Name: salary, dtype: bool
8.2. DataFrame Example
Use the boolean Series returned from isin() to select from the DataFrame.
print 'membership check =>\n', x[x['salary'].isin([550000, 772000, 250000, 900])].head() # prints membership check => yearID teamID lgID playerID salary 1 1985 ATL NL bedrost01 550000 8 1985 ATL NL garbege01 772000 9 1985 ATL NL harpete01 250000 96 1985 CAL AL zahnge01 550000 121 1985 CHN NL davisjo02 550000
Here is the SQL equivalent:
select * from salary where salary in (550000, 772000, 250000, 900) limit 5; yearID teamID lgID playerID salary ---------- ---------- ---------- ---------- ---------- 1985 ATL NL bedrost01 550000 1985 ATL NL garbege01 772000 1985 ATL NL harpete01 250000 1985 CAL AL zahnge01 550000 1985 CHN NL davisjo02 550000
8.3. Using on Index Column
The same method can be used on an index to select based on label membership too.
print x[x.index.isin([0, 100, 1000, 10000, 100000])] # prints yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 100 1985 CHA AL burnsbr01 727500 1000 1986 NYA AL pasquda01 70000 10000 1997 PHI NL jordake01 175000
(No simple way to do the above in SQL since it requires the use of a ROW_NUMBER concept which not all SQL databases support.)
The method isin() can be directly used on a DataFrame by specifying the values to search for in a dict.
print x.isin({'yearID': [1985], 'teamID': ['ATL','CHA'], 'lgID': ['NL']}) # prints ... 16 True True True False False 17 True True True False False 18 True True True False False 19 True True True False False 20 True True True False False 21 True True True False False 22 True False False False False 23 True False False False False 24 True False False False False ...
9. Compare Using where()
While using a boolean expression to select data returns only the data which matches, using the same expression in a where() returns a DataFrame which is the same size and shape as the original. Only the rows which match contain the original data, the rest being NaNs.
print 'data frame =>\n', x[x.salary < 60000] print 'where =>\n', x.where(x.salary < 60000)[6175:6185] # prints data frame => yearID teamID lgID playerID salary 6179 1993 NYA AL jamesdi01 0 6194 1993 NYA AL silveda01 10900 6659 1994 CHA AL carych01 50000 9679 1997 FLO NL penaal01 50000 12007 1999 PIT NL martija02 0 where => yearID teamID lgID playerID salary 6175 NaN NaN NaN NaN NaN 6176 NaN NaN NaN NaN NaN 6177 NaN NaN NaN NaN NaN 6178 NaN NaN NaN NaN NaN 6179 1993.0 NYA AL jamesdi01 0.0 6180 NaN NaN NaN NaN NaN 6181 NaN NaN NaN NaN NaN 6182 NaN NaN NaN NaN NaN 6183 NaN NaN NaN NaN NaN 6184 NaN NaN NaN NaN NaN
Summary
This article presented some ways in which a DataFrame supports constructs similar to SQL for data selection. However the DataFrame data selection is more powerful especially since you can use arbitrary functions as predicates.