Pandas Tutorial – SQL-like Data Selection

Did you know that you can perform SQL-like selections with a pandas DataFrame? Learn how!

“Always keep your words soft and sweet, just in case you have to eat them.”
― Andy Rooney

1. Introduction

In this article, we present SQL-like ways of selecting data from a pandas DataFrame. The SELECT clause is very familiar to database programmers for accessing data within an SQL database. The DataFrame provides similar functionality when working with datasets, but is far more powerful since it supports using predicate functions with a simple syntax.

You may want to review these articles for an introduction to the pandas DataFrame:

  1. Part 1 – DataFrame creation.
  2. Part 2 – Basics.
  3. Part 3 – Accessing data within a DataFrame.

2. Boolean Operators

These are the boolean operators. We draw special attention to them here since they are different from normal python boolean operators and are applicable only with DataFrame and Series selection.

These operators must be grouped using parentheses.

Operator Meaning
& And
| Or
~ Not

3. Simple Expression

The following example shows an example of a simple selection using a boolean expression.

x = pd.read_csv('big-data/Salaries.csv')
print 'head =>\n', x.head(), '\n'
a = x['salary']
print 'simple expr =>\n', a[a < 60000]

# prints
head =>
   yearID teamID lgID   playerID  salary
0    1985    ATL   NL  barkele01  870000
1    1985    ATL   NL  bedrost01  550000
2    1985    ATL   NL  benedbr01  545000
3    1985    ATL   NL   campri01  633333
4    1985    ATL   NL  ceronri01  625000 

simple expr =>
6179         0
6194     10900
6659     50000
9679     50000
12007        0
Name: salary, dtype: int64

The above showed an example of boolean selection with a Series (by selecting a column “salary“). Here is how you can do the same with a DataFrame.

print 'data frame =>\n', x[x.salary < 60000]

# prints
data frame =>
       yearID teamID lgID   playerID  salary
6179     1993    NYA   AL  jamesdi01       0
6194     1993    NYA   AL  silveda01   10900
6659     1994    CHA   AL   carych01   50000
9679     1997    FLO   NL   penaal01   50000
12007    1999    PIT   NL  martija02       0

This is similar to the following SQL statement:

select * from salary where salary < 60000;

yearID      teamID      lgID        playerID    salary
----------  ----------  ----------  ----------  ----------
1993        NYA         AL          jamesdi01   0
1993        NYA         AL          silveda01   10900
1994        CHA         AL          carych01    50000
1997        FLO         NL          penaal01    50000
1999        PIT         NL          martija02   0

4. Using Boolean Series

Boolean expression selection works as follows: Evaluating a boolean condition results in a Series of boolean values which is used to select rows with True values. Check out this example.

print 'boolean series =>\n', (x.salary < 60000).head(), '\n'
print 'value counts =>\n', (x.salary < 60000).value_counts(), '\n'

# prints
boolean series =>
0    False
1    False
2    False
3    False
4    False
Name: salary, dtype: bool 

value counts =>
False    26423
True         5
Name: salary, dtype: int64

5. Compound Expressions

Combine boolean conditions by enclosing each expression in parentheses and using one of the operators listed in the above table.

print x[(x.salary < 60000) & (x.teamID == 'NYA')]

# prints
      yearID teamID lgID   playerID  salary
6179    1993    NYA   AL  jamesdi01       0
6194    1993    NYA   AL  silveda01   10900

Compare with SQL:

select * from salary where salary < 60000 and teamID == "NYA";

yearID      teamID      lgID        playerID    salary
----------  ----------  ----------  ----------  ----------
1993        NYA         AL          jamesdi01   0
1993        NYA         AL          silveda01   10900

6. Using a Function or Lambda

Another method is to use a function to select the required data.

print x[lambda a : a.salary < 60000]

# prints
       yearID teamID lgID   playerID  salary
6179     1993    NYA   AL  jamesdi01       0
6194     1993    NYA   AL  silveda01   10900
6659     1994    CHA   AL   carych01   50000
9679     1997    FLO   NL   penaal01   50000
12007    1999    PIT   NL  martija02       0

This is something you cannot do directly and quickly in SQL (when complex conditions are evaluated in the lambda). Maybe stored procedures, but that is whole another beast from SQL.

Again, combining boolean expressions in normal pythonic way wont work. It results in an error. (BTW, the error message could have been more meaningful).

print x[lambda a : a.salary > 500000 and a.salary < 600000]

# Results in:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Even in a lambda, you need to combine boolean expressions using the operators listed above.

print x[lambda a : (a.salary > 60000) & (a.salary < 62500)]

# prints
     yearID teamID lgID   playerID  salary
888    1986    LAN   NL  powelde01   62000
971    1986    MON   NL  sebrabo01   61500

7. Boolean Negation

Negate a boolean expression using ~.

print x[~((x.salary < 50000) & (x.teamID == 'NYA'))]

# prints
       yearID teamID lgID   playerID    salary
0        1985    ATL   NL  barkele01    870000
1        1985    ATL   NL  bedrost01    550000
2        1985    ATL   NL  benedbr01    545000
3        1985    ATL   NL   campri01    633333
...

8. Membership Check

Want to check if elements of a Series contains these values? Use a membership check using isin().

8.1. Series Example

print 'a sample =>\n', x['salary'].head(10), '\n'
print 'isin =>\n', x['salary'].isin([550000, 772000, 250000, 900]).head(10), '\n'

# prints
a sample =>
0    870000
1    550000
2    545000
3    633333
4    625000
5    800000
6    150000
7    483333
8    772000
9    250000
Name: salary, dtype: int64 

isin =>
0    False
1     True
2    False
3    False
4    False
5    False
6    False
7    False
8     True
9     True
Name: salary, dtype: bool

8.2. DataFrame Example

Use the boolean Series returned from isin() to select from the DataFrame.

print 'membership check =>\n', x[x['salary'].isin([550000, 772000, 250000, 900])].head()

# prints
membership check =>
     yearID teamID lgID   playerID  salary
1      1985    ATL   NL  bedrost01  550000
8      1985    ATL   NL  garbege01  772000
9      1985    ATL   NL  harpete01  250000
96     1985    CAL   AL   zahnge01  550000
121    1985    CHN   NL  davisjo02  550000

Here is the SQL equivalent:

select * from salary where salary in (550000, 772000, 250000, 900) limit 5;

yearID      teamID      lgID        playerID    salary
----------  ----------  ----------  ----------  ----------
1985        ATL         NL          bedrost01   550000
1985        ATL         NL          garbege01   772000
1985        ATL         NL          harpete01   250000
1985        CAL         AL          zahnge01    550000
1985        CHN         NL          davisjo02   550000

8.3. Using on Index Column

The same method can be used on an index to select based on label membership too.

print x[x.index.isin([0, 100, 1000, 10000, 100000])]

# prints
       yearID teamID lgID   playerID  salary
0        1985    ATL   NL  barkele01  870000
100      1985    CHA   AL  burnsbr01  727500
1000     1986    NYA   AL  pasquda01   70000
10000    1997    PHI   NL  jordake01  175000

(No simple way to do the above in SQL since it requires the use of a ROW_NUMBER concept which not all SQL databases support.)

The method isin() can be directly used on a DataFrame by specifying the values to search for in a dict.

print x.isin({'yearID': [1985],
             'teamID': ['ATL','CHA'],
             'lgID': ['NL']})

# prints
...
16      True   True   True    False  False
17      True   True   True    False  False
18      True   True   True    False  False
19      True   True   True    False  False
20      True   True   True    False  False
21      True   True   True    False  False
22      True  False  False    False  False
23      True  False  False    False  False
24      True  False  False    False  False
...

9. Compare Using where()

While using a boolean expression to select data returns only the data which matches, using the same expression in a where() returns a DataFrame which is the same size and shape as the original. Only the rows which match contain the original data, the rest being NaNs.

print 'data frame =>\n', x[x.salary < 60000]
print 'where =>\n', x.where(x.salary < 60000)[6175:6185]

# prints
data frame =>
       yearID teamID lgID   playerID  salary
6179     1993    NYA   AL  jamesdi01       0
6194     1993    NYA   AL  silveda01   10900
6659     1994    CHA   AL   carych01   50000
9679     1997    FLO   NL   penaal01   50000
12007    1999    PIT   NL  martija02       0 

where =>
      yearID teamID lgID   playerID  salary
6175     NaN    NaN  NaN        NaN     NaN
6176     NaN    NaN  NaN        NaN     NaN
6177     NaN    NaN  NaN        NaN     NaN
6178     NaN    NaN  NaN        NaN     NaN
6179  1993.0    NYA   AL  jamesdi01     0.0
6180     NaN    NaN  NaN        NaN     NaN
6181     NaN    NaN  NaN        NaN     NaN
6182     NaN    NaN  NaN        NaN     NaN
6183     NaN    NaN  NaN        NaN     NaN
6184     NaN    NaN  NaN        NaN     NaN

Summary

This article presented some ways in which a DataFrame supports constructs similar to SQL for data selection. However the DataFrame data selection is more powerful especially since you can use arbitrary functions as predicates.

Leave a Reply

Your email address will not be published. Required fields are marked *