Pandas Tutorial – Selecting Rows From a DataFrame

Learn the various ways of selecting data from a DataFrame.

pandas DataFrame selection

“Always and never are two words you should always remember never to use. ”
― Wendell Johnson

1. Introduction

After covering ways of creating a DataFrame and working with it, we now concentrate on extracting data from the DataFrame. You may also be interested in our tutorials on a related data structure – Series; part 1 and part 2.

Getting Started

Import these libraries: pandas, matplotlib for plotting and numpy.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random

If you are working with a Jupyter (or iPython) notebook and want to show graphs inline, use this definition.

%matplotlib inline

2. Using Python Array Slice Syntax

The standard python array slice syntax x[apos:bpos:incr] can be used to extract a range of rows from a DataFrame. However, the pandas documentation recommends the use of more efficient row access methods presented below.

2.1. First Few Rows

You can also use the head() method for this operation.

x = pd.read_csv('big-data/Salaries.csv')
print x[:4]

# prints
   yearID teamID lgID   playerID  salary
0    1985    ATL   NL  barkele01  870000
1    1985    ATL   NL  bedrost01  550000
2    1985    ATL   NL  benedbr01  545000
3    1985    ATL   NL   campri01  633333

2.2. Last Few Rows

This can also be done using the tail() method.

print x[-4:]

# prints
       yearID teamID lgID   playerID    salary
26424    2016    WSN   NL  taylomi02    524000
26425    2016    WSN   NL  treinbl01    524900
26426    2016    WSN   NL  werthja01  21733615
26427    2016    WSN   NL  zimmery01  14000000

2.3. Range of Rows

print x[2:4]

# prints
   yearID teamID lgID   playerID  salary
2    1985    ATL   NL  benedbr01  545000
3    1985    ATL   NL   campri01  633333

2.4. Reversing Rows

Example for reversing rows using a negative step:

print 'reverse order =>\n', x[:-6:-1]

# prints
reverse order =>
       yearID teamID lgID   playerID    salary
26427    2016    WSN   NL  zimmery01  14000000
26426    2016    WSN   NL  werthja01  21733615
26425    2016    WSN   NL  treinbl01    524900
26424    2016    WSN   NL  taylomi02    524000
26423    2016    WSN   NL  strasst01  10400000

3. Using Accelerated Selectors

Pandas recommends the use of these selectors for extracting rows in production code, rather than the python array slice syntax shown above.

3.1. ix[label] or ix[pos]

Select row by index label. Or by integer position if label search fails.

See examples below under iloc[pos] and loc[label].

3.2. iloc[pos]

Select row by integer position. Raises IndexError if position not valid (position not between 0 and length – 1).

A Single Position
y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
                 'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
                 'three': pd.Series(list('ABCDE'), index=list('abcde'))})
print 'data =>\n', y, '\n'
print 'single pos =>\n', y.iloc[1], '\n'

# prints
data =>
   one three  two
a   88     A   81
b   39     B   36
c   43     C   78
d   86     D   21
e   11     E   26 

single pos =>
one      39
three     B
two      36
Name: b, dtype: object
Positions Array

Array of discrete positions. Note that position indices may be repeated and in any order.

print 'pos array =>\n', y.iloc[[3, 2, 3]], '\n'

# prints
pos array =>
   one three  two
d   86     D   21
c   43     C   78
d   86     D   21
Positions Slice
print 'pos slice =>\n', y.iloc[1:3], '\n'

# prints
pos slice =>
   one three  two
b   39     B   36
c   43     C   78
Boolean Array
print 'boolean array =>\n', y.iloc[[True, False, True]], '\n'

# prints
boolean array =>
   one three  two
a   88     A   81
c   43     C   78

3.3. loc[label]

Select row by index label. Raises KeyError when the label is not found. Following arguments are supported.

A Single Label:
y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
                 'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
                 'three': pd.Series(list('ABCDE'), index=list('abcde'))})
print 'data =>\n', y, '\n'
print 'single label =>\n', y.loc['a'], '\n'

# prints
data =>
   one three  two
a   46     A   25
b    9     B   80
c   91     C   70
d   95     D   42
e   83     E   16 

single label =>
one      46
three     A
two      25
Name: a, dtype: object
An Array of Labels
print 'label array =>\n', y.loc[['b', 'c']], '\n'

# prints
label array =>
   one three  two
b    9     B   80
c   91     C   70
A Label Slice
print 'label slice =>\n', y.loc['b': 'd'], '\n'

# prints
label slice =>
   one three  two
b    9     B   80
c   91     C   70
d   95     D   42
A Boolean Array
print 'boolean array =>\n', y.loc[[True, False, True, True]], '\n'

# prints
boolean array =>
   one three  two
a   46     A   25
c   91     C   70
d   95     D   42

4. Fast Scalar Access

Pandas recommends that for fast access of scalar values, you can use at() and iat().

4.1. at[rowlabel, colname]

With at(), you need to specify the row label for the first argument and the column name for the second.

y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
                 'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
                 'three': pd.Series(list('ABCDE'), index=list('abcde'))})
print y, '\n'
print 'at =>', y.at['b', 'one']

# prints
   one three  two
a   97     A   18
b   17     B   15
c   33     C   60
d   11     D   88
e    0     E   92 

at => 17

4.2. iat[rowpos, colpos]

When using iat(), both arguments need to be integer positions of the row and column respectively.

print 'iat =>', y.iat[2, 1]

# prints
iat => C

5. Take a Few Rows

Both Series and DataFrame support a method take() which accepts a list of indices and returns rows at those indices.

x = pd.read_csv('big-data/Salaries.csv')
print 'head =>\n', x.head(), '\n'
print x.take([0, 3, 4])

# prints
head =>
   yearID teamID lgID   playerID  salary
0    1985    ATL   NL  barkele01  870000
1    1985    ATL   NL  bedrost01  550000
2    1985    ATL   NL  benedbr01  545000
3    1985    ATL   NL   campri01  633333
4    1985    ATL   NL  ceronri01  625000 

   yearID teamID lgID   playerID  salary
0    1985    ATL   NL  barkele01  870000
3    1985    ATL   NL   campri01  633333
4    1985    ATL   NL  ceronri01  625000

You can also take() some columns by specifying the column indices along with the argument axis=1 to indicate a column-wise operation.

print 'take cols =>\n', x.take([0, 1, 3], axis=1).head()

# prints
   yearID teamID   playerID
0    1985    ATL  barkele01
1    1985    ATL  bedrost01
2    1985    ATL  benedbr01
3    1985    ATL   campri01
4    1985    ATL  ceronri01

The take() method works with integer positions and not index labels as shown here when you change the index as follows:

x = x.set_index('yearID')
print 'index changed =>\n', x.head(), '\n'
print 'took rows =>\n', x.take([0, 3, 4])

# prints
index changed =>
       teamID lgID   playerID  salary
yearID                               
1985      ATL   NL  barkele01  870000
1985      ATL   NL  bedrost01  550000
1985      ATL   NL  benedbr01  545000
1985      ATL   NL   campri01  633333
1985      ATL   NL  ceronri01  625000 

took rows =>
       teamID lgID   playerID  salary
yearID                               
1985      ATL   NL  barkele01  870000
1985      ATL   NL   campri01  633333
1985      ATL   NL  ceronri01  625000

Summary

This article presented some ways of selecting data from a DataFrame. We covered the python array slice syntax, and the attribute extractors ix, iloc and loc. For fast single value extraction, we use at and iat attributes.

One thought on “Pandas Tutorial – Selecting Rows From a DataFrame”

Leave a Reply

Your email address will not be published. Required fields are marked *