“Always and never are two words you should always remember never to use. ”
― Wendell Johnson
Contents
1. Introduction
After covering ways of creating a DataFrame and working with it, we now concentrate on extracting data from the DataFrame. You may also be interested in our tutorials on a related data structure – Series; part 1 and part 2.
Getting Started
Import these libraries: pandas, matplotlib for plotting and numpy.
import pandas as pd import matplotlib.pyplot as plt import numpy as np import random
If you are working with a Jupyter (or iPython) notebook and want to show graphs inline, use this definition.
%matplotlib inline
2. Using Python Array Slice Syntax
The standard python array slice syntax x[apos:bpos:incr] can be used to extract a range of rows from a DataFrame. However, the pandas documentation recommends the use of more efficient row access methods presented below.
2.1. First Few Rows
You can also use the head() method for this operation.
x = pd.read_csv('big-data/Salaries.csv')
print x[:4]
# prints
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
2.2. Last Few Rows
This can also be done using the tail() method.
print x[-4:]
# prints
yearID teamID lgID playerID salary
26424 2016 WSN NL taylomi02 524000
26425 2016 WSN NL treinbl01 524900
26426 2016 WSN NL werthja01 21733615
26427 2016 WSN NL zimmery01 14000000
2.3. Range of Rows
print x[2:4] # prints yearID teamID lgID playerID salary 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333
2.4. Reversing Rows
Example for reversing rows using a negative step:
print 'reverse order =>\n', x[:-6:-1]
# prints
reverse order =>
yearID teamID lgID playerID salary
26427 2016 WSN NL zimmery01 14000000
26426 2016 WSN NL werthja01 21733615
26425 2016 WSN NL treinbl01 524900
26424 2016 WSN NL taylomi02 524000
26423 2016 WSN NL strasst01 10400000
3. Using Accelerated Selectors
Pandas recommends the use of these selectors for extracting rows in production code, rather than the python array slice syntax shown above.
3.1. ix[label] or ix[pos]
Select row by index label. Or by integer position if label search fails.
See examples below under iloc[pos] and loc[label].
3.2. iloc[pos]
Select row by integer position. Raises IndexError if position not valid (position not between 0 and length – 1).
A Single Position
y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
'three': pd.Series(list('ABCDE'), index=list('abcde'))})
print 'data =>\n', y, '\n'
print 'single pos =>\n', y.iloc[1], '\n'
# prints
data =>
one three two
a 88 A 81
b 39 B 36
c 43 C 78
d 86 D 21
e 11 E 26
single pos =>
one 39
three B
two 36
Name: b, dtype: object
Positions Array
Array of discrete positions. Note that position indices may be repeated and in any order.
print 'pos array =>\n', y.iloc[[3, 2, 3]], '\n' # prints pos array => one three two d 86 D 21 c 43 C 78 d 86 D 21
Positions Slice
print 'pos slice =>\n', y.iloc[1:3], '\n' # prints pos slice => one three two b 39 B 36 c 43 C 78
Boolean Array
print 'boolean array =>\n', y.iloc[[True, False, True]], '\n' # prints boolean array => one three two a 88 A 81 c 43 C 78
3.3. loc[label]
Select row by index label. Raises KeyError when the label is not found. Following arguments are supported.
A Single Label:
y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
'three': pd.Series(list('ABCDE'), index=list('abcde'))})
print 'data =>\n', y, '\n'
print 'single label =>\n', y.loc['a'], '\n'
# prints
data =>
one three two
a 46 A 25
b 9 B 80
c 91 C 70
d 95 D 42
e 83 E 16
single label =>
one 46
three A
two 25
Name: a, dtype: object
An Array of Labels
print 'label array =>\n', y.loc[['b', 'c']], '\n' # prints label array => one three two b 9 B 80 c 91 C 70
A Label Slice
print 'label slice =>\n', y.loc['b': 'd'], '\n' # prints label slice => one three two b 9 B 80 c 91 C 70 d 95 D 42
A Boolean Array
print 'boolean array =>\n', y.loc[[True, False, True, True]], '\n' # prints boolean array => one three two a 46 A 25 c 91 C 70 d 95 D 42
4. Fast Scalar Access
Pandas recommends that for fast access of scalar values, you can use at() and iat().
4.1. at[rowlabel, colname]
With at(), you need to specify the row label for the first argument and the column name for the second.
y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')),
'three': pd.Series(list('ABCDE'), index=list('abcde'))})
print y, '\n'
print 'at =>', y.at['b', 'one']
# prints
one three two
a 97 A 18
b 17 B 15
c 33 C 60
d 11 D 88
e 0 E 92
at => 17
4.2. iat[rowpos, colpos]
When using iat(), both arguments need to be integer positions of the row and column respectively.
print 'iat =>', y.iat[2, 1] # prints iat => C
5. Take a Few Rows
Both Series and DataFrame support a method take() which accepts a list of indices and returns rows at those indices.
x = pd.read_csv('big-data/Salaries.csv')
print 'head =>\n', x.head(), '\n'
print x.take([0, 3, 4])
# prints
head =>
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
You can also take() some columns by specifying the column indices along with the argument axis=1 to indicate a column-wise operation.
print 'take cols =>\n', x.take([0, 1, 3], axis=1).head() # prints yearID teamID playerID 0 1985 ATL barkele01 1 1985 ATL bedrost01 2 1985 ATL benedbr01 3 1985 ATL campri01 4 1985 ATL ceronri01
The take() method works with integer positions and not index labels as shown here when you change the index as follows:
x = x.set_index('yearID')
print 'index changed =>\n', x.head(), '\n'
print 'took rows =>\n', x.take([0, 3, 4])
# prints
index changed =>
teamID lgID playerID salary
yearID
1985 ATL NL barkele01 870000
1985 ATL NL bedrost01 550000
1985 ATL NL benedbr01 545000
1985 ATL NL campri01 633333
1985 ATL NL ceronri01 625000
took rows =>
teamID lgID playerID salary
yearID
1985 ATL NL barkele01 870000
1985 ATL NL campri01 633333
1985 ATL NL ceronri01 625000
Summary
This article presented some ways of selecting data from a DataFrame. We covered the python array slice syntax, and the attribute extractors ix, iloc and loc. For fast single value extraction, we use at and iat attributes.
i really appreciate you .
it is very useful for me