“Always and never are two words you should always remember never to use. ”
― Wendell Johnson
Contents
1. Introduction
After covering ways of creating a DataFrame and working with it, we now concentrate on extracting data from the DataFrame. You may also be interested in our tutorials on a related data structure – Series; part 1 and part 2.
Getting Started
Import these libraries: pandas, matplotlib for plotting and numpy.
import pandas as pd import matplotlib.pyplot as plt import numpy as np import random
If you are working with a Jupyter (or iPython) notebook and want to show graphs inline, use this definition.
%matplotlib inline
2. Using Python Array Slice Syntax
The standard python array slice syntax x[apos:bpos:incr] can be used to extract a range of rows from a DataFrame. However, the pandas documentation recommends the use of more efficient row access methods presented below.
2.1. First Few Rows
You can also use the head() method for this operation.
x = pd.read_csv('big-data/Salaries.csv') print x[:4] # prints yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333
2.2. Last Few Rows
This can also be done using the tail() method.
print x[-4:] # prints yearID teamID lgID playerID salary 26424 2016 WSN NL taylomi02 524000 26425 2016 WSN NL treinbl01 524900 26426 2016 WSN NL werthja01 21733615 26427 2016 WSN NL zimmery01 14000000
2.3. Range of Rows
print x[2:4] # prints yearID teamID lgID playerID salary 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333
2.4. Reversing Rows
Example for reversing rows using a negative step:
print 'reverse order =>\n', x[:-6:-1] # prints reverse order => yearID teamID lgID playerID salary 26427 2016 WSN NL zimmery01 14000000 26426 2016 WSN NL werthja01 21733615 26425 2016 WSN NL treinbl01 524900 26424 2016 WSN NL taylomi02 524000 26423 2016 WSN NL strasst01 10400000
3. Using Accelerated Selectors
Pandas recommends the use of these selectors for extracting rows in production code, rather than the python array slice syntax shown above.
3.1. ix[label] or ix[pos]
Select row by index label. Or by integer position if label search fails.
See examples below under iloc[pos] and loc[label].
3.2. iloc[pos]
Select row by integer position. Raises IndexError if position not valid (position not between 0 and length – 1).
A Single Position
y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')), 'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')), 'three': pd.Series(list('ABCDE'), index=list('abcde'))}) print 'data =>\n', y, '\n' print 'single pos =>\n', y.iloc[1], '\n' # prints data => one three two a 88 A 81 b 39 B 36 c 43 C 78 d 86 D 21 e 11 E 26 single pos => one 39 three B two 36 Name: b, dtype: object
Positions Array
Array of discrete positions. Note that position indices may be repeated and in any order.
print 'pos array =>\n', y.iloc[[3, 2, 3]], '\n' # prints pos array => one three two d 86 D 21 c 43 C 78 d 86 D 21
Positions Slice
print 'pos slice =>\n', y.iloc[1:3], '\n' # prints pos slice => one three two b 39 B 36 c 43 C 78
Boolean Array
print 'boolean array =>\n', y.iloc[[True, False, True]], '\n' # prints boolean array => one three two a 88 A 81 c 43 C 78
3.3. loc[label]
Select row by index label. Raises KeyError when the label is not found. Following arguments are supported.
A Single Label:
y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')), 'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')), 'three': pd.Series(list('ABCDE'), index=list('abcde'))}) print 'data =>\n', y, '\n' print 'single label =>\n', y.loc['a'], '\n' # prints data => one three two a 46 A 25 b 9 B 80 c 91 C 70 d 95 D 42 e 83 E 16 single label => one 46 three A two 25 Name: a, dtype: object
An Array of Labels
print 'label array =>\n', y.loc[['b', 'c']], '\n' # prints label array => one three two b 9 B 80 c 91 C 70
A Label Slice
print 'label slice =>\n', y.loc['b': 'd'], '\n' # prints label slice => one three two b 9 B 80 c 91 C 70 d 95 D 42
A Boolean Array
print 'boolean array =>\n', y.loc[[True, False, True, True]], '\n' # prints boolean array => one three two a 46 A 25 c 91 C 70 d 95 D 42
4. Fast Scalar Access
Pandas recommends that for fast access of scalar values, you can use at() and iat().
4.1. at[rowlabel, colname]
With at(), you need to specify the row label for the first argument and the column name for the second.
y = pd.DataFrame({'one': pd.Series(random.sample(xrange(100), 5), index=list('abcde')), 'two': pd.Series(random.sample(xrange(100), 5), index=list('abcde')), 'three': pd.Series(list('ABCDE'), index=list('abcde'))}) print y, '\n' print 'at =>', y.at['b', 'one'] # prints one three two a 97 A 18 b 17 B 15 c 33 C 60 d 11 D 88 e 0 E 92 at => 17
4.2. iat[rowpos, colpos]
When using iat(), both arguments need to be integer positions of the row and column respectively.
print 'iat =>', y.iat[2, 1] # prints iat => C
5. Take a Few Rows
Both Series and DataFrame support a method take() which accepts a list of indices and returns rows at those indices.
x = pd.read_csv('big-data/Salaries.csv') print 'head =>\n', x.head(), '\n' print x.take([0, 3, 4]) # prints head => yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 1 1985 ATL NL bedrost01 550000 2 1985 ATL NL benedbr01 545000 3 1985 ATL NL campri01 633333 4 1985 ATL NL ceronri01 625000 yearID teamID lgID playerID salary 0 1985 ATL NL barkele01 870000 3 1985 ATL NL campri01 633333 4 1985 ATL NL ceronri01 625000
You can also take() some columns by specifying the column indices along with the argument axis=1 to indicate a column-wise operation.
print 'take cols =>\n', x.take([0, 1, 3], axis=1).head() # prints yearID teamID playerID 0 1985 ATL barkele01 1 1985 ATL bedrost01 2 1985 ATL benedbr01 3 1985 ATL campri01 4 1985 ATL ceronri01
The take() method works with integer positions and not index labels as shown here when you change the index as follows:
x = x.set_index('yearID') print 'index changed =>\n', x.head(), '\n' print 'took rows =>\n', x.take([0, 3, 4]) # prints index changed => teamID lgID playerID salary yearID 1985 ATL NL barkele01 870000 1985 ATL NL bedrost01 550000 1985 ATL NL benedbr01 545000 1985 ATL NL campri01 633333 1985 ATL NL ceronri01 625000 took rows => teamID lgID playerID salary yearID 1985 ATL NL barkele01 870000 1985 ATL NL campri01 633333 1985 ATL NL ceronri01 625000
Summary
This article presented some ways of selecting data from a DataFrame. We covered the python array slice syntax, and the attribute extractors ix, iloc and loc. For fast single value extraction, we use at and iat attributes.
i really appreciate you .
it is very useful for me