Contents
1. Introduction
This is the next part of the pandas tutorial. In a previous article, we covered the pandas Series class. Today we are getting started with the main pandas data structure, the DataFrame.
In this article, we cover some ways of initializing a DataFrame, including loading it with data from external files such as CSV and XLSX files.
2. What is a DataFrame?
A DataFrame is a table-like structure with rows and columns. Think of it as a database table or a spreadsheet table. Each column has a name (or a column header) and each row has a row label (also called an index). The following is an example of a DataFrame. The column headers are “one”, “two” and “three”. The row labels are “a”, “b” through “e”.
one three two a 21 44 56 b 96 76 74 c 20 72 52 d 97 53 95 e 77 3 91
The DataFrame class provides many methods and operations which make it easy to work with tabular data.
3. Creating a DataFrame
There are many ways to create a DataFrame. Here we show some of the common ones.
Before starting, you need to import pandas into your program.
import pandas as pd
3.1. Using a Dict of Lists
A DataFrame is a collection of rows and columns. To create one, you can specify a dict with each column label mapped to the column data. A simple example is shown below.
print pd.DataFrame({'joe': [2, 4, 6, 8, 10], 'jack': [1, 3, 5, 7, 9], 'jim': [0, 3, 6, 9, 12]}) # prints jack jim joe 0 1 0 2 1 3 3 4 2 5 6 6 3 7 9 8 4 9 12 10
The lists used for the column data must be all of the same size or else an exception is thrown.
x = pd.DataFrame({'joe': [2, 4, 6, 8, 10], 'jack': [1, 3, 5, 7], 'jim': [0, 3, 6, 9, 12]}) # prints ValueError: arrays must all be same length
3.2. Specifying an Index
By default, an index of integers from 0 through N (the number of rows) is automatically generated. To specify your own index, use the index
parameter.
x = pd.DataFrame({'joe': [2, 4, 6, 8, 10], 'jack': [1, 3, 5, 7, 9], 'jim': [0, 3, 6, 9, 12]}, index=[x for x in 'abcde']) print x # prints jack jim joe a 1 0 2 b 3 3 4 c 5 6 6 d 7 9 8 e 9 12 10
3.3. From a Dict of Series
Create a DataFrame from a set of Series objects as follows. It is similar to using a dict of lists except that the Series objects need not all be the same size. Also all the dict values need not be Series objects.
print pd.DataFrame({'joe': pd.Series([2, 4, 6, 8, 10]), 'jack': pd.Series([1, 3, 5, 7]), 'jim': [0, 3, 6, 9, 12]}) # prints jack jim joe 0 1.0 0 2 1 3.0 3 4 2 5.0 6 6 3 7.0 9 8 4 NaN 12 10
Of course you can specify the index for each Series, in which case missing elements are assigned the value NaN as shown below.
print pd.DataFrame({'joe': pd.Series([2, 4, 6, 8, 10], index=[x for x in 'abcde']), 'jack': pd.Series([1, 3, 5, 7], index=[x for x in 'bcde']), 'jim': pd.Series([0, 3, 6, 9, 12], index=[x for x in 'cdefg'])}) # prints jack jim joe a NaN NaN 2.0 b 1.0 NaN 4.0 c 3.0 0.0 6.0 d 5.0 3.0 8.0 e 7.0 6.0 10.0 f NaN 9.0 NaN g NaN 12.0 NaN
When the columns
parameter is specified, the key-value mapping from the dict is ignored if the key is not present in the column array. In the following definition, key-value mapping for “jack
” is ignored since it is not present in the columns
array.
print pd.DataFrame({'joe': pd.Series([2, 4, 6, 8, 10]), 'jack': pd.Series([1, 3, 5, 7]), 'jim': [0, 3, 6, 9, 12]}, columns=['joe', 'jim', 'jane']) # prints joe jim jane 0 2 0 NaN 1 4 3 NaN 2 6 6 NaN 3 8 9 NaN 4 10 12 NaN
3.4. With a List of Dicts
Another way to initialize a DataFrame is to use a list of dicts which have same or similar keys. The keys become the column labels.
x = [{'jane': 34, 'jim': 99, 'joe': 29, 'jack': 19}, {'jane': 78, 'jim': 32, 'joe': 28, 'jack': 60}, {'jane': 23, 'jim': 8, 'joe': 20, 'jack': 37}] print pd.DataFrame(x) # prints jack jane jim joe 0 19 34 99 29 1 60 78 32 28 2 37 23 8 20
3.5. From a Series
A single-columned DataFrame can be created with a Series as the argument. The name of the Series is the column label.
print pd.DataFrame(pd.Series([x for x in 'abcde'], name='joe')) # prints joe 0 a 1 b 2 c 3 d 4 e
4. Loading Data from Files
In addition to all the methods detailed above for creating a DataFrame, you can load data from a file (and other sources) to create a DataFrame. Let us look at some simple possibilities now.
4.1. CSV Data
Probably the most commonly used method in data analysis, you can load CSV data from a file using read_csv(). A DataFrame object is returned, the first 10 rows of which are shown.
x = pd.read_csv('big-data/StateNames.csv') print x[:10] # prints Id Name Year Gender State Count 0 1 Mary 1910 F AK 14 1 2 Annie 1910 F AK 12 2 3 Anna 1910 F AK 10 3 4 Margaret 1910 F AK 8 4 5 Helen 1910 F AK 7 5 6 Elsie 1910 F AK 6 6 7 Lucy 1910 F AK 6 7 8 Dorothy 1910 F AK 5 8 9 Mary 1911 F AK 12 9 10 Margaret 1911 F AK 7
The method read_csv() also accepts a URL from which the data can be directly loaded. Here is a slightly more complex CSV loaded directly from a URL.
food = pd.read_csv('https://data.sfgov.org/api/views/jjew-r69b/rows.csv?accessType=DOWNLOAD') print food[:5] # prints (not everything shown) DayOrder DayOfWeekStr starttime endtime permit \ 0 5 Friday 12PM 1PM 13MFF-0102 1 3 Wednesday 12PM 1PM 14MFF-0109 2 2 Tuesday 9AM 10AM 14MFF-0001 3 2 Tuesday 7AM 4PM 13MFF-0123 4 1 Monday 1PM 2PM 13MFF-0102 PermitLocation \ 0 Assessors Block 4103/Lot023A 1 432 CLAY ST 2 1150 FOLSOM ST 3 235 15TH ST 4 Assessors Block 4281a/Lot012A locationdesc \ 0 Setup at 650 Pennsylvania Ave. 1:10pm-1:20pm 1 Set-up at 400 Clay St: 7:00am-7:10am, 10:00am-... 2 9:30am-9:35am ...
For the full scoop on the powerful read_csv() method, refer to the documentation.
4.2. JSON Data
Quite simple enough to parse JSON data too using read_json().
x = pd.read_json('big-data/Salaries.json') print x[:5] # prints leagueId playerId salary teamId year 0 NL barkele01 870000 ATL 1985 1 NL bedrost01 550000 ATL 1985 2 NL benedbr01 545000 ATL 1985 3 NL campri01 633333 ATL 1985 4 NL ceronri01 625000 ATL 1985
4.3. Excel 2003 and 2007+
Load data from Excel 2003 (XLS) files or Excel 2007+ (XLSX) files using read_excel().
x = pd.read_excel('big-data/Sales.xls') print x[:5] # prints Row ID Order ID Order Date Order Priority Order Quantity Sales \ 0 1 3 2010-10-13 Low 6 261.5400 1 49 293 2012-10-01 High 49 10123.0200 2 50 293 2012-10-01 High 27 244.5700 3 80 483 2011-07-10 High 30 4965.7595 4 85 515 2010-08-28 Not Specified 19 394.2700 Discount Ship Mode Profit Unit Price ... \ 0 0.04 Regular Air -213.2500 38.94 ... 1 0.07 Delivery Truck 457.8100 208.16 ... ...
If you have an XLS (or XLSX) file with multiple sheets, you can load the workbook first and then extract the required sheet from it as follows:
xls = pd.ExcelFile('big-data/Sales.xls') y = pd.read_excel(xls, sheetname='Orders')
Summary
The DataFrame class is the main workhorse of the pandas toolkit. In this article, we learnt some basic methods of creating and populating a DataFrame object. These included using lists, series and dicts to create a DataFrame, as well as loading data from external CSV, JSON and Excel files.