Python Pandas Tutorial – DataFrame

Learn the basics of creating a DataFrame in this tutorial series on pandas.

pandas DataFrame

1. Introduction

This is the next part of the pandas tutorial. In a previous article, we covered the pandas Series class. Today we are getting started with the main pandas data structure, the DataFrame.

In this article, we cover some ways of initializing a DataFrame, including loading it with data from external files such as CSV and XLSX files.

2. What is a DataFrame?

A DataFrame is a table-like structure with rows and columns. Think of it as a database table or a spreadsheet table. Each column has a name (or a column header) and each row has a row label (also called an index). The following is an example of a DataFrame. The column headers are “one”, “two” and “three”. The row labels are “a”, “b” through “e”.

   one  three  two
a   21     44   56
b   96     76   74
c   20     72   52
d   97     53   95
e   77      3   91

The DataFrame class provides many methods and operations which make it easy to work with tabular data.

3. Creating a DataFrame

There are many ways to create a DataFrame. Here we show some of the common ones.

Before starting, you need to import pandas into your program.

import pandas as pd

3.1. Using a Dict of Lists

A DataFrame is a collection of rows and columns. To create one, you can specify a dict with each column label mapped to the column data. A simple example is shown below.

print pd.DataFrame({'joe': [2, 4, 6, 8, 10],
                   'jack': [1, 3, 5, 7, 9],
                   'jim': [0, 3, 6, 9, 12]})

# prints
   jack  jim  joe
0     1    0    2
1     3    3    4
2     5    6    6
3     7    9    8
4     9   12   10

The lists used for the column data must be all of the same size or else an exception is thrown.

x = pd.DataFrame({'joe': [2, 4, 6, 8, 10],
                   'jack': [1, 3, 5, 7],
                   'jim': [0, 3, 6, 9, 12]})

# prints
ValueError: arrays must all be same length

3.2. Specifying an Index

By default, an index of integers from 0 through N (the number of rows) is automatically generated. To specify your own index, use the index parameter.

x = pd.DataFrame({'joe': [2, 4, 6, 8, 10],
                   'jack': [1, 3, 5, 7, 9],
                   'jim': [0, 3, 6, 9, 12]},
                index=[x for x in 'abcde'])
print x

# prints
   jack  jim  joe
a     1    0    2
b     3    3    4
c     5    6    6
d     7    9    8
e     9   12   10

3.3. From a Dict of Series

Create a DataFrame from a set of Series objects as follows. It is similar to using a dict of lists except that the Series objects need not all be the same size. Also all the dict values need not be Series objects.

print pd.DataFrame({'joe': pd.Series([2, 4, 6, 8, 10]),
                    'jack': pd.Series([1, 3, 5, 7]),
                    'jim': [0, 3, 6, 9, 12]})

# prints
   jack  jim  joe
0   1.0    0    2
1   3.0    3    4
2   5.0    6    6
3   7.0    9    8
4   NaN   12   10

Of course you can specify the index for each Series, in which case missing elements are assigned the value NaN as shown below.

print pd.DataFrame({'joe': pd.Series([2, 4, 6, 8, 10], index=[x for x in 'abcde']),
                   'jack': pd.Series([1, 3, 5, 7], index=[x for x in 'bcde']),
                   'jim': pd.Series([0, 3, 6, 9, 12], index=[x for x in 'cdefg'])})

# prints
   jack   jim   joe
a   NaN   NaN   2.0
b   1.0   NaN   4.0
c   3.0   0.0   6.0
d   5.0   3.0   8.0
e   7.0   6.0  10.0
f   NaN   9.0   NaN
g   NaN  12.0   NaN

When the columns parameter is specified, the key-value mapping from the dict is ignored if the key is not present in the column array. In the following definition, key-value mapping for “jack” is ignored since it is not present in the columns array.

print pd.DataFrame({'joe': pd.Series([2, 4, 6, 8, 10]),
                    'jack': pd.Series([1, 3, 5, 7]),
                    'jim': [0, 3, 6, 9, 12]},
                  columns=['joe', 'jim', 'jane'])

# prints
   joe  jim jane
0    2    0  NaN
1    4    3  NaN
2    6    6  NaN
3    8    9  NaN
4   10   12  NaN

3.4. With a List of Dicts

Another way to initialize a DataFrame is to use a list of dicts which have same or similar keys. The keys become the column labels.

x = [{'jane': 34, 'jim': 99, 'joe': 29, 'jack': 19},
     {'jane': 78, 'jim': 32, 'joe': 28, 'jack': 60},
     {'jane': 23, 'jim': 8, 'joe': 20, 'jack': 37}] 
print pd.DataFrame(x)

# prints
   jack  jane  jim  joe
0    19    34   99   29
1    60    78   32   28
2    37    23    8   20

3.5. From a Series

A single-columned DataFrame can be created with a Series as the argument. The name of the Series is the column label.

print pd.DataFrame(pd.Series([x for x in 'abcde'], name='joe'))

# prints
  joe
0   a
1   b
2   c
3   d
4   e

4. Loading Data from Files

In addition to all the methods detailed above for creating a DataFrame, you can load data from a file (and other sources) to create a DataFrame. Let us look at some simple possibilities now.

4.1. CSV Data

Probably the most commonly used method in data analysis, you can load CSV data from a file using read_csv(). A DataFrame object is returned, the first 10 rows of which are shown.

x = pd.read_csv('big-data/StateNames.csv')
print x[:10]

# prints
   Id      Name  Year Gender State  Count
0   1      Mary  1910      F    AK     14
1   2     Annie  1910      F    AK     12
2   3      Anna  1910      F    AK     10
3   4  Margaret  1910      F    AK      8
4   5     Helen  1910      F    AK      7
5   6     Elsie  1910      F    AK      6
6   7      Lucy  1910      F    AK      6
7   8   Dorothy  1910      F    AK      5
8   9      Mary  1911      F    AK     12
9  10  Margaret  1911      F    AK      7

The method read_csv() also accepts a URL from which the data can be directly loaded. Here is a slightly more complex CSV loaded directly from a URL.

food = pd.read_csv('https://data.sfgov.org/api/views/jjew-r69b/rows.csv?accessType=DOWNLOAD')
print food[:5]

# prints (not everything shown)
   DayOrder DayOfWeekStr starttime endtime      permit  \
0         5       Friday      12PM     1PM  13MFF-0102   
1         3    Wednesday      12PM     1PM  14MFF-0109   
2         2      Tuesday       9AM    10AM  14MFF-0001   
3         2      Tuesday       7AM     4PM  13MFF-0123   
4         1       Monday       1PM     2PM  13MFF-0102   

                  PermitLocation  \
0   Assessors Block 4103/Lot023A   
1                    432 CLAY ST   
2                 1150 FOLSOM ST   
3                    235 15TH ST   
4  Assessors Block 4281a/Lot012A   

                                        locationdesc  \
0       Setup at 650 Pennsylvania Ave. 1:10pm-1:20pm   
1  Set-up at 400 Clay St: 7:00am-7:10am, 10:00am-...   
2                                      9:30am-9:35am   
...

For the full scoop on the powerful read_csv() method, refer to the documentation.

4.2. JSON Data

Quite simple enough to parse JSON data too using read_json().

x = pd.read_json('big-data/Salaries.json')
print x[:5]

# prints
  leagueId   playerId  salary teamId  year
0       NL  barkele01  870000    ATL  1985
1       NL  bedrost01  550000    ATL  1985
2       NL  benedbr01  545000    ATL  1985
3       NL   campri01  633333    ATL  1985
4       NL  ceronri01  625000    ATL  1985

4.3. Excel 2003 and 2007+

Load data from Excel 2003 (XLS) files or Excel 2007+ (XLSX) files using read_excel().

x = pd.read_excel('big-data/Sales.xls')
print x[:5]

# prints
   Row ID  Order ID Order Date Order Priority  Order Quantity       Sales  \
0       1         3 2010-10-13            Low               6    261.5400   
1      49       293 2012-10-01           High              49  10123.0200   
2      50       293 2012-10-01           High              27    244.5700   
3      80       483 2011-07-10           High              30   4965.7595   
4      85       515 2010-08-28  Not Specified              19    394.2700   

   Discount       Ship Mode     Profit  Unit Price    ...      \
0      0.04     Regular Air  -213.2500       38.94    ...       
1      0.07  Delivery Truck   457.8100      208.16    ...       
...

If you have an XLS (or XLSX) file with multiple sheets, you can load the workbook first and then extract the required sheet from it as follows:

xls = pd.ExcelFile('big-data/Sales.xls')
y = pd.read_excel(xls, sheetname='Orders')

Summary

The DataFrame class is the main workhorse of the pandas toolkit. In this article, we learnt some basic methods of creating and populating a DataFrame object. These included using lists, series and dicts to create a DataFrame, as well as loading data from external CSV, JSON and Excel files.

Leave a Reply

Your email address will not be published. Required fields are marked *