
Getting Started

Pandas is a Python library that you can install using pip

pip install pandas

or using conda

conda install pandas

Most users should install the latest version, though for enterprise data projects it is often important to pin a specific pandas version to ensure API stability.

By convention, pandas is imported with the alias pd

import pandas as pd
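If you do pin a version, it helps to confirm which one is actually installed. A minimal sketch (the variable name installed_version is only illustrative, and the printed value depends on your environment):

```python
import pandas as pd

# The installed pandas version as a string; the value depends on your environment.
installed_version = pd.__version__
print(installed_version)
```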

Key Ideas

There are two primary objects: DataFrame and Series. Most commonly, you will work with dataframes, which are similar to Excel worksheets or database tables. You can think of a DataFrame as a rectangular data structure made up of a number of columns, where each column is a Series.

Every dataframe or series has an Index, which holds the row labels and behaves like a special column. If you create a dataframe without specifying an index, pandas will create a numeric index equivalent to range(0, len(df)). Series in the same dataframe share the same index.

Let's create three lists that contain information about cats

names = ['Lucy', 'Bella','Lucy','Nala']
ages = [2,2,3,5]
colors = ['gray', 'tabby', 'white', 'black']

Now we create a dataframe where each of the lists becomes a column, or in pandas terminology, a series.

df = pd.DataFrame({'name': names, 'age': ages, 'color':colors})
df
    name  age  color
0   Lucy    2   gray
1  Bella    2  tabby
2   Lucy    3  white
3   Nala    5  black
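To make the two key ideas concrete, here is a small sketch that rebuilds the same dataframe and checks that the automatic index is a RangeIndex and that a column is a Series sharing that index. The helper variable names are my own and not part of pandas.

```python
import pandas as pd

names = ['Lucy', 'Bella', 'Lucy', 'Nala']
ages = [2, 2, 3, 5]
colors = ['gray', 'tabby', 'white', 'black']
df = pd.DataFrame({'name': names, 'age': ages, 'color': colors})

# No index was specified, so pandas generated a RangeIndex.
index_is_range = isinstance(df.index, pd.RangeIndex)
index_labels = list(df.index)

# Each column is a Series that carries the same index as the dataframe.
name_column = df['name']
column_is_series = isinstance(name_column, pd.Series)
shares_index = name_column.index.equals(df.index)

print(index_is_range, index_labels, column_is_series, shares_index)
```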

Values

The 3 individual data columns (name, age and color) are combined into a two-dimensional array, available as values.

df.values
array([['Lucy', 2, 'gray'],
       ['Bella', 2, 'tabby'],
       ['Lucy', 3, 'white'],
       ['Nala', 5, 'black']], dtype=object)

This is a numpy array with 4 rows and 3 columns

type(df.values)
numpy.ndarray

You can see the shape of the dataframe with df.shape, which matches the shape of the numpy array returned by df.values. Many operations on a pandas dataframe are carried out on underlying numpy arrays.

df.shape, df.values.shape
((4, 3), (4, 3))
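The pandas documentation recommends DataFrame.to_numpy() over .values for getting the underlying array. The sketch below, using the same cat data, shows that mixed column types produce an object array, while a purely numeric selection keeps an integer dtype. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

# Mixed column types force a generic object array.
all_values = df.to_numpy()

# A purely numeric selection keeps a numeric dtype.
age_values = df[['age']].to_numpy()

print(all_values.shape, all_values.dtype)
print(age_values.shape, age_values.dtype)
```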

Columns

You can see the list of columns in the dataframe with df.columns. For our dataframe, the result contains the names we specified as the keys of the dictionary we passed in when we created the dataframe.

df.columns
Index(['name', 'age', 'color'], dtype='object')

Selecting columns

A column can be accessed by passing the column name into the bracket operator e.g. using df[col]. So to get the color column, use df['color']

df['color']
0     gray
1    tabby
2    white
3    black
Name: color, dtype: object

And to get the age column use df['age']. Notice that both the color and age columns include the index values [0,1,2,3], because all columns in a dataframe share the same index.

df['age']
0    2
1    2
2    3
3    5
Name: age, dtype: int64

When you select a single column from a DataFrame you get a series. If you want a new dataframe with just one column, use a column list with one item, e.g. df[['age']] instead of df['age']

df[['age']]
   age
0    2
1    2
2    3
3    5
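The difference between the two forms is easiest to see by comparing types and shapes. A small sketch with the same data (as_series and as_frame are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

as_series = df['age']    # single brackets: a one-dimensional Series
as_frame = df[['age']]   # double brackets: a dataframe with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```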

Selecting columns - as a property

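When a dataframe is created, its columns are exposed as properties of that dataframe object, so it's possible to use the dot . operator. This only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method.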

df.age
0    2
1    2
2    3
3    5
Name: age, dtype: int64
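Dot access has limits worth seeing once. The sketch below uses a made-up dataframe with a column called count, which clashes with the DataFrame.count method, and a column called fur color, which contains a space. Both are hypothetical columns chosen only to illustrate the point, and bracket access works for both.

```python
import pandas as pd

# Hypothetical columns chosen to show where dot access falls short.
cats = pd.DataFrame({'name': ['Lucy', 'Bella'],
                     'count': [1, 2],
                     'fur color': ['gray', 'tabby']})

# cats.count resolves to the DataFrame.count method, not the column.
dot_gives_method = callable(cats.count)

# Bracket access always returns the column.
count_column = cats['count']
fur_column = cats['fur color']   # cannot be written with dot access at all

print(dot_gives_method, list(count_column), list(fur_column))
```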

Selecting multiple columns

To select multiple columns pass in a list with the names of the columns.

df[['age', 'color']]
   age  color
0    2   gray
1    2  tabby
2    3  white
3    5  black

Notice the double brackets [[]]? The outer brackets are used for selections and the inner brackets are the list of columns. The code above is equivalent to this

age_and_color = ['age', 'color']
df[age_and_color]
   age  color
0    2   gray
1    2  tabby
2    3  white
3    5  black

When you select multiple columns from a dataframe, you get a new dataframe with just those columns.

Adding columns

To add a new column use the same bracket access operator [] as for selecting a column

df['weight'] = [1.0,0.5,2.0,2.0]
df
    name  age  color  weight
0   Lucy    2   gray     1.0
1  Bella    2  tabby     0.5
2   Lucy    3  white     2.0
3   Nala    5  black     2.0
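The right-hand side of the assignment does not have to be a list. It can also be computed from existing columns. A minimal sketch, where age_in_months is a hypothetical column added only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

# Add a column from a list, as above.
df['weight'] = [1.0, 0.5, 2.0, 2.0]

# Add a column computed from an existing one (hypothetical example column).
df['age_in_months'] = df['age'] * 12

print(list(df.columns))
print(list(df['age_in_months']))
```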

Dropping columns

To drop columns, pass the names of the columns to be dropped in a list

df.drop(columns=['weight'])
    name  age  color
0   Lucy    2   gray
1  Bella    2  tabby
2   Lucy    3  white
3   Nala    5  black

Indexes

An absolute beginner to pandas may not need to know about indexes, especially since pandas creates indexes automatically for you. However, pandas considers indexes to be important for its internal operations, and as your use of pandas increases you will eventually bump into index issues.

Index

Since we did not specify an index, pandas automatically created one for us. Automatic indexes are always a RangeIndex running from 0 to len(df) - 1.

df.index
RangeIndex(start=0, stop=4, step=1)

set_index

You can change the column used for the index using df.set_index, passing in the column(s) to use. Here we tell the dataframe to promote the name column to be the index.

df.set_index('name')
       age  color  weight
name
Lucy     2   gray     1.0
Bella    2  tabby     0.5
Lucy     3  white     2.0
Nala     5  black     2.0
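A quick way to see what set_index did is to inspect the result. The sketch below (with illustrative variable names) shows that name has moved out of the columns and into the index, that the original dataframe is untouched, and that index labels do not have to be unique, since Lucy appears twice.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black'],
                   'weight': [1.0, 0.5, 2.0, 2.0]})

by_name = df.set_index('name')

index_name = by_name.index.name             # the promoted column's name
remaining_columns = list(by_name.columns)   # name is no longer a column
original_columns = list(df.columns)         # df itself is unchanged
labels_unique = by_name.index.is_unique     # 'Lucy' appears twice

print(index_name, remaining_columns, original_columns, labels_unique)
```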

You can set multiple columns as the index. Here we tell the dataframe to promote name and age to be used as the index.

df1 = df.set_index(['name', 'age'])
df1
           color  weight
name  age
Lucy  2     gray     1.0
Bella 2    tabby     0.5
Lucy  3    white     2.0
Nala  5    black     2.0
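When more than one column is promoted, the result is a MultiIndex whose labels are tuples, one entry per level. A small sketch to inspect it (the helper variable names are my own):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black'],
                   'weight': [1.0, 0.5, 2.0, 2.0]})

df1 = df.set_index(['name', 'age'])

is_multi = isinstance(df1.index, pd.MultiIndex)
level_names = list(df1.index.names)   # one name per index level
labels = df1.index.tolist()           # one (name, age) tuple per row

print(is_multi, level_names)
print(labels)
```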

reset_index

The opposite of set_index, which promotes a column to the index, is reset_index, which demotes the index to a regular column. Here, for example, we set name as the index, and then reset.

df2 = df.set_index('name')
df2
       age  color  weight
name
Lucy     2   gray     1.0
Bella    2  tabby     0.5
Lucy     3  white     2.0
Nala     5  black     2.0

Resetting the index demotes name to being a regular column

df2.reset_index()
    name  age  color  weight
0   Lucy    2   gray     1.0
1  Bella    2  tabby     0.5
2   Lucy    3  white     2.0
3   Nala    5  black     2.0
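Setting and then resetting the index is a round trip. The sketch below checks that the result has the same columns, in the same order, as the dataframe we started with. The name restored is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black'],
                   'weight': [1.0, 0.5, 2.0, 2.0]})

df2 = df.set_index('name')      # name becomes the index
restored = df2.reset_index()    # name goes back to being the first column

same_columns = list(restored.columns) == list(df.columns)
same_names = list(restored['name']) == list(df['name'])
fresh_index = isinstance(restored.index, pd.RangeIndex)

print(same_columns, same_names, fresh_index)
```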

reset_index(drop=True)

Since reset_index turns an index into a regular column, if the index was an auto-generated range index, then that would become a column, which in most cases is not what you want. So, almost all the time you reset the index, you will want to drop the old index instead of keeping it as a column. To do so, use reset_index(drop=True). This happens to be very common in my Python code, especially after grouping operations where pandas creates MultiIndexes for me which I generally do not use.

df.reset_index(drop=True)
    name  age  color  weight
0   Lucy    2   gray     1.0
1  Bella    2  tabby     0.5
2   Lucy    3  white     2.0
3   Nala    5  black     2.0
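To see why drop=True matters, compare the columns with and without it. In this sketch (kept and dropped are illustrative names), a plain reset_index on the default index adds a column called index, while drop=True discards the old index.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black'],
                   'weight': [1.0, 0.5, 2.0, 2.0]})

# Without drop, the old range index is kept as a column named 'index'.
kept = df.reset_index()

# With drop=True, the old index is discarded.
dropped = df.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
```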

Pandas Copying

Most pandas operations, like set_index, return a new dataframe rather than modifying the original. This is a pretty standard design pattern and it allows pandas to avoid issues that arise from mutating dataframes. In many cases returning a copy costs little extra, because in-place operations frequently do the same work internally.

As an example of a copy being returned from a pandas operation, let's drop the weight column from a copy of df

df3 = df.copy()
df3['weight'] = [1.0,0.5,2.0,2.0]
df3.drop(columns=['weight'])
    name  age  color
0   Lucy    2   gray
1  Bella    2  tabby
2   Lucy    3  white
3   Nala    5  black

Inspecting df3 shows that the weight column is still there.

df3
    name  age  color  weight
0   Lucy    2   gray     1.0
1  Bella    2  tabby     0.5
2   Lucy    3  white     2.0
3   Nala    5  black     2.0

To change the pandas behaviour of returning copies, several pandas functions allow you to specify inplace=True. Now if we drop the column, df3 itself is modified.

df3.drop(columns=['weight'], inplace=True)
df3
    name  age  color
0   Lucy    2   gray
1  Bella    2  tabby
2   Lucy    3  white
3   Nala    5  black
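An alternative to inplace=True is to keep the returned copy, either under a new name or by reassigning the same one. The sketch below shows both patterns side by side. The variable names are illustrative. Note that with inplace=True the call returns None.

```python
import pandas as pd

df3 = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                    'age': [2, 2, 3, 5],
                    'color': ['gray', 'tabby', 'white', 'black'],
                    'weight': [1.0, 0.5, 2.0, 2.0]})

# Pattern 1: keep the returned copy; df3 is left untouched.
without_weight = df3.drop(columns=['weight'])
columns_after_copy = list(df3.columns)

# Pattern 2: inplace=True modifies df3 itself and returns None.
result = df3.drop(columns=['weight'], inplace=True)
columns_after_inplace = list(df3.columns)

print(list(without_weight.columns))
print(columns_after_copy)
print(result, columns_after_inplace)
```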

Reading Data

pandas can read rectangular data from almost anything. Here is a list, or rather a dataframe, containing all the pandas read_* functions. The exact list depends on the installed pandas version.

pd.DataFrame(data=[f for f in dir(pd) if f.startswith('read_')],
             columns=['function'])
          function
0   read_clipboard
1         read_csv
2       read_excel
3     read_feather
4         read_fwf
5         read_gbq
6         read_hdf
7        read_html
8        read_json
9         read_orc
10    read_parquet
11     read_pickle
12        read_sas
13       read_spss
14        read_sql
15  read_sql_query
16  read_sql_table
17      read_stata
18      read_table
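As a quick taste of one of these readers, here is a minimal sketch that uses read_csv on an in-memory string, so no file is needed. The CSV text and the name cats are made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV content held in memory instead of a file on disk.
csv_text = "name,age,color\nLucy,2,gray\nBella,2,tabby\n"

# read_csv accepts file paths as well as file-like objects such as StringIO.
cats = pd.read_csv(io.StringIO(csv_text))

print(cats.shape)
print(list(cats.columns))
```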

This post is part of a series. You might also be interested in

Learning Pandas Part 2: Reading csvs