Getting Started

pandas is a Python library that you can install using pip

pip install pandas

or using conda

conda install pandas

Most users should install the latest version, though for enterprise data projects it is often important to pin a specific pandas version to ensure API stability.

Once pandas is installed, import it under the conventional alias pd

import pandas as pd
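
If you do pin a version, it helps to confirm which one is actually installed. A minimal sketch using the library's __version__ attribute (the variable name installed_version is just for illustration):

```python
import pandas as pd

# The installed pandas version as a string; the exact value depends on your environment
installed_version = pd.__version__
print(installed_version)
```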

Key Ideas

There are two primary objects - DataFrame and Series. Most commonly, you will work with dataframes, which are similar to Excel worksheets or database tables. You can think of a DataFrame as a rectangular data structure made up of a number of columns. Each of the columns is a Series.

Every dataframe or series has an Index, which holds the row labels and behaves like a special column. If you create a dataframe without specifying an index, pandas creates a default numeric one, a RangeIndex equivalent to range(0, len(df)). All the series in a dataframe share the same index.

Let's create three lists that contain information about cats

names = ['Lucy', 'Bella','Lucy','Nala']
ages = [2,2,3,5]
colors = ['gray', 'tabby', 'white', 'black']

Now we create a dataframe where each of the lists becomes a column, or in pandas terminology - a series.

df = pd.DataFrame({'name': names, 'age': ages, 'color':colors})
df
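
To make the key ideas concrete, here is a small sketch that rebuilds the same cats data and checks that a column really is a Series sharing the dataframe's index. The names cats, name_column and shares_index are only for this illustration.

```python
import pandas as pd

cats = pd.DataFrame({
    'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
    'age': [2, 2, 3, 5],
    'color': ['gray', 'tabby', 'white', 'black'],
})

# Each column of a DataFrame is a Series
name_column = cats['name']
print(type(name_column))

# No index was specified, so pandas created a default RangeIndex
print(cats.index)

# The column carries the same index as the dataframe itself
shares_index = name_column.index.equals(cats.index)
print(shares_index)
```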

Values

The three data columns - name, age and color - are exposed together as a two-dimensional array through the values attribute.

df.values

array([['Lucy', 2, 'gray'],
       ['Bella', 2, 'tabby'],
       ['Lucy', 3, 'white'],
       ['Nala', 5, 'black']], dtype=object)

This is a numpy array with 4 rows and 3 columns

type(df.values)

numpy.ndarray

You can see the shape of the dataframe by calling df.shape, which returns a (rows, columns) tuple. It matches the shape of the numpy array returned by df.values. More generally, many operations on a pandas dataframe are carried out on numpy arrays under the hood.

df.shape, df.values.shape

((4, 3), (4, 3))
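
As a quick check of how shape relates to the index and the columns, here is a short sketch. It also uses to_numpy(), which the pandas documentation recommends over .values in new code; the variable names are illustrative.

```python
import pandas as pd

cats = pd.DataFrame({
    'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
    'age': [2, 2, 3, 5],
    'color': ['gray', 'tabby', 'white', 'black'],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = cats.shape
print(n_rows == len(cats.index), n_cols == len(cats.columns))

# to_numpy() returns the data as a numpy array, like .values does
arr = cats.to_numpy()
print(arr.shape, arr.dtype)
```

Because the columns mix text and integers, the resulting array has dtype object.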

Columns

You can see the list of columns in the dataframe by calling df.columns. For our dataframe, the result is the set of names we used as the keys of the dictionary we passed in when we created the dataframe.

df.columns

Index(['name', 'age', 'color'], dtype='object')
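
Since df.columns is an Index object rather than a plain list, it can be handy to convert it or test membership on it. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

cats = pd.DataFrame({
    'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
    'age': [2, 2, 3, 5],
    'color': ['gray', 'tabby', 'white', 'black'],
})

# Convert the column Index to a plain Python list
column_names = cats.columns.tolist()
print(column_names)

# Membership tests work directly on the Index
has_age = 'age' in cats.columns
has_weight = 'weight' in cats.columns
print(has_age, has_weight)
```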

Selecting columns

A column can be accessed by passing the column name into the bracket operator e.g. using df[col]. So to get the color column, use df['color']

df['color']

0     gray
1    tabby
2    white
3    black
Name: color, dtype: object

And to get the age column use df['age']. Notice that both the color and age columns include the index values [0,1,2,3] - because all columns in a dataframe share the same index.

df['age']

0    2
1    2
2    3
3    5
Name: age, dtype: int64

When you select a single column from a DataFrame you get a Series. If you want a new dataframe with just one column, use a column list with one item, e.g. df[['age']] instead of df['age']

df[['age']]
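
The difference between the two forms is easiest to see by comparing types and shapes. A small sketch (as_series and as_frame are illustrative names):

```python
import pandas as pd

cats = pd.DataFrame({
    'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
    'age': [2, 2, 3, 5],
    'color': ['gray', 'tabby', 'white', 'black'],
})

as_series = cats['age']    # single brackets give a Series
as_frame = cats[['age']]   # a one-item list gives a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series is one-dimensional with 4 values, while the DataFrame is two-dimensional with 4 rows and 1 column.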

Selecting columns - as a property

When a dataframe is created, its columns are exposed as properties of that dataframe object, so it's possible to use the dot . operator

df.age

0    2
1    2
2    3
3    5
Name: age, dtype: int64
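
Dot access is convenient, but it only works when the column name is a valid Python identifier and does not clash with something the dataframe already defines. The sketch below uses a made-up shelter dataframe (not part of the cats example) to show a case where bracket access is the safer choice.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space,
# the other clashes with the existing DataFrame.count method
shelter = pd.DataFrame({'cat name': ['Lucy', 'Nala'], 'count': [2, 1]})

# Bracket access always works
cat_names = shelter['cat name'].tolist()
counts = shelter['count'].tolist()
print(cat_names, counts)

# Dot access finds the count method, not the count column
count_attr_is_method = callable(shelter.count)
print(count_attr_is_method)
```

Writing shelter.cat name would not even parse, so brackets are the only option for that column.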

Selecting multiple columns

To select multiple columns pass in a list with the names of the columns.

df[['age', 'color']]

Notice the double brackets [[]]? The outer brackets are used for selections and the inner brackets are the list of columns. The code above is equivalent to this

age_and_color = ['age', 'color']
df[age_and_color]

When you select multiple columns from a dataframe, you get a new dataframe with just those columns.

Adding columns

To add a new column, use the same bracket operator [] as for selecting a column, and assign the new values to it

df['weight'] = [1.0,0.5,2.0,2.0]
df
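
New columns do not have to come from a list; they can also be computed from existing columns. Here is a rough sketch. The column names age_months and is_adult, and the age threshold of 3, are made up for illustration. It also shows assign, which returns a new dataframe instead of changing the original.

```python
import pandas as pd

cats = pd.DataFrame({
    'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
    'age': [2, 2, 3, 5],
    'color': ['gray', 'tabby', 'white', 'black'],
})

# Derive a new column from an existing one (element-wise arithmetic)
cats['age_months'] = cats['age'] * 12

# assign returns a new dataframe and leaves cats unchanged
with_flag = cats.assign(is_adult=cats['age'] >= 3)

print(cats.columns.tolist())
print(with_flag.columns.tolist())
```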

Dropping columns

To drop columns, pass the names of the columns to be dropped in a list to df.drop. This returns a new dataframe and leaves df itself unchanged.

df.drop(columns=['weight'])

Indexes

An absolute beginner to pandas may not need to know about indexes, especially since pandas creates indexes automatically for you. However, pandas considers indexes to be important for its internal operations, and as your use of pandas increases you will eventually bump into index issues.

Index

Since we did not specify an index, pandas automatically created one for us. The automatic index is a RangeIndex running from 0 up to (but not including) len(df).

df.index

RangeIndex(start=0, stop=4, step=1)

set_index

You can change the column used for the index using df.set_index, passing in the column(s) to use. Here we tell the dataframe to promote the column name to be the index.

df.set_index('name')

You can set multiple columns as the index. Here we tell the dataframe to promote name, and age to be used as the index.

df1 = df.set_index(['name', 'age'])
df1
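
One practical reason to promote a column to the index is label-based lookup with .loc. This post does not cover .loc, so treat the following as a preview sketch rather than a full explanation; the variable names are illustrative.

```python
import pandas as pd

cats = pd.DataFrame({
    'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
    'age': [2, 2, 3, 5],
    'color': ['gray', 'tabby', 'white', 'black'],
})

by_name = cats.set_index('name')

# Look up a single value by row label and column name
bella_age = by_name.loc['Bella', 'age']

# 'Lucy' appears twice in the index, so the lookup returns both rows
lucy_rows = by_name.loc['Lucy']

# With two index columns the index becomes a MultiIndex,
# and a row is addressed by a (name, age) tuple
by_name_age = cats.set_index(['name', 'age'])
lucy_3_color = by_name_age.loc[('Lucy', 3), 'color']

print(bella_age, len(lucy_rows), lucy_3_color)
```

Note that an index does not have to be unique, which is why the lookup for Lucy returns a dataframe with two rows.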

reset_index

The opposite of set_index, which promotes a column to the index, is reset_index, which demotes the index to a regular column. Here, for example, we set name as the index, and then reset it.

df2 = df.set_index('name')
df2

Resetting the index demotes name to being a regular column

df2.reset_index()

reset_index(drop=True)

Since reset_index turns an index into a regular column, an auto-generated range index would also become a column, which in most cases is not what you want. So, almost every time you reset such an index, you will want to drop it instead: use reset_index(drop=True). This happens to be very common in my Python code, especially after grouping operations where pandas creates MultiIndexes for me which I generally do not use.

df.reset_index(drop=True)
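
A common situation where drop=True is useful is after filtering rows, which leaves gaps in the row labels. The filter on age below is just an example scenario of mine, not something from the steps above.

```python
import pandas as pd

cats = pd.DataFrame({
    'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
    'age': [2, 2, 3, 5],
    'color': ['gray', 'tabby', 'white', 'black'],
})

# Filtering keeps the original row labels, leaving gaps in the index
older = cats[cats['age'] > 2]
print(older.index.tolist())

# drop=True discards the old labels and renumbers from 0
renumbered = older.reset_index(drop=True)
print(renumbered.index.tolist())

# Without drop=True the old labels come back as a column named 'index'
kept = older.reset_index()
print(kept.columns.tolist())
```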

Pandas Copying

Most pandas operations, like set_index, return a new dataframe rather than modifying the original. This is a pretty standard design pattern and it lets pandas avoid issues that arise from mutating dataframes. Returning a new object is usually not slower than modifying in place either, since many in-place operations still build a copy internally.

As an example of a copy being returned from a pandas operation, let's drop the weight column from df3, a copy of df

df3 = df.copy()
df3['weight'] = [1.0,0.5,2.0,2.0]
df3.drop(columns=['weight'])

Inspecting df3 shows that the weight column is still there.

df3

To change this behaviour, several pandas functions accept inplace=True, which modifies the dataframe itself instead of returning a copy. Now if we drop the column, it is removed from df3.

df3.drop(columns=['weight'], inplace=True)
df3
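
An alternative to inplace=True is to assign the returned dataframe back to the same variable. A minimal sketch of that idiom, using illustrative names:

```python
import pandas as pd

cats = pd.DataFrame({
    'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
    'age': [2, 2, 3, 5],
    'weight': [1.0, 0.5, 2.0, 2.0],
})

# drop returns a new dataframe; cats still has the weight column
dropped = cats.drop(columns=['weight'])
still_has_weight = 'weight' in cats.columns
print(still_has_weight, 'weight' in dropped.columns)

# Reassigning the result makes the name cats refer to the new dataframe
cats = cats.drop(columns=['weight'])
print('weight' in cats.columns)
```

Reassignment keeps each step explicit and also works with the many pandas methods that have no inplace parameter.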

Reading Data

pandas can read rectangular data from almost anything. Here is a list - or rather, a dataframe - containing all the pandas read_* functions. The exact list depends on the pandas version you have installed.

pd.DataFrame(data=[f for f in dir(pd) if f.startswith('read_')],
             columns=['function'])

	function
0	read_clipboard
1	read_csv
2	read_excel
3	read_feather
4	read_fwf
5	read_gbq
6	read_hdf
7	read_html
8	read_json
9	read_orc
10	read_parquet
11	read_pickle
12	read_sas
13	read_spss
14	read_sql
15	read_sql_query
16	read_sql_table
17	read_stata
18	read_table

This post is part of a series. You might also be interested in

Learning Pandas Part 2: Reading csvs
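
As a small taste of the read_* functions, here is a sketch that uses read_csv on CSV text held in memory, so it runs without any file on disk. The CSV content and variable names are made up for illustration.

```python
import io

import pandas as pd

# In-memory CSV text standing in for a file on disk
csv_text = "name,age,color\nLucy,2,gray\nBella,2,tabby\n"

# read_csv accepts a path or a file-like object such as StringIO
cats_from_csv = pd.read_csv(io.StringIO(csv_text))

print(cats_from_csv.shape)
print(cats_from_csv.columns.tolist())
```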

Learn Pandas Part 1 - The Basics