Learn Pandas Part 1 - The Basics
In this post we learn the basics of pandas dataframes.
import pandas as pd
Key Ideas
There are two primary objects - DataFrame and Series. Most commonly, you will work with dataframes, which are similar to Excel worksheets or database tables. You can think of a DataFrame as a rectangular data structure made up of a number of columns. Each of those columns is a Series.
Every dataframe or series has an Index, a set of row labels that you can think of as a special column. If you create a dataframe without specifying an Index, pandas will create a numeric index equivalent to range(0, len(df)). Series in the same dataframe share the same index.
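To make the default index concrete, here is a minimal sketch. The small pets dataframe is a made-up example used only for illustration; it checks that the automatically created index counts from 0 and that every column carries the same index as its dataframe.

```python
import pandas as pd

# Hypothetical two-column dataframe, built without specifying an index.
pets = pd.DataFrame({'name': ['Lucy', 'Bella', 'Nala'], 'age': [2, 2, 5]})

# With no index given, pandas creates a RangeIndex from 0 to len(pets).
print(pets.index)
print(list(pets.index) == list(range(0, len(pets))))  # True

# Each column is a Series that carries the same index as the dataframe.
print(pets['name'].index.equals(pets.index))  # True
print(pets['age'].index.equals(pets.index))   # True
```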
Let's create three lists that contain information about cats
names = ['Lucy', 'Bella','Lucy','Nala']
ages = [2,2,3,5]
colors = ['gray', 'tabby', 'white', 'black']
Now we create a dataframe where each of the lists becomes a column, or in pandas terminology - a series.
df = pd.DataFrame({'name': names, 'age': ages, 'color':colors})
df
df.values
df.values is a NumPy array with 4 rows and 3 columns.
type(df.values)
You can see the shape of the dataframe with df.shape, which is an attribute rather than a method. It matches the shape of the NumPy array returned by df.values. Pandas keeps its data in NumPy arrays (or similar array types) internally, so many dataframe operations end up as operations on those arrays.
df.shape, df.values.shape
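As a quick check of the relationship between a dataframe and its array of values, the sketch below rebuilds the same cats dataframe so that it runs on its own. It only confirms that the two shapes agree and says nothing about how pandas stores the data internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

values = df.values

# The values come back as a NumPy array.
print(isinstance(values, np.ndarray))  # True

# shape is (number of rows, number of columns) for both objects.
print(df.shape)                  # (4, 3)
print(values.shape)              # (4, 3)
print(df.shape == values.shape)  # True

# The same numbers are available from len().
print(len(df), len(df.columns))  # 4 3
```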
df.columns
A column can be accessed by passing the column name into the bracket operator, e.g. df[col]. So to get the color column, use df['color']
df['color']
And to get the age column, use df['age']. Notice that both the color and age columns include the index values [0, 1, 2, 3], because all columns in a dataframe share the same index.
df['age']
When you select a single column from a dataframe you get a series. If you want a new dataframe with just one column, use a column list with one item, e.g. df[['age']] instead of df['age']
df[['age']]
df.age
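The difference between the two selection styles is easiest to see by checking the types. This is a small sketch that rebuilds the cats dataframe so it can run on its own; the variable names single and frame are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

single = df['age']    # single brackets: a Series
frame = df[['age']]   # a one-item list of columns: a DataFrame

print(type(single).__name__)  # Series
print(type(frame).__name__)   # DataFrame

# A Series is one-dimensional, a DataFrame is two-dimensional.
print(single.shape)  # (4,)
print(frame.shape)   # (4, 1)

# Attribute access returns the same Series as bracket access.
print(df.age.equals(df['age']))  # True
```

Attribute access such as df.age only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so bracket access is the safer default.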
df[['age', 'color']]
Notice the double brackets [[]]? The outer brackets are used for selection and the inner brackets form the list of columns. The code above is equivalent to this:
age_and_color = ['age', 'color']
df[age_and_color]
When you select multiple columns from a dataframe, you get a new dataframe with just those columns.
df['weight'] = [1.0,0.5,2.0,2.0]
df
df.drop(columns=['weight'])
Indexes
An absolute beginner to pandas may not need to know about indexes, especially since pandas creates indexes automatically for you. However, pandas considers indexes to be important for its internal operations, and as your use of pandas increases you will eventually bump into index issues.
set_index
You can change the column used for the index with df.set_index, passing in the column(s) to use. Here we tell the dataframe to promote the name column to be the index.
df.index
df.set_index('name')
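Once a column is promoted to the index, rows can be looked up by label with .loc. The sketch below rebuilds the cats dataframe so it runs on its own; by_name is an illustrative variable name. Because two cats are called Lucy, the lookup returns two rows, which is a small taste of the index issues mentioned earlier.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

by_name = df.set_index('name')

# name is now the index, so it no longer appears among the columns.
print(list(by_name.columns))  # ['age', 'color']

# The original dataframe is unchanged.
print('name' in df.columns)  # True

# Index labels do not have to be unique: both Lucy rows come back.
print(by_name.index.is_unique)             # False
print(len(by_name.loc['Lucy']))            # 2
print(list(by_name.loc['Lucy', 'color']))  # ['gray', 'white']
```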
You can set multiple columns as the index, which produces a MultiIndex. Here we tell the dataframe to promote name and age to be used as the index.
df1 = df.set_index(['name', 'age'])
df1
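With two columns in the index, each row is labelled by a (name, age) pair. The following sketch rebuilds the cats dataframe to stay self-contained and shows one way to inspect and use such an index; the lookups chosen here are only examples.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

df1 = df.set_index(['name', 'age'])

# The index is now a MultiIndex with one level per promoted column.
print(type(df1.index).__name__)  # MultiIndex
print(list(df1.index.names))     # ['name', 'age']

# Only color is left as a regular column.
print(list(df1.columns))  # ['color']

# Rows are identified by (name, age) tuples.
print(('Lucy', 3) in df1.index)        # True
print(df1.loc[('Lucy', 3), 'color'])   # white
```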
df2 = df.set_index('name')
df2
Resetting the index demotes name to being a regular column
df2.reset_index()
reset_index(drop=True)
Since reset_index turns an index into a regular column, an auto-generated range index would also become a column, which in most cases is not what you want. So, almost every time you reset the index, you will want to drop it instead. To do that, use reset_index(drop=True). This happens to be very common in my Python code, especially after grouping operations where pandas creates MultiIndexes for me which I generally do not use.
df.reset_index(drop=True)
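A common place where this matters is after filtering rows, because the surviving rows keep their old labels. The filter on age below is an assumption added for illustration and is not part of the original walkthrough; the sketch compares reset_index() with reset_index(drop=True) on the result.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                   'age': [2, 2, 3, 5],
                   'color': ['gray', 'tabby', 'white', 'black']})

# Filtering keeps the original labels, so the index now starts at 2.
older = df[df['age'] > 2]
print(list(older.index))  # [2, 3]

# A plain reset_index() turns the old labels into a column named 'index'.
with_col = older.reset_index()
print(list(with_col.columns))  # ['index', 'name', 'age', 'color']

# drop=True discards the old labels and renumbers from 0.
clean = older.reset_index(drop=True)
print(list(clean.columns))  # ['name', 'age', 'color']
print(list(clean.index))    # [0, 1]
```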
Pandas Copying
Most pandas operations, like set_index, return a new dataframe rather than changing the original. This is a pretty standard design pattern, and it lets pandas avoid issues that arise from mutating dataframes. Returning a new object is usually not slower than modifying in place either, since many in-place operations still build a new object internally.
As an example of a copy being returned from a pandas operation, let's drop the weight column from df3, a copy of df.
df3 = df.copy()
df3['weight'] = [1.0,0.5,2.0,2.0]
df3.drop(columns=['weight'])
Inspecting df3 shows that the weight column is still there.
df3
To change this behaviour of returning copies, several pandas functions allow you to specify inplace=True. Now if we drop the column, df3 itself is modified.
df3.drop(columns=['weight'], inplace=True)
df3
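The two behaviours can be compared side by side. This is a minimal sketch that builds df3 directly, so it does not depend on earlier cells; the variable names dropped and result are illustrative.

```python
import pandas as pd

df3 = pd.DataFrame({'name': ['Lucy', 'Bella', 'Lucy', 'Nala'],
                    'age': [2, 2, 3, 5],
                    'color': ['gray', 'tabby', 'white', 'black'],
                    'weight': [1.0, 0.5, 2.0, 2.0]})

# Default behaviour: drop returns a new dataframe and leaves df3 alone.
dropped = df3.drop(columns=['weight'])
print('weight' in dropped.columns)  # False
print('weight' in df3.columns)      # True

# With inplace=True, df3 itself is modified and the call returns None.
result = df3.drop(columns=['weight'], inplace=True)
print(result)                   # None
print('weight' in df3.columns)  # False
```

A common alternative to inplace=True is to reassign the result, as in df3 = df3.drop(columns=['weight']), which keeps the copy-returning style while still updating the variable.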
pd.DataFrame(data=[f for f in dir(pd) if f.startswith('read_')],
             columns=['function'])
This post is part of a series.