Pandas CSV

Overview

A pandas DataFrame is a container for rectangular data, but it is astonishingly flexible in what you can put inside.

You can put images inside a dataframe - if you conceive of an image as a rectangular block of numbers (see the sketch just after the import below). You can read a Word document into a pandas dataframe, with each paragraph occupying a cell. Even HTML can be read into a dataframe - or at least HTML tables, which are extremely common. And with this flexibility in what a dataframe can contain comes several ways to get data inside.

Naturally we start with import pandas as pd

import pandas as pd
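
As a quick illustration of the image idea, here is a minimal sketch; the 3x3 "image" below is made up, and a real one would come from an imaging library such as Pillow:

import numpy as np

pixels = np.array([[  0, 128, 255],   # a tiny made-up grayscale "image":
                   [ 64, 192,  32],   # just a rectangular block of numbers
                   [255,  16, 200]])

image_df = pd.DataFrame(pixels)       # the pixel grid fits naturally in a dataframe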

Read Functions

Pandas has several functions for reading data. To view them, let's list everything in pd that starts with read_

Tip: Use dir(obj) on a python object to view what's inside

[a for a in dir(pd) if a.startswith('read')]
['read_clipboard',
 'read_csv',
 'read_excel',
 'read_feather',
 'read_fwf',
 'read_gbq',
 'read_hdf',
 'read_html',
 'read_json',
 'read_orc',
 'read_parquet',
 'read_pickle',
 'read_sas',
 'read_spss',
 'read_sql',
 'read_sql_query',
 'read_sql_table',
 'read_stata',
 'read_table']

There are at least 19 read functions, covering a wide variety of data sources. You can read data from the clipboard, from HTML pages, or from SQL databases.
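
For example, pd.read_html scans a page for HTML tables and returns a list of DataFrames, one per table found. A minimal sketch, with a placeholder URL:

tables = pd.read_html('https://example.com/page-with-tables.html')  # placeholder URL
first_table = tables[0]   # read_html returns a list, one DataFrame per table on the page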

Tip: Use ? at the end of a function to see what it can do

pd.read_sql_table?

Reading CSVs

99% of the time I use pd.read_csv, because 99% of the time the data for a data science project is in a comma separated value file. I say 99% mainly to make a point, but it does hold for projects shared via the web. Inside companies CSVs are still popular, but you are also more likely to find data in databases, JSON files, or Excel sheets. Nevertheless, CSVs are very commonly used, and so pd.read_csv is your go-to means of reading data.

pd.read_csv is a pretty versatile function, and that versatility comes from sharing its machinery with the even more general pd.read_table. Mostly you can use pd.read_csv and pd.read_table interchangeably; the main difference is the default separator, which is a comma for pd.read_csv and a tab for pd.read_table.
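
A minimal sketch with a tiny in-memory CSV shows the two functions agreeing once you account for the default separator:

from io import StringIO

data = 'a,b\n1,2\n3,4'                             # a tiny in-memory CSV

via_csv   = pd.read_csv(StringIO(data))            # sep defaults to ','
via_table = pd.read_table(StringIO(data), sep=',') # sep defaults to '\t', so pass ','

via_csv.equals(via_table)                          # True - the dataframes are identical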

Where to get CSVs

CSVs are easy to find. Public data portals such as Data.gov, which we use below, host thousands of them.

Reading a CSV File

Firstly, pd.read_csv works identically on a local file and on a URL. To illustrate, we found a CSV file on Data.gov about runways in North Dakota, and we can load it either directly from the URL or from the local file after downloading it. In other words, these two file paths are interchangeable.

runways_in_north_dakota_url = 'https://gishubdata.nd.gov/sites/default/files/NDHUB.AirportRunways.csv'
runways_in_north_dakota_file = 'data/pandas/NDHUB.AirportRunways.csv'

Now we can read the data into a dataframe and then display it

nd_runways = pd.read_csv(runways_in_north_dakota_url)
nd_runways
OBJECTID LOCID SITENO FULLNAME FAA_ST UNIQUE_ID RWY_ID RWY_LEN RWY_WITH RWY_SFC_TY RWY_SFC_TR PAV_CLASS_ RWY_LIGHTS STFIPS ST_POSTAL VERSION SHAPE.LEN
0 1 ND02 17245.8*A RICKETYBACK FIELD ND 17245.8*A ND17/35 17/35 2600 100 TURF LOW 38 ND 9 0.007454
1 2 1A2 17249.1*A ARTHUR ND 17249.1*A ND17/35 17/35 3100 85 TURF-F NSTD 38 ND 9 0.008500
2 3 ASY 17253.*A ASHLEY MUNI ND 17253.*A ND08/26 08/26 2825 150 TURF-F 38 ND 9 0.011122
3 4 ASY 17253.*A ASHLEY MUNI ND 17253.*A ND14/32 14/32 4300 60 ASPH-G LOW 38 ND 9 0.013180
4 5 20U 17256.1*A BEACH ND 17256.1*A ND12/30 12/30 4200 60 ASPH-G AFSC MED 38 ND 9 0.014550
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
126 51 7N6 17353.*A GRENORA CENTENNIAL ND 17353.*A ND17/35 17/35 2600 100 TURF-F 38 ND 9 0.007128
127 128 ISN 17496.1*A SLOULIN FLD INTL ND 17496.1*A ND02/20 02/20 3453 60 ASPH-F MED 38 ND 9 0.010863
128 129 ISN 17496.1*A SLOULIN FLD INTL ND 17496.1*A ND11/29 11/29 6650 100 ASPH-F PFC MED 38 ND 9 0.024758
129 130 6L5 17503.*A WISHEK MUNI ND 17503.*A ND14/32 14/32 3450 60 ASPH-G NSTD 38 ND 9 0.010827
130 131 ND26 17244.3*A GAJEWSKI FLD ND 17244.3*A ND18/36 18/36 2300 40 TURF LOW 38 ND 9 0.005000

131 rows × 17 columns
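
If you have downloaded the file, you can verify that the local path really is interchangeable with the URL (assuming the local copy is the same file):

nd_runways_local = pd.read_csv(runways_in_north_dakota_file)
nd_runways_local.equals(nd_runways)   # True if the local copy matches the download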

So now you can see the value and the power of using pandas for reading CSVs - with just one line of code. And since there is a lot of valuable data available in CSVs, knowing how to read it into pandas for processing is a valuable skill indeed.

CSV data can get pretty varied, so you will need to know how to use some of the parameters of read_csv depending on the data you are reading. As noted before, you can use ? at the end of a function in a Jupyter notebook to display its documentation. Here we do something slightly different - we load that documentation into a pandas DataFrame and display it, using pd.read_csv.__doc__ to get the docstring.

Tip: You can use __doc__ to access the documentation for a python function or object

Pandas Read CSV Documentation

The function documentation isn't exactly rectangular data - but a data scientist takes data from one shape to another. The code below gets the docstring, then does some stripping and splitting until we have a DataFrame of parameters and their descriptions.

Note: this code won't work for every docstring, since the format varies, but it works for this function

read_csv_parameters = pd.DataFrame(      # Create a DataFrame

    [line.split(':', 1) for line         # For every line, split on the first colon only
      in pd.read_csv.__doc__.split('\n') # Get the docstring and split it into lines
       if ' :' in line                   # Keep the line if " :" is in it
         and not line.startswith(' ')    # and it has no leading spaces
         and ',' in line],               # and it contains a comma

    columns=['parameter', 'description'] # Set the column names
).applymap(lambda x: x.strip())          # Strip surrounding spaces from every cell

Concepts covered in the code above

  • list comprehensions
  • filters in list comprehensions
  • Splitting strings
  • Creating a dataframe from a list of lists
  • Using applymap to strip spaces from all cells of a dataframe

So now we have the dataframe. The code below makes it look a little nicer by using the dataframe styler; I use it to highlight the parameters I reach for most often, depending on the need.

Tip: Learn how to use dataframe styling... you'll probably get paid more :-)

params_to_highlight = ['filepath_or_buffer','delimiter', 'usecols','dtype',
              'sep', 'nrows','skiprows', 'low_memory', 'names']

def highlight_parameter(param):
    # Highlight a cell if its value is one of the chosen parameter names
    is_key = param.isin(params_to_highlight)
    return ['background-color: #a7b9f1; font-weight:bold' if v else '' for v in is_key]

read_csv_parameters.style.hide_index().apply(highlight_parameter, axis=1)
parameter description
filepath_or_buffer str, path object or file-like object
sep str, default ','
delimiter str, default ``None``
header int, list of int, default 'infer'
names array-like, optional
index_col int, str, sequence of int / str, or False, default ``None``
usecols list-like or callable, optional
squeeze bool, default False
prefix str, optional
mangle_dupe_cols bool, default True
dtype Type name or dict of column -> type, optional
engine {'c', 'python'}, optional
converters dict, optional
true_values list, optional
false_values list, optional
skipinitialspace bool, default False
skiprows list-like, int or callable, optional
skipfooter int, default 0
nrows int, optional
na_values scalar, str, list-like, or dict, optional
keep_default_na bool, default True
na_filter bool, default True
verbose bool, default False
skip_blank_lines bool, default True
parse_dates bool or list of int or names or list of lists or dict, default False
infer_datetime_format bool, default False
keep_date_col bool, default False
date_parser function, optional
dayfirst bool, default False
cache_dates bool, default True
iterator bool, default False
chunksize int, optional
compression {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
thousands str, optional
decimal str, default '.'
lineterminator str (length 1), optional
quotechar str (length 1), optional
quoting int or csv.QUOTE_* instance, default 0
doublequote bool, default ``True``
escapechar str (length 1), optional
comment str, optional
encoding str, optional
dialect str or csv.Dialect, optional
error_bad_lines bool, default True
warn_bad_lines bool, default True
delim_whitespace bool, default False
low_memory bool, default True
memory_map bool, default False
float_precision str, optional

Common Parameters

parameter description
filepath_or_buffer The name or url of the file. You always need this
sep What separates the cells, if not ","
delimiter What separates the cells, if not ",". Alias for sep
names The column names to use for all the columns in the CSV
usecols Which of the columns to use
dtype Use a dict to pass in the datatypes for each column. Pretty useful
skiprows How many rows to skip at the start of the file
nrows How many rows to use in total
low_memory If the file is large and you have lots of memory, use low_memory=False
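
To close the loop, here is a sketch that combines several of the highlighted parameters on the runways file; the column names are taken from the output shown earlier, and the exact dtypes are an assumption:

runway_subset = pd.read_csv(
    runways_in_north_dakota_file,
    usecols=['FULLNAME', 'RWY_LEN', 'RWY_WITH'],  # keep just three of the 17 columns
    dtype={'RWY_LEN': int, 'RWY_WITH': int},      # assumed integer types for these columns
    nrows=10,                                     # read only the first 10 data rows
    low_memory=False,                             # process the whole file in one pass
)

skiprows works the other way around: skiprows=5 would ignore the first five lines of the file, with the header then read from the sixth.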

Conclusion

This was a brief walkthrough of pd.read_csv, which we saw is pretty useful for reading data into pandas from CSVs. We looked at how to get public data to practice reading CSVs, and also how to use the documentation to learn more.

This post is part of a series. You might also be interested in

Learning Pandas Part 1: The Basics