Pandas CSV

Overview

A pandas DataFrame is a container for rectangular data, but it is astonishingly flexible in what you can put inside.

You can put images inside a dataframe - if you conceive of an image as a rectangular block of numbers (see the sketch just after the import below). You can read a Word document into a pandas dataframe, with each paragraph occupying a cell. Even HTML can be read into a dataframe - or at least HTML tables, which are extremely common. And with this flexibility in what a dataframe can contain comes several ways to get data inside.

Naturally we start with import pandas as pd

import pandas as pd
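
As a quick illustration of the image idea, here is a minimal sketch; the 3x3 "image" below is made up, and a real one would come from an imaging library such as Pillow:

import numpy as np

pixels = np.array([[  0, 128, 255],   # a tiny made-up grayscale "image":
                   [ 64, 192,  32],   # just a rectangular block of numbers
                   [255,  16, 200]])

image_df = pd.DataFrame(pixels)       # the pixel grid fits naturally in a dataframe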

Read Functions

Pandas has several functions for reading data. To view them, let's list everything in pd that starts with read_

Tip: Use dir(obj) on a python object to view what's inside

[a for a in dir(pd) if a.startswith('read')]
['read_clipboard',
 'read_csv',
 'read_excel',
 'read_feather',
 'read_fwf',
 'read_gbq',
 'read_hdf',
 'read_html',
 'read_json',
 'read_orc',
 'read_parquet',
 'read_pickle',
 'read_sas',
 'read_spss',
 'read_sql',
 'read_sql_query',
 'read_sql_table',
 'read_stata',
 'read_table']

There are at least 19 read functions, covering a wide variety of data sources. You can read data from the clipboard, from HTML pages, or from SQL databases.
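
For example, pd.read_html scans a page for HTML tables and returns a list of DataFrames, one per table found. A minimal sketch, with a placeholder URL:

tables = pd.read_html('https://example.com/page-with-tables.html')  # placeholder URL
first_table = tables[0]   # read_html returns a list, one DataFrame per table on the page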

Tip: Use ? at the end of a function to see what it can do

pd.read_sql_table?

Reading CSVs

99% of the time I use pd.read_csv, because 99% of the time the data for a data science project is in a comma separated value file. I say 99% mainly to make a point, but it does hold for projects shared via the web. Inside companies CSVs are still popular, but you are also more likely to find data in databases, JSON files, or Excel sheets. Nevertheless, CSVs are very commonly used, and so pd.read_csv is your go-to means of reading data.

pd.read_csv is a pretty versatile function, and that versatility comes from sharing its machinery with the even more general pd.read_table. Mostly you can use pd.read_csv and pd.read_table interchangeably; the main difference is the default separator, which is a comma for pd.read_csv and a tab for pd.read_table.
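
A minimal sketch with a tiny in-memory CSV shows the two functions agreeing once you account for the default separator:

from io import StringIO

data = 'a,b\n1,2\n3,4'                             # a tiny in-memory CSV

via_csv   = pd.read_csv(StringIO(data))            # sep defaults to ','
via_table = pd.read_table(StringIO(data), sep=',') # sep defaults to '\t', so pass ','

via_csv.equals(via_table)                          # True - the dataframes are identical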

Where to get CSVs

CSVs are easy to find. Public data portals such as Data.gov, which we use below, host thousands of them.

Reading a CSV File

Firstly, pd.read_csv works identically on a local file and on a URL. To illustrate, we found a CSV file on Data.gov about runways in North Dakota, and we can load it either directly from the URL or from the local file after downloading it. In other words, these two file paths are interchangeable.

runways_in_north_dakota_url = 'https://gishubdata.nd.gov/sites/default/files/NDHUB.AirportRunways.csv'
runways_in_north_dakota_file = 'data/pandas/NDHUB.AirportRunways.csv'

Now we can read the data into a dataframe and then display it

nd_runways = pd.read_csv(runways_in_north_dakota_url)
nd_runways
OBJECTID LOCID SITENO FULLNAME FAA_ST UNIQUE_ID RWY_ID RWY_LEN RWY_WITH RWY_SFC_TY RWY_SFC_TR PAV_CLASS_ RWY_LIGHTS STFIPS ST_POSTAL VERSION SHAPE.LEN
0 1 ND02 17245.8*A RICKETYBACK FIELD ND 17245.8*A ND17/35 17/35 2600 100 TURF LOW 38 ND 9 0.007454
1 2 1A2 17249.1*A ARTHUR ND 17249.1*A ND17/35 17/35 3100 85 TURF-F NSTD 38 ND 9 0.008500
2 3 ASY 17253.*A ASHLEY MUNI ND 17253.*A ND08/26 08/26 2825 150 TURF-F 38 ND 9 0.011122
3 4 ASY 17253.*A ASHLEY MUNI ND 17253.*A ND14/32 14/32 4300 60 ASPH-G LOW 38 ND 9 0.013180
4 5 20U 17256.1*A BEACH ND 17256.1*A ND12/30 12/30 4200 60 ASPH-G AFSC MED 38 ND 9 0.014550
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
126 51 7N6 17353.*A GRENORA CENTENNIAL ND 17353.*A ND17/35 17/35 2600 100 TURF-F 38 ND 9 0.007128
127 128 ISN 17496.1*A SLOULIN FLD INTL ND 17496.1*A ND02/20 02/20 3453 60 ASPH-F MED 38 ND 9 0.010863
128 129 ISN 17496.1*A SLOULIN FLD INTL ND 17496.1*A ND11/29 11/29 6650 100 ASPH-F PFC MED 38 ND 9 0.024758
129 130 6L5 17503.*A WISHEK MUNI ND 17503.*A ND14/32 14/32 3450 60 ASPH-G NSTD 38 ND 9 0.010827
130 131 ND26 17244.3*A GAJEWSKI FLD ND 17244.3*A ND18/36 18/36 2300 40 TURF LOW 38 ND 9 0.005000

131 rows × 17 columns
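
If you have downloaded the file, you can verify that the local path really is interchangeable with the URL (assuming the local copy is the same file):

nd_runways_local = pd.read_csv(runways_in_north_dakota_file)
nd_runways_local.equals(nd_runways)   # True if the local copy matches the download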

So now you can see the value and the power of using pandas for reading CSVs - with just one line of code. And since there is a lot of valuable data available in CSVs, knowing how to read it into pandas for processing is a valuable skill indeed.

CSV data can get pretty varied, so you will need to know how to use some of the parameters of read_csv depending on the data you are reading. As noted before, you can use ? at the end of a function in a Jupyter notebook to display its documentation. Here we do something slightly different - we load that documentation into a pandas DataFrame and display it, using pd.read_csv.__doc__ to get the docstring.

Tip: You can use __doc__ to access the documentation for a python function or object

Pandas Read CSV Documentation

The function documentation isn't exactly rectangular data - but a data scientist takes data from one shape to another. The code below gets the docstring, then does some stripping and splitting until we have a DataFrame of parameters and their descriptions.

Note: this code won't work for every docstring, since the format varies, but it works for this function

read_csv_parameters = pd.DataFrame(      # Create a DataFrame

    [line.split(':', 1) for line         # For every line, split on the first colon only
      in pd.read_csv.__doc__.split('\n') # Get the docstring and split it into lines
       if ' :' in line                   # Keep the line if " :" is in it
         and not line.startswith(' ')    # and it has no leading spaces
         and ',' in line],               # and it contains a comma

    columns=['parameter', 'description'] # Set the column names
).applymap(lambda x: x.strip())          # Strip surrounding spaces from every cell

Concepts covered in the code above

  • list comprehensions
  • filters in list comprehensions
  • Splitting strings
  • Creating a dataframe from a list of lists
  • Using applymap to strip spaces from all cells of a dataframe

So now we have the dataframe. The code below makes it look a little nicer by using the dataframe styler; I use it to highlight the parameters I reach for most often, depending on the need.

Tip: Learn how to use dataframe styling... you'll probably get paid more :-)

params_to_highlight = ['filepath_or_buffer','delimiter', 'usecols','dtype',
              'sep', 'nrows','skiprows', 'low_memory', 'names']

def highlight_parameter(param):
    # Highlight a cell if its value is one of the chosen parameter names
    is_key = param.isin(params_to_highlight)
    return ['background-color: #a7b9f1; font-weight:bold' if v else '' for v in is_key]

read_csv_parameters.style.hide_index().apply(highlight_parameter, axis=1)
parameter description
filepath_or_buffer str, path object or file-like object
sep str, default ','
delimiter str, default ``None``
header int, list of int, default 'infer'
names array-like, optional
index_col int, str, sequence of int / str, or False, default ``None``
usecols list-like or callable, optional
squeeze bool, default False
prefix str, optional
mangle_dupe_cols bool, default True
dtype Type name or dict of column -> type, optional
engine {'c', 'python'}, optional
converters dict, optional
true_values list, optional
false_values list, optional
skipinitialspace bool, default False
skiprows list-like, int or callable, optional
skipfooter int, default 0
nrows int, optional
na_values scalar, str, list-like, or dict, optional
keep_default_na bool, default True
na_filter bool, default True
verbose bool, default False
skip_blank_lines bool, default True
parse_dates bool or list of int or names or list of lists or dict, default False
infer_datetime_format bool, default False
keep_date_col bool, default False
date_parser function, optional
dayfirst bool, default False
cache_dates bool, default True
iterator bool, default False
chunksize int, optional
compression {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
thousands str, optional
decimal str, default '.'
lineterminator str (length 1), optional
quotechar str (length 1), optional
quoting int or csv.QUOTE_* instance, default 0
doublequote bool, default ``True``
escapechar str (length 1), optional
comment str, optional
encoding str, optional
dialect str or csv.Dialect, optional
error_bad_lines bool, default True
warn_bad_lines bool, default True
delim_whitespace bool, default False
low_memory bool, default True
memory_map bool, default False
float_precision str, optional

Common Parameters

parameter description
filepath_or_buffer The name or url of the file. You always need this
sep What separates the cells, if not ","
delimiter What separates the cells, if not ",". Alias for sep
names The column names to use for all the columns in the CSV
usecols Which of the columns to use
dtype Use a dict to pass in the datatypes for each column. Pretty useful
skiprows How many rows to skip at the start of the file
nrows How many rows to use in total
low_memory If the file is large and you have lots of memory, use low_memory=False
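
To close the loop, here is a sketch that combines several of the highlighted parameters on the runways file; the column names are taken from the output shown earlier, and the exact dtypes are an assumption:

runway_subset = pd.read_csv(
    runways_in_north_dakota_file,
    usecols=['FULLNAME', 'RWY_LEN', 'RWY_WITH'],  # keep just three of the 17 columns
    dtype={'RWY_LEN': int, 'RWY_WITH': int},      # assumed integer types for these columns
    nrows=10,                                     # read only the first 10 data rows
    low_memory=False,                             # process the whole file in one pass
)

skiprows works the other way around: skiprows=5 would ignore the first five lines of the file, with the header then read from the sixth.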

Conclusion

This was a brief walkthrough of pd.read_csv, which we saw is pretty useful for reading data into pandas from CSVs. We looked at how to get public data to practice reading CSVs, and also how to use the documentation to learn more.

This post is part of a series. You might also be interested in

Learning Pandas Part 1: The Basics