What is Pycaret

Pycaret is a new machine learning library that simplifies machine learning workflows. It is a high-level library that sits on top of the most popular ML libraries: general-purpose libraries such as scikit-learn, tree-based libraries such as lightgbm, xgboost and catboost, and NLP libraries such as nltk, gensim and spacy. In a sense, pycaret shifts the focus from implementing mathematical algorithms to implementing ML workflows. Functions such as setup, create_model, predict_model and plot_model each perform one part of a typical data science project, while abstracting away how the underlying libraries operate.

In this sense it is a little like the build tools Maven and Gradle, which do not really care how the underlying tools in your project operate but focus instead on the workflow of producing build artifacts at the end. Pycaret is more of an ML workflow library, with a set of opinionated workflow steps that guide how a data science project is created.

If you recall, Maven build files are written in XML, the format most familiar to developers at the time. Jupyter notebooks have the same familiarity for data scientists now - most data scientists use them. Pycaret is unapologetically intended for use in Jupyter notebooks - several of its functions output data to be displayed in the notebook as the data science project is being created. For example, when you load data using get_data, it displays the loaded data in the notebook. This makes use of IPython.display. Pycaret also uses ipywidgets to show tabs and other UI elements that assist a data scientist while working on a project. This is familiar to me - I have written a few Jupyter-first libraries of my own.

Getting Started

To get started, install pycaret. In the best-case scenario it's as simple as

pip install pycaret

For more details, go to the Pycaret Install page.

Included Datasets

Pycaret includes several datasets to get started with. To use the included data, import the datasets package from pycaret.

from pycaret.datasets import get_data

Included with pycaret is a small dataset called juice. It has information about 1070 purchases of Citrus Hill or Minute Maid orange juice. It was originally meant for R projects and so is hosted on rdrr.io, a site for searching R-related packages and data. Because the data is small, it is conveniently included in the pycaret library.

To get the data use the get_data function. This conveniently also displays the data in the notebook.

juice_dataset = get_data('juice')

The data seems complete and mostly numeric. The only weird thing about the data is the Id column, which we do not need as a feature because pandas already gives us an index. So let's move it out of the columns by making it the index.

if 'Id' in juice_dataset:
    juice_dataset = juice_dataset.set_index('Id')
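
As a small aside, set_index keeps the Id values as row labels, whereas drop would discard them entirely. The toy frame below uses made-up values (only the column names mirror the juice data) to show the difference; it is just an illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the juice data.
toy = pd.DataFrame({"Id": [1, 2, 3], "Purchase": ["CH", "CH", "MM"]})

indexed = toy.set_index("Id")      # Id becomes the row index
dropped = toy.drop(columns="Id")   # Id is discarded, the default 0..n-1 index stays

print(list(indexed.index), list(indexed.columns))
print(list(dropped.index), list(dropped.columns))
```

Either way the Id column no longer shows up as a feature, which is what matters for modelling.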

Get Data Niceties

The get_data function has a couple of nice options. With save you can save a copy of the data to the local filesystem. With profile you can output a pandas profiling report. You can see more details about get_data by running get_data?? in a code cell.

Pandas Profiling Report

When you include the profile parameter, get_data will output a report created using the pandas_profiling library. This gives a lot of information about the dataset and its features. Normally this helps in the data exploration phase of a data science project, but it's nice to also have it included in a function to get data.


More about the orange juice dataset

We can find out more about the juice dataset by visiting the webpage at rdrr.io. Alternatively, we can grab the HTML from the site and convert the data description into a pandas dataframe. For this we will use requests and BeautifulSoup. Here is that code.

#collapse
import pandas as pd
from bs4 import BeautifulSoup
import requests

pd.options.display.max_colwidth = 120

# 1. Grab the HTML from the website and fail early on an HTTP error
r = requests.get('https://rdrr.io/cran/ISLR/man/OJ.html')
r.raise_for_status()

# 2. And convert it to a BeautifulSoup object (naming the parser avoids a warning)
soup = BeautifulSoup(r.text, 'html.parser')

def find_cells(soup):
    for dl in soup.find_all('dl'):
        for dt in dl.find_all('dt'):
            # 3. Once we find the <dt> element return it along with its matching <dd> element
            dd = dt.find_next_sibling('dd')
            if dd is not None:
                yield dt.text.strip(), dd.text.strip().replace('\n', ' ')

# 4. Convert the generator to a list of tuples
rows = list(find_cells(soup))

# 5. Now create a dataframe from the list of tuples
# (hide_index() was removed in pandas 2.0; use .style.hide(axis='index') there)
pd.DataFrame(rows, columns=['Feature', 'Description']).style.hide_index()
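
If BeautifulSoup is not at hand, the same idea of pairing each dt with the dd that follows it can be sketched with the standard library's html.parser. Note that this is a different technique from the code above, the DefinitionListParser class is my own hypothetical helper, and sample_html is a hand-written stand-in rather than the real page's markup.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect (term, description) pairs from <dt>/<dd> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []    # finished (term, description) tuples
        self._tag = None   # "dt" or "dd" while inside one, else None
        self._term = ""
        self._desc = ""

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            # A new term starts: reset both buffers.
            self._term, self._desc = "", ""
        if tag in ("dt", "dd"):
            self._tag = tag

    def handle_data(self, data):
        if self._tag == "dt":
            self._term += data
        elif self._tag == "dd":
            self._desc += data

    def handle_endtag(self, tag):
        if tag == "dd":
            # Collapse newlines and repeated spaces into single spaces.
            self.pairs.append((self._term.strip(), " ".join(self._desc.split())))
        if tag in ("dt", "dd"):
            self._tag = None


# Hand-written stand-in for the page; the real markup may differ.
sample_html = """
<dl>
  <dt><code>PriceCH</code></dt>
  <dd><p>Price charged
  for CH</p></dd>
  <dt><code>PriceMM</code></dt>
  <dd><p>Price charged for MM</p></dd>
</dl>
"""

parser = DefinitionListParser()
parser.feed(sample_html)
print(parser.pairs)
```

The resulting list of tuples can be passed to pd.DataFrame in the same way as in step 5 above.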

Preparing Data

First, we split the data into a training set and a test set. It is a small dataset, but we can probably get by with an 80/20 split, especially since we will use tree-based or similar algorithms that can handle small datasets.

train = juice_dataset.sample(frac=0.80)
unseen_data = juice_dataset.drop(train.index).reset_index(drop=True)

train.shape, unseen_data.shape

((856, 18), (214, 18))
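
One thing to keep in mind is that sample draws different rows on every run unless a seed is given. Here is a minimal sketch of a repeatable version of the same split, using a made-up stand-in frame with 1070 rows; the seed value 42 and the names demo, train_demo and unseen_demo are arbitrary choices for illustration.

```python
import pandas as pd

# Stand-in frame with the same number of rows as the juice data (1070);
# the single column is made up for illustration.
demo = pd.DataFrame({"x": range(1070)})

# random_state is an arbitrary seed, chosen here only to make the split repeatable.
train_demo = demo.sample(frac=0.80, random_state=42)
unseen_demo = demo.drop(train_demo.index).reset_index(drop=True)

print(train_demo.shape, unseen_demo.shape)
```

The row counts match the 856/214 split shown above, and rerunning the cell gives the same rows each time.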

The setup function

At the heart of Pycaret is a powerful machine learning pipeline that takes the input data and performs a huge part of the traditional data processing workflow. This pipeline is started by calling the setup function. setup initializes the environment and creates the data transformation pipeline that prepares the data for model training.

It is a bit of a magic function - a lot of work is done inside it, and it has to be called before any pycaret model development starts. You call setup on your training data and tell it the target variable, and it prepares everything for model training. Importantly, it does some of the work that data scientists would otherwise do manually at this point, including making decisions about preprocessing steps such as applying PCA or removing outliers. How well it does this remains to be seen, but it is certainly a big change to a data scientist's workflow if this can be done automatically. A lot of steps are done automatically by pycaret, as we will see later.

Regardless, setup displays a dataframe with information it discovers about the data's features, along with some of the decisions that were taken. This gives a nice platform for ML modeling.

from pycaret.classification import *

juice_prep = setup(train, target='Purchase')

 
Setup Succesfully Completed!
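
To get a feel for the kind of work setup takes over, here is a rough scikit-learn sketch. It is not pycaret's implementation. It only mirrors three things visible in the setup output: a hold-out split (599 and 257 rows out of 856 is consistent with roughly 70/30), mean imputation for numeric features, and the CH: 0, MM: 1 label encoding. The data is randomly generated, and the one-hot encoding of Store7 is my assumption about how a categorical feature might be handled.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Randomly generated stand-in for the 856-row training frame.
rng = np.random.default_rng(0)
n_rows = 856
frame = pd.DataFrame({
    "PriceCH": rng.normal(1.8, 0.1, n_rows),
    "LoyalCH": rng.uniform(0, 1, n_rows),
    "Store7": rng.choice(["No", "Yes"], n_rows),
    "Purchase": rng.choice(["CH", "MM"], n_rows),
})
# Inject a few missing values so the imputer has something to do.
frame.loc[frame.index[::50], "PriceCH"] = np.nan

X = frame.drop(columns="Purchase")
y = (frame["Purchase"] == "MM").astype(int)  # CH: 0, MM: 1

# Hold out 30% of the rows, as the setup output suggests.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="mean"), ["PriceCH", "LoyalCH"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Store7"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training part only, then apply the same transform to the hold-out part.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Fitting the transformer on the training part only keeps information from the hold-out rows from leaking into the imputed means.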

Compare Models

Once the data is prepared, a data scientist would now choose among candidate models. Again, there is a magic function, compare_models, which does this for you. It runs several models - 15 as shown below - on the data, and displays key metrics to allow you to choose among the model types for the next step.

compare_models()
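
Conceptually, a comparison like this boils down to cross-validating each candidate and sorting by a metric. The sketch below does that by hand with scikit-learn for three of the model types. The synthetic data from make_classification, the choice of 10 folds and the names candidates and leaderboard are my assumptions, not something taken from pycaret's internals.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the juice training set.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Ridge Classifier": RidgeClassifier(),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
}

rows = []
for name, estimator in candidates.items():
    # Mean accuracy over 10 cross-validation folds for each candidate.
    scores = cross_val_score(estimator, X, y, cv=10, scoring="accuracy")
    rows.append({"Model": name, "Accuracy": scores.mean()})

# Best model first, like a leaderboard.
leaderboard = (
    pd.DataFrame(rows)
    .sort_values("Accuracy", ascending=False)
    .reset_index(drop=True)
)
print(leaderboard)
```

Seeing it written out makes it clear how much boilerplate a single compare_models call saves.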

Available Models

pd.read_csv('pycaret_models.csv').style.hide_index()

Create Model

Assuming we decide to use the Ridge Classifier, the next function is, intuitively, create_model. This will create a Ridge Classifier model with default hyperparameters.

rc = create_model('ridge')
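
The output of create_model is a per-fold score table with a mean and standard deviation at the bottom. A table of the same shape can be sketched with scikit-learn's cross_validate. This is an approximation on synthetic data, not what pycaret runs internally, and the metric names are simply copied from the table headers.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_validate

# Synthetic binary classification data standing in for the juice training set.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

scoring = {
    "Accuracy": "accuracy",
    "Recall": "recall",
    "Prec.": "precision",
    "F1": "f1",
    "Kappa": make_scorer(cohen_kappa_score),
}

cv_results = cross_validate(RidgeClassifier(), X, y, cv=10, scoring=scoring)

# cross_validate prefixes each metric with "test_"; strip that for display.
folds = pd.DataFrame({name: cv_results[f"test_{name}"] for name in scoring})

# Append mean and standard deviation rows (pandas uses the sample std, ddof=1).
summary = pd.concat([folds, folds.agg(["mean", "std"])]).round(4)
print(summary)
```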

Tune Model

Instead of create_model, which uses default parameters and so might not have optimal performance, you can call tune_model. This will automatically tune the model's hyperparameters.

ridge_tuned = tune_model('ridge')
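
For comparison, here is what a manual hyperparameter search for a ridge classifier might look like in scikit-learn. I am assuming a randomized search over the regularization strength alpha; the search range, the number of iterations and the synthetic data are all illustrative choices and not necessarily what tune_model does.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data standing in for the juice training set.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

search = RandomizedSearchCV(
    RidgeClassifier(),
    # Sample alpha on a log scale between 0.001 and 100 (an arbitrary range).
    param_distributions={"alpha": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=10,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 4))
```

After fitting, search.best_estimator_ holds the model refit on all the data with the best alpha found.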

Plot Model

Another nice feature that pycaret provides is the ability to plot different charts that can tell you how your model performs. For our ridge classifier we can plot a Precision-Recall Curve using plot_model(plot='pr'). There are many different types of plots and it would be cool to try them out.

plot_model(ridge_tuned, plot = 'pr')
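
The numbers behind a precision-recall curve can also be computed directly with scikit-learn. The sketch below uses synthetic data, and because RidgeClassifier has no predict_proba it ranks the test rows by decision_function scores instead. It shows the general technique, not pycaret's plotting code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the juice data.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RidgeClassifier().fit(X_train, y_train)

# RidgeClassifier has no predict_proba, so use its decision_function scores.
scores = model.decision_function(X_test)

# One (precision, recall) point per threshold, plus a final (1, 0) end point.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
print(f"average precision: {average_precision_score(y_test, scores):.3f}")
```

Plotting recall on the x-axis against precision on the y-axis, for example with matplotlib, gives the familiar curve.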

Predict

Finally, when we call predict_model pycaret runs the predictions and appends the labels to the original data.

unseen_predictions = predict_model(ridge_tuned, data=unseen_data)
unseen_predictions[['Purchase', 'Label']]
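
Because Purchase holds the brand names while Label holds the encoded class, the two columns cannot be compared directly. A small sketch of decoding and scoring, using only the ten rows visible in the prediction output for this cell (rows 0 to 4 and 209 to 213) and the CH: 0, MM: 1 encoding reported by setup:

```python
import pandas as pd

# The ten rows visible in the prediction output (rows 0-4 and 209-213).
shown = pd.DataFrame({
    "Purchase": ["MM", "CH", "CH", "MM", "CH", "CH", "MM", "CH", "MM", "MM"],
    "Label": [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# setup reported the encoding CH: 0, MM: 1.
decode = {0: "CH", 1: "MM"}
shown["Predicted"] = shown["Label"].map(decode)
shown["Correct"] = shown["Purchase"] == shown["Predicted"]

# Share of the ten shown rows that were predicted correctly.
print(shown["Correct"].mean())
```

This only scores the rows that happen to be displayed; applying the same two lines to the full unseen_predictions frame would give the accuracy on all 214 unseen rows.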

Conclusion

pycaret is fun to use and it is useful.

	Id	Purchase	WeekofPurchase	StoreID	PriceCH	PriceMM	DiscCH	DiscMM	SpecialMM	LoyalCH	SalePriceMM	SalePriceCH	PriceDiff	Store7	PctDiscMM	PctDiscCH	ListPriceDiff	STORE
0	1	CH	237	1	1.75	1.99	0.00	0.0	0	0.500000	1.99	1.75	0.24	No	0.000000	0.000000	0.24	1
1	2	CH	239	1	1.75	1.99	0.00	0.3	1	0.600000	1.69	1.75	-0.06	No	0.150754	0.000000	0.24	1
2	3	CH	245	1	1.86	2.09	0.17	0.0	0	0.680000	2.09	1.69	0.40	No	0.000000	0.091398	0.23	1
3	4	MM	227	1	1.69	1.69	0.00	0.0	0	0.400000	1.69	1.69	0.00	No	0.000000	0.000000	0.00	1
4	5	CH	228	7	1.69	1.69	0.00	0.0	0	0.956535	1.69	1.69	0.00	Yes	0.000000	0.000000	0.00	0

Feature	Description
Purchase	A factor with levels CH and MM indicating whether the customer purchased Citrus Hill or Minute Maid Orange Juice
WeekofPurchase	Week of purchase
StoreID	Store ID
PriceCH	Price charged for CH
PriceMM	Price charged for MM
DiscCH	Discount offered for CH
DiscMM	Discount offered for MM
SpecialCH	Indicator of special on CH
SpecialMM	Indicator of special on MM
LoyalCH	Customer brand loyalty for CH
SalePriceMM	Sale price for MM
SalePriceCH	Sale price for CH
PriceDiff	Sale price of MM less sale price of CH
Store7	A factor with levels No and Yes indicating whether the sale is at Store 7
PctDiscMM	Percentage discount for MM
PctDiscCH	Percentage discount for CH
ListPriceDiff	List price of MM less list price of CH
STORE	Which of 5 possible stores the sale occured at

	Description	Value
0	session_id	2574
1	Target Type	Binary
2	Label Encoded	CH: 0, MM: 1
3	Original Data	(856, 18)
4	Missing Values	False
5	Numeric Features	12
6	Categorical Features	5
7	Ordinal Features	False
8	High Cardinality Features	False
9	High Cardinality Method	None
10	Sampled Data	(856, 18)
11	Transformed Train Set	(599, 16)
12	Transformed Test Set	(257, 16)
13	Numeric Imputer	mean
14	Categorical Imputer	constant
15	Normalize	False
16	Normalize Method	None
17	Transformation	False
18	Transformation Method	None
19	PCA	False
20	PCA Method	None
21	PCA Components	None
22	Ignore Low Variance	False
23	Combine Rare Levels	False
24	Rare Level Threshold	None
25	Numeric Binning	False
26	Remove Outliers	False
27	Outliers Threshold	None
28	Remove Multicollinearity	False
29	Multicollinearity Threshold	None
30	Clustering	False
31	Clustering Iteration	None
32	Polynomial Features	False
33	Polynomial Degree	None
34	Trignometry Features	False
35	Polynomial Threshold	None
36	Group Features	False
37	Feature Selection	False
38	Features Selection Threshold	None
39	Feature Interaction	False
40	Feature Ratio	False
41	Interaction Threshold	None

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa
0	Extreme Gradient Boosting	0.826500	0.893900	0.771300	0.792000	0.779900	0.636900
1	Linear Discriminant Analysis	0.824800	0.904300	0.776000	0.785100	0.780100	0.634500
2	Logistic Regression	0.823100	0.901500	0.746800	0.801700	0.771100	0.627400
3	Ridge Classifier	0.819800	0.000000	0.767700	0.780500	0.773300	0.623700
4	Ada Boost Classifier	0.816500	0.884300	0.738300	0.795000	0.762900	0.613700
5	Gradient Boosting Classifier	0.816500	0.893800	0.758800	0.780100	0.767000	0.615800
6	Random Forest Classifier	0.803100	0.869800	0.705200	0.787600	0.741200	0.583200
7	CatBoost Classifier	0.803100	0.892100	0.742200	0.762700	0.747500	0.586800
8	Light Gradient Boosting Machine	0.796400	0.886400	0.721200	0.762500	0.737000	0.571600
9	Extra Trees Classifier	0.771300	0.846400	0.696700	0.726200	0.707900	0.520500
10	Naive Bayes	0.769700	0.837100	0.775800	0.690400	0.729000	0.530000
11	Decision Tree Classifier	0.769600	0.759200	0.696700	0.721200	0.706100	0.517100
12	K Neighbors Classifier	0.702700	0.740700	0.563500	0.653100	0.600400	0.366900
13	Quadratic Discriminant Analysis	0.656200	0.699000	0.599200	0.594300	0.565100	0.292900
14	SVM - Linear Kernel	0.535700	0.000000	0.616700	0.375700	0.420200	0.104500

Estimator	Abbreviated String	Original Implementation
Logistic Regression	lr	linear_model.LogisticRegression
K Nearest Neighbour	knn	neighbors.KNeighborsClassifier
Naives Bayes	nb	naive_bayes.GaussianNB
Decision Tree	dt	tree.DecisionTreeClassifier
SVM (Linear)	svm	linear_model.SGDClassifier
SVM (RBF)	rbfsvm	svm.SVC
Gaussian Process	gpc	gaussian_process.GPC
Multi Level Perceptron	mlp	neural_network.MLPClassifier
Ridge Classifier	ridge	linear_model.RidgeClassifier
Random Forest	rf	ensemble.RandomForestClassifier
Quadratic Disc. Analysis	qda	discriminant_analysis.QDA
AdaBoost	ada	ensemble.AdaBoostClassifier
Gradient Boosting	gbc	ensemble.GradientBoostingClassifier
Linear Disc. Analysis	lda	discriminant_analysis.LDA
Extra Trees Classifier	et	ensemble.ExtraTreesClassifier
Extreme Gradient Boosting	xgboost	xgboost.readthedocs.io
Light Gradient Boosting	lightgbm	github.com/microsoft/LightGBM
CatBoost Classifier	catboost	https://catboost.ai

	Accuracy	Recall	Prec.	F1	Kappa
0	0.8000	0.7500	0.7500	0.7500	0.5833
1	0.8000	0.7500	0.7500	0.7500	0.5833
2	0.8667	0.8333	0.8333	0.8333	0.7222
3	0.7833	0.6667	0.7619	0.7111	0.5390
4	0.8500	0.8333	0.8000	0.8163	0.6897
5	0.8167	0.7083	0.8095	0.7556	0.6099
6	0.8833	0.8750	0.8400	0.8571	0.7586
7	0.7500	0.6667	0.6957	0.6809	0.4755
8	0.7833	0.7600	0.7308	0.7451	0.5568
9	0.8644	0.8333	0.8333	0.8333	0.7190
Mean	0.8198	0.7677	0.7805	0.7733	0.6237
SD	0.0418	0.0700	0.0471	0.0554	0.0885

	Purchase	Label
0	MM	1
1	CH	0
2	CH	0
3	MM	1
4	CH	0
...	...	...
209	CH	0
210	MM	1
211	CH	0
212	MM	1
213	MM	0