Creating a Dataset for Classification
In this article we learn how to create datasets for machine learning classification
Machine learning algorithms are fairly easy to use if you have data in exactly the shape the algorithms expect. To get going on an ML project before you have real data, you can generate data in that exact shape. For example, if you want to do classification, you will need a target column with the classes to predict, and a dataset of features that could predict those classes.
If your machine learning project is in Python, then the best way to start is with scikit-learn. This easy-to-use yet powerful library also has convenience functions to generate test data, one of which is make_classification.
make_classification
Scikit-learn has a utility function called make_classification to generate test data for classification. With it you can generate a numpy array of features along with another array of target classes. The function lives in the datasets package, so to use it you would do
from sklearn.datasets import make_classification
data, target = make_classification(...)
and you will get the data and the target with enough of a relationship between the two to do some machine learning. Here is an example.
First, set the random state
random_state=2
from sklearn.datasets import make_classification
data, target = make_classification(n_features=12, n_samples=100, random_state=random_state)
The data array is a numpy array of shape (n_samples, n_features)
data[:2], data.shape
The target is a numpy array of shape (n_samples,). The values will be 0 or 1 because by default n_classes is 2.
target, target.shape
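By default the target is binary. If you want more classes you can pass n_classes, though scikit-learn requires n_classes * n_clusters_per_class (which defaults to 2) to be at most 2 ** n_informative, so you may need to raise n_informative as well. A minimal sketch with three classes:
# illustrative only; data3/target3 are just throwaway names
data3, target3 = make_classification(n_features=12, n_samples=100, n_classes=3,
                                     n_informative=3, random_state=random_state)
set(target3)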
With data and target in hand we can train a classifier, for example a random forest.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(data, target)
Predicting an output label
After training we predict by passing in an array of samples. For simplicity we just pick one of the training rows and get back a predicted label. This is not groundbreaking machine learning, but it shows how quickly you can get a dataset to try different machine learning algorithms on.
clf.predict([data[90]])
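As a quick sanity check you can also score the classifier on the data it was trained on. This is not a real evaluation (there is no held-out test set), just a way to confirm the generated features carry some signal:
clf.score(data, target)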
Create a classification dataframe
The downside of the make_classification function is that it creates numpy arrays without meaningful feature names. The bigger problem is that some features are informative, others are redundant, and the rest are plain noise, with no indication of which is which. To improve on this you can create dataframes, which allow meaningful names that help with analysis and explainability.
from sklearn.datasets import make_classification
import pandas as pd
from datetime import datetime
def make_dataset(n_samples=1000, n_features=10, n_informative=6, n_redundant=2,
                 n_classes=2, weights=None, random_state=2):
    # shuffle=False keeps the columns in the order informative, redundant, noise,
    # which lets us give them meaningful names below
    data, target = make_classification(n_features=n_features,
                                       n_informative=n_informative,
                                       n_redundant=n_redundant,
                                       n_samples=n_samples,
                                       n_classes=n_classes,
                                       weights=weights,
                                       random_state=random_state,
                                       shuffle=False)
    # a business-day index ending today, to mimic time-stamped data
    index = pd.date_range(periods=n_samples, freq=pd.tseries.offsets.BDay(),
                          end=datetime.today()).normalize()
    # name each column after its role so they are easy to tell apart later
    columns = [f'Info{i}' for i in range(n_informative)] + \
              [f'Redun{i}' for i in range(n_redundant)] + \
              [f'Noise{i}' for i in range(n_features - (n_informative + n_redundant))]
    df = pd.DataFrame(data, columns=columns, index=index).round(3)
    target = pd.Series(target, index=index)
    return df, target
data, target = make_dataset(1000, n_features=8, n_informative=4,
n_classes=3, random_state=random_state)
The features
The generated dataframe contains 8 columns of three kinds:
- Informative features: features that have a predictive relationship with the target
- Redundant features: generated as random linear combinations of the informative features
- Noise: just noise, with no predictive power
data
The class distribution of the target:
pd.DataFrame(target, columns=['Target']).Target.value_counts().to_frame().sort_index()
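With weights=None (the default) the classes come out roughly balanced. The weights argument, which make_dataset passes straight through to make_classification, controls the class proportions, so an imbalanced variant of the same dataset could be sketched like this:
# illustrative only; weights gives the approximate share of samples per class
data_imb, target_imb = make_dataset(1000, n_features=8, n_informative=4,
                                    n_classes=3, weights=[0.6, 0.3, 0.1],
                                    random_state=random_state)
target_imb.value_counts().sort_index()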
Redundant Variables
If we plot a redundant variable against an informative variable we can see the linear relationship. Redundant features can safely be dropped from the input to a machine learning model, or otherwise handled in a special way. Of course, with real empirical data you would not necessarily know this beforehand but would learn it during data exploration.
import altair as alt
alt.Chart(data).mark_circle().encode(
x='Info0',
y='Redun1'
).properties(
title='Informative vs Redundant Variables'
)
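One way to act on this, sketched here assuming the Info/Redun/Noise column naming used by make_dataset above, is to drop the redundant columns before fitting a model:
# keep everything except the Redun* columns
features = data.drop(columns=[c for c in data.columns if c.startswith('Redun')])
features.columns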
Informative Variables
The relationship between the informative variables and the target is trickier to display on a chart because the target is categorical, but if we plot them we can still see a relationship between these variables and the target.
To make it easier to see the relationship we bin the informative variables and set the size of the marker to the count in each bin.
df = data.copy()
df['Target'] = target
alt.Chart(df).mark_circle().encode(
alt.X('Info0', bin=True),
alt.Y('Target', bin=True),
size='count()'
).properties(
    title='Info0 vs Target'
)
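A quick numeric check of the same relationship (again just a sketch using the df defined above) is to compare per-class means: the informative column should show clearly different means across the classes, while the noise column should not.
df.groupby('Target')[['Info0', 'Noise0']].mean().round(2)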
For comparison, the same binned plot for one of the noise variables shows no clear relationship with the target.
alt.Chart(df).mark_circle().encode(
    alt.X('Noise0', bin=True),
    alt.Y('Target', bin=True),
    size='count()'
).properties(
    title='Noise vs Target'
)
Feature Importance
Finally, we fit a random forest on the full dataframe and plot the importance it assigns to each feature.
clf = RandomForestClassifier(min_samples_split=4)
clf.fit(data, target)
feature_importance = pd.DataFrame({'importance': clf.feature_importances_,
                                   'feature': data.columns}).round(2)
alt.Chart(feature_importance).mark_bar().encode(
y='feature',
x='importance:Q'
).properties(
title='Feature Importance',
height=240
)
As expected, the noise variables have the least importance.
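If you want to act on the importances automatically, scikit-learn's SelectFromModel can keep only the features whose importance is above a threshold. A minimal sketch using the forest fitted above:
from sklearn.feature_selection import SelectFromModel

# prefit=True reuses the already-fitted forest; the default threshold is the mean importance
selector = SelectFromModel(clf, prefit=True)
data.columns[selector.get_support()]
With the default (mean) threshold, the noise columns should be among the ones dropped.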