Machine learning algorithms are fairly easy to use if you have data in exactly the shape the algorithms expect. To get going with an ML project before you have real data, you can generate generic data in that exact shape. For example, if you want to do classification, you will need a target column with the classes to predict, and you will need a dataset with features that could predict those classes.

If your machine learning project is in Python, then the best way to start is with scikit-learn. This easy-to-use yet powerful library also has convenience functions for generating test data, one of which is called make_classification.

make_classification

Scikit-learn has a utility function to generate test data for classification called make_classification. With it you can generate a numpy array of features along with another array containing the classes to predict. This function is in the datasets package, so to use it you would do

from sklearn.datasets import make_classification

data, target = make_classification(...)

and you will get the data and the target, with enough of a relationship between the two to do some machine learning. Here is an example

First, set the random state

from sklearn.datasets import make_classification

random_state = 2
data, target = make_classification(n_features=12, n_samples=100, random_state=random_state)

The data array is a numpy array of shape (n_samples, n_features)

data[:2], data.shape
(array([[ 0.65755125, -0.73564052, -0.25712497,  2.16246241, -0.46323032,
          0.50442818, -0.1369783 , -2.42825346, -0.49282081, -0.64920516,
          0.27511225, -0.45730883],
        [ 0.54894656, -0.07663956, -0.08224538, -0.15972413,  1.70937948,
         -1.82138864, -0.30466658, -2.02559359,  1.93662278, -1.31756727,
         -1.25432739, -1.71406741]]),
 (100, 12))

The target is a numpy array of shape (n_samples,). The values will be 0 or 1 because by default n_classes is 2

target, target.shape
(array([0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1,
        0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
        0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0,
        0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0,
        1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]),
 (100,))
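
By default make_classification generates two classes, but n_classes is a parameter you can change. As a quick sketch (n_classes and n_informative are real make_classification parameters; numpy's bincount is just one way to count the labels):

import numpy as np
from sklearn.datasets import make_classification

# three classes instead of the default two; n_informative has to be large
# enough to carve out that many class clusters
data3, target3 = make_classification(n_features=12, n_samples=100,
                                     n_classes=3, n_informative=4,
                                     random_state=random_state)
np.bincount(target3)  # counts per class, roughly equal by default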

Training a RandomForestClassifier

Now that we have the data, we can train a classifier and use it to predict a label.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(data, target)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=2, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

Predicting an output label

After training we predict by passing in a data array. For simplicity we just choose one of the rows we trained on and get back a predicted label. This is not groundbreaking machine learning, but it shows how quickly you can get a dataset to try different machine learning algorithms on.

clf.predict([data[90]])
array([1])
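
Predicting on a row we trained on is only a smoke test. If you want a rough measure of how well the classifier generalises, one option (not part of the original snippet, just a hedged sketch) is to hold out a test set with train_test_split and use the classifier's score method:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# hold out 25% of the generated rows so accuracy is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.25, random_state=0)

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)  # mean accuracy on the held-out rows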

Create a classification dataframe

The downside of the make_classification function is that it creates numpy arrays without meaningful feature names. The bigger problem is that the features differ in kind: some are informative, some are redundant, and some are plain noise, with no indicator of which is which. To improve on this you can wrap the output in dataframes, which allow meaningful column names that help with analysis and explainability.

from sklearn.datasets import make_classification
import pandas as pd
from datetime import datetime

def make_dataset(n_samples=1000, n_features=10, n_informative=6, n_redundant=2, 
                 n_classes=2, weights=None, random_state=2):
    data, target = make_classification(n_features=n_features,
                                     n_informative=n_informative, 
                                     n_redundant=n_redundant, 
                                     n_samples=n_samples, 
                                     n_classes=n_classes,
                                     weights=weights,
                                     random_state=random_state,
                                     shuffle=False)
    index = pd.date_range(periods=n_samples, freq=pd.tseries.offsets.BDay(), 
                          end=datetime.today()).normalize()
    columns = [f'Info{i}'  for i in range(n_informative)] + \
              [f'Redun{i}' for i in range(n_redundant)] + \
              [f'Noise{i}' for i in range(n_features - (n_informative + n_redundant))]
    df = pd.DataFrame(data, columns=columns, index=index).round(3)
    target = pd.Series(target, index=index)
    return df, target
    
data, target = make_dataset(1000, n_features=8, n_informative=4, 
                            n_classes=3, random_state=random_state)

The features

The generated dataframe contains 8 columns.

  • Informative features: features that have a predictive relationship with the target
  • Redundant features: generated as random linear combinations of the informative features
  • Noise features: just noise, with no predictive power
data
Info0 Info1 Info2 Info3 Redun0 Redun1 Noise0 Noise1
2016-06-20 1.744 1.871 -1.446 -1.364 1.321 0.481 1.188 1.070
2016-06-21 0.632 0.028 -0.756 -1.235 -0.475 1.227 0.005 -0.076
2016-06-22 0.879 0.957 -1.149 -0.020 0.880 0.246 0.282 0.761
2016-06-23 1.673 0.602 -1.594 -1.734 0.159 1.664 2.256 0.028
2016-06-24 0.385 0.067 -1.948 0.851 0.025 1.189 1.010 0.528
... ... ... ... ... ... ... ... ...
2020-04-13 -2.862 -0.170 0.767 -1.832 -3.387 1.120 -2.979 0.188
2020-04-14 -3.282 -1.113 0.751 -0.832 -3.789 1.281 -1.391 0.114
2020-04-15 -2.720 -1.133 0.383 0.557 -2.653 0.716 1.561 -0.633
2020-04-16 -0.499 0.354 0.266 0.690 0.304 -0.744 1.183 -1.057
2020-04-17 2.431 0.276 0.548 2.319 3.936 -2.577 0.539 1.430

1000 rows × 8 columns

The target

The target variable contains the values 0, 1 and 2 - three classes, since we specified three classes in the make_dataset function call. These are roughly evenly distributed, though we could have specified a different distribution of values.

pd.DataFrame(target, columns=['Target']).Target.value_counts().to_frame().sort_index()
Target
0 334
1 333
2 333
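
The weights argument of make_dataset is passed straight through to make_classification, so an imbalanced target is easy to generate. A minimal sketch, reusing the function defined above (the 70/20/10 split is just an illustrative choice):

# roughly 70% / 20% / 10% of the samples in classes 0, 1 and 2
data_imb, target_imb = make_dataset(1000, n_features=8, n_informative=4,
                                    n_classes=3, weights=[0.7, 0.2, 0.1],
                                    random_state=random_state)
target_imb.value_counts().sort_index()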

Redundant Variables

If we plot the redundant variables against the informative variables, we can see the linear relationship between them. The redundant variables can safely be dropped from the input features to a machine learning model, or otherwise handled in a special way. Of course, with real empirical data you would not necessarily know this beforehand but would learn it during data exploration.

import altair as alt

alt.Chart(data).mark_circle().encode(
    x='Info0',
    y='Redun1'
).properties(
    title='Informative vs Redundant Variables'
)
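
Beyond eyeballing the scatter plot, you can confirm the linear relationship numerically. A minimal sketch, assuming that regressing a redundant column on the informative columns should give an R² very close to 1 (the rounding to three decimals in make_dataset keeps it from being exactly 1):

from sklearn.linear_model import LinearRegression

# fit Redun1 as a linear function of the informative columns
info_cols = [c for c in data.columns if c.startswith('Info')]
reg = LinearRegression().fit(data[info_cols], data['Redun1'])
reg.score(data[info_cols], data['Redun1'])  # R², expected to be close to 1.0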

Informative Variables

The relationship between the informative variables and the target is trickier to display on a chart because the target variable is categorical, but plotting still reveals a relationship between these variables and the target.

To make the relationship easier to see, we bin the informative variables and set the size of the marker to the count in each bin.

df = data.copy()
df['Target'] = target

Plot Informative Variables vs Target

alt.Chart(df).mark_circle().encode(
    alt.X('Info0', bin=True),
    alt.Y('Target', bin=True),
    size='count()'
).properties(
    title='Info0 vs Target'
)

Noise

Noise seems, well, random

alt.Chart(df).mark_circle().encode(
    alt.X('Noise0', bin=True),
    alt.Y('Target', bin=True),
    size='count()'
).properties(
    title='Noise vs Target'
)
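
Another quick check, not in the original notebook, is to compare per-class means: the informative columns should separate the classes, while the noise columns should barely move.

# mean of an informative column and the noise columns within each target class
df.groupby('Target')[['Info0', 'Noise0', 'Noise1']].mean().round(3)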

Feature Importance

To see the importance of each feature, we train a RandomForestClassifier and then view or plot the feature importance.

clf = RandomForestClassifier(min_samples_split=4)
clf.fit(data, target)
feature_importance = pd.DataFrame({'importance': clf.feature_importances_,
                                   'feature': data.columns}).round(2)

alt.Chart(feature_importance).mark_bar().encode(
    y='feature', 
    x='importance:Q'
).properties(
    title='Feature Importance',
    height=240
)
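
If you prefer a table to a chart, sorting the same dataframe makes the ranking explicit:

feature_importance.sort_values('importance', ascending=False)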

As expected, the noise variables have the least importance.