Google Colab and Auto-sklearn with Profiling

This article is a follow-up to my previous tutorial on how to set up Google Colab and auto-sklearn. Here, I go into more detail and show auto-sklearn's performance on an artificially created dataset. The full notebook gist can be found here.

First, I generated a regression dataset using scikit-learn.

from sklearn.datasets import make_regression

# 1000 samples, 100 features, only the first 5 informative, zero noise
X, y, coeff = make_regression(
    n_samples=1000,
    n_features=100,
    n_informative=5,
    noise=0,
    shuffle=False,
    coef=True
)

Subset of 100 generated features

This generates a dataset with 100 numerical features, of which the first 5 are informative (labeled “feat_0” to “feat_4”). The rest (“feat_5” to “feat_99”) are random noise. We can see this in the scatter matrix above, where only the first 5 features show a correlation with the label.
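A sketch of how the data can be assembled and such a plot produced (the DataFrame construction, the selected columns, and the figure size are my assumptions, not code from the original notebook):

import pandas as pd

# Name the columns feat_0 ... feat_99 and attach the label
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(X.shape[1])])
df["label"] = y

# Scatter matrix of a few informative and noise features against the label
pd.plotting.scatter_matrix(
    df[["feat_0", "feat_1", "feat_4", "feat_5", "label"]],
    figsize=(12, 12)
)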

We know that this is a simple regression problem which plain linear regression could solve perfectly. Knowing what to expect helps us verify the performance of auto-sklearn.
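As a quick sanity check before bringing in AutoML, a minimal baseline sketch (my own addition, not part of the original notebook):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# With noise=0, an ordinary least-squares fit recovers the target,
# so R² should come out at essentially 1.0
baseline = LinearRegression().fit(X, y)
print(r2_score(y, baseline.predict(X)))

Auto-sklearn, by contrast, trains its ensemble model as follows: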

import autosklearn.regression

# Search for 5 minutes, using all available cores
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=300,
    n_jobs=-1
)
automl.fit(
    X_train_transformed,
    df_train["label"]
)
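The df_train and X_train_transformed used above come from a train/test split and an encoding step. The split itself is not shown in the original; a minimal sketch, assuming the df DataFrame from earlier and an 80/20 split (ratio and seed are my assumptions):

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for evaluation
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)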

I also created random categorical features, which are then one-hot encoded into a feature set “X_train_transformed”. A minimal sketch of that encoding step, assuming hypothetical categorical columns “cat_0” and “cat_1” have been added to df_train:
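import pandas as pd

# Concatenate the numeric features with the one-hot encoded categoricals
# (the categorical column names here are assumptions)
X_train_transformed = pd.concat(
    [
        df_train.filter(like="feat_"),
        pd.get_dummies(df_train[["cat_0", "cat_1"]]),
    ],
    axis=1,
)

Running the AutoSklearnRegressor for 5 minutes (time_left_for_this_task=300) produced the following expected results: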

# R² on the training set
predictions = automl.predict(X_train_transformed)
r2_score(df_train["label"], predictions)
>> 0.999

# R² on the held-out test set
predictions = automl.predict(X_test_transformed)
r2_score(df_test["label"], predictions)
>> 0.999

A separate pip package, PipelineProfiler, helps us visualize the steps auto-sklearn took to achieve this result:
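It installs with pip (!pip install pipelineprofiler in Colab) and takes the fitted estimator directly; the function names below follow the PipelineProfiler README, so treat them as assumptions if your version differs:

import PipelineProfiler

# Import the search history from the fitted AutoSklearnRegressor
# and render the interactive pipeline matrix in the notebook
profiler_data = PipelineProfiler.import_autosklearn(automl)
PipelineProfiler.plot_pipeline_matrix(profiler_data)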

PipelineProfiler output

Above we can see the attempts auto-sklearn made to generate the best ensemble of models within the 5-minute constraint I set. The best model found was Liblinear SVM, which produced an R² of nearly 1.0. As a result, this toy ensemble gives a weight of 1.0 to just one algorithm. Libsvm SVR and gradient boosting scored between 0.9 and 0.96.
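The ensemble can also be inspected without a visualizer, straight from the fitted object, using two methods auto-sklearn provides:

# Summary of the search (runs, successes, timeouts) and the final ensemble
print(automl.sprint_statistics())
print(automl.show_models())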