Scikit-learn Pipeline with Feature Engineering
Summary
In general, a machine learning pipeline should have the following characteristics:
- To ensure data consistency, the pipeline should include every step (such as feature engineering) needed to train the model, score the training and testing datasets, and score real-time requests. One-off steps, such as removing duplicates, do not need to be part of the pipeline.
- Numerical features are transformed using scikit-learn classes. SimpleImputer is used to fill missing values and StandardScaler for scaling.
- Categorical columns are transformed similarly: OneHotEncoder is applied to the columns containing categorical values. Importantly, I like to define the categories argument explicitly, which caps the number of encoded columns and guards against the curse of dimensionality that can arise when too many categories are present.
- An example custom feature engineering class DailyTrendFeature is included in the pipeline for illustration.
- The pipeline allows for parallel preprocessing subject to the limits of the computing environment. For example, the preprocessing of categorical and numerical features can take place in parallel because the transformation steps are independent of each other. This is accomplished using scikit-learn's FeatureUnion(n_jobs=-1, ...) class that combines other pipeline steps.
- Notebook as GitHub Gist
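
The preprocessing steps summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the notebook. The ColumnSelector helper, the build_preprocessor function, the column names, and the demo data are all invented here. DailyTrendFeature is only a hypothetical stand-in (days elapsed since the earliest date seen during fit), because this summary does not show the original class's logic.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: keep only the named DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


class DailyTrendFeature(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: days elapsed since the earliest date seen in fit."""

    def __init__(self, date_column):
        self.date_column = date_column

    def fit(self, X, y=None):
        # Remember the first date so scoring uses the same reference point.
        self.start_ = pd.to_datetime(X[self.date_column]).min()
        return self

    def transform(self, X):
        days = (pd.to_datetime(X[self.date_column]) - self.start_).dt.days
        return days.to_numpy(dtype=float).reshape(-1, 1)


def build_preprocessor(numeric_cols, categorical_cols, categories, date_col, n_jobs=-1):
    """Combine independent preprocessing branches with FeatureUnion."""
    numeric = Pipeline([
        ("select", ColumnSelector(numeric_cols)),
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("select", ColumnSelector(categorical_cols)),
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Fixed categories cap the output width; values outside the list
        # are encoded as all zeros because of handle_unknown="ignore".
        ("encode", OneHotEncoder(categories=categories, handle_unknown="ignore")),
    ])
    return FeatureUnion(
        [
            ("numeric", numeric),
            ("categorical", categorical),
            ("trend", DailyTrendFeature(date_col)),
        ],
        n_jobs=n_jobs,  # -1 asks joblib to use all available cores
    )


# Invented demo data: one missing price and one colour outside the fixed list.
demo = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-08"],
    "price": [10.0, np.nan, 14.0, 12.0],
    "quantity": [1.0, 2.0, 3.0, 4.0],
    "color": ["red", "green", "blue", "red"],
})

preprocessor = build_preprocessor(
    numeric_cols=["price", "quantity"],
    categorical_cols=["color"],
    categories=[["red", "green"]],
    date_col="date",
    n_jobs=1,  # the demo frame is tiny; the default of -1 mirrors the text
)
features = preprocessor.fit_transform(demo)
# FeatureUnion returns a sparse matrix when any branch is sparse.
dense = features.toarray() if hasattr(features, "toarray") else np.asarray(features)
print(dense.shape)
```

FeatureUnion hands the full input to every branch, so each branch in this sketch selects its own columns first. scikit-learn's ColumnTransformer is another way to route columns to transformers and may be the simpler choice when no branch needs the whole frame.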
Notebook