<p>Notes by Adam Novotny, <a href="https://adamnovotny.com">adamnovotny.com</a></p>
<h2>Deploying Language Models With Gradio On Hugging Face (2023-10-14)</h2>
<p>Machine learning models (including language models) can be deployed easily using the generous <a href="https://huggingface.co/pricing#spaces">free tier on Hugging Face</a> and <a href="https://www.gradio.app/guides/quickstart">Gradio</a>, a Python-based open-source UI library, by following these steps.</p>
<p>See the live deployed app and source code <a href="https://huggingface.co/spaces/AdamNovotnyCom/llama2-gradio-huggingface">here</a>.</p>
<ol>
<li>
<p>For local development, create the <a href="https://huggingface.co/spaces/AdamNovotnyCom/llama2-gradio-huggingface/blob/main/Dockerfile_dev">following Dockerfile</a>. It differs from the production Dockerfile in how secrets are loaded and in its use of <pre>CMD ["gradio", "app.py"]</pre> which reruns (and reloads) the source files every time a change is detected.</p>
</li>
<li>
<p><a href="https://huggingface.co/spaces/AdamNovotnyCom/llama2-gradio-huggingface/blob/main/docker-compose.yml">docker-compose</a> will launch the development Dockerfile using command <pre>export HF_TOKEN=paste_HF_token && docker-compose -f docker-compose.yml up gradiohf</pre> where HF_TOKEN is an optional personal token provided by Hugging Face to ensure that license restrictions are being followed for certain models (such as Llama 2).</p>
</li>
<li>
<p>Develop your Gradio <a href="https://huggingface.co/spaces/AdamNovotnyCom/llama2-gradio-huggingface/blob/main/app.py">app.py</a>. This deployed example represents the smallest possible version that selects a language model based on the environment variable <strong>os.environ.get("MODEL")</strong>. The selection includes Llama 2, which requires a paid Spaces plan to run on Hugging Face (with no code changes!). The live example runs a small <em>toy</em> model, <a href="https://huggingface.co/google/flan-t5-small">google/flan-t5-small</a>, that easily runs on the free tier.</p>
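<p>For orientation, a minimal sketch of such an app.py might look like the following. This is not the deployed file itself; it assumes the Hugging Face <i>transformers</i> pipeline API and reuses the MODEL environment variable pattern described above:</p>
<pre>
import os

import gradio as gr
from transformers import pipeline

# "google/flan-t5-small" is the toy model mentioned above; override via the MODEL env var
model_name = os.environ.get("MODEL", "google/flan-t5-small")
generator = pipeline("text2text-generation", model=model_name)

def respond(prompt):
    # return only the generated text of the first candidate
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

demo = gr.Interface(fn=respond, inputs="text", outputs="text")

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
</pre>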
</li>
<li>
<p>View your Gradio app running locally in a browser at <pre>http://0.0.0.0:7860</pre></p>
</li>
<li>
<p>Create production <a href="https://huggingface.co/spaces/AdamNovotnyCom/llama2-gradio-huggingface/blob/main/Dockerfile">Dockerfile</a> and deploy on Hugging Face Spaces using this <a href="https://huggingface.co/docs/hub/spaces-sdks-docker">great documentation</a>.</p>
</li>
</ol>
<h4>Example of Gradio UI deployed on Hugging Face</h4>
<p><a href="/theme/images/deploying-language-models-with-gradio-on-huggingface-overview.png"><img src="/theme/images/deploying-language-models-with-gradio-on-huggingface-overview.png" alt="Normal equation" style="width: 100%" loading="lazy"></a></p>Machine Learning Notes2022-05-07T00:00:00-05:002022-05-07T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2022-05-07:/blog/machine-learning-notes.html<p><iframe
title="ML notes notebook"
width="100%"
height="5000px"
src="/notebooks/ml_notes.html">
</iframe>
</p>
<h4>Contents</h4>
<ul>
<li><a href="#algorithms">Algorithms</a></li>
<li><a href="#bayes">Bayes</a></li>
<li><a href="#explainability">Explainability</a></li>
<li><a href="#mlops">MLOps</a></li>
<li><a href="#model_evaluation">Model Evaluation</a></li>
<li><a href="#preprocessing">Preprocessing</a></li>
<li><a href="#reinforcement_learning">Reinforcement Learning</a></li>
<li><a href="#sql">SQL</a></li>
<li><a href="#statistics">Statistics</a></li>
</ul>
<h4>Algorithms <span id="algorithms"></span></h4>
<ul>
<li>K-means: aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion. Use the <a href="https://www.scikit-yb.org/en/latest/api/cluster/elbow.html">“elbow” method</a> to identify the right number of means. <a href="https://scikit-learn.org/stable/modules/clustering.html#k-means">scikit tutorial</a></li>
<li>KNN: Simple, flexible, naturally handles multiple classes. Slow at scale, sensitive …</li></ul><p><iframe
title="ML notes notebook"
width="100%"
height="5000px"
src="/notebooks/ml_notes.html">
</iframe>
</p>
<h4>Contents</h4>
<ul>
<li><a href="#algorithms">Algorithms</a></li>
<li><a href="#bayes">Bayes</a></li>
<li><a href="#explainability">Explainability</a></li>
<li><a href="#mlops">MLOps</a></li>
<li><a href="#model_evaluation">Model Evaluation</a></li>
<li><a href="#preprocessing">Preprocessing</a></li>
<li><a href="#reinforcement_learning">Reinforcement Learning</a></li>
<li><a href="#sql">SQL</a></li>
<li><a href="#statistics">Statistics</a></li>
</ul>
<h4>Algorithms <span id="algorithms"></span></h4>
<ul>
<li>K-means: aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion. Use the <a href="https://www.scikit-yb.org/en/latest/api/cluster/elbow.html">“elbow” method</a> to identify the right number of clusters (k). <a href="https://scikit-learn.org/stable/modules/clustering.html#k-means">scikit tutorial</a></li>
<li>KNN: Simple, flexible, naturally handles multiple classes. Slow at scale, sensitive to feature scaling and irrelevant features. <a href="https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification">scikit tutorial</a></li>
<li>Linear Discriminant Analysis (LDA): A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix. <a href="https://scikit-learn.org/stable/modules/lda_qda.html#mathematical-formulation-of-the-lda-and-qda-classifiers">scikit tutorial</a></li>
<li>Linear regression<ul>
<li>assumptions (LINE) <a href="https://online.stat.psu.edu/stat500/lesson/9/9.2/9.2.3#paragraph--3265">source</a><ul>
<li>Linearity</li>
<li>Independence of errors</li>
<li>Normality of errors</li>
<li>Equal variances</li>
<li>Tests of assumptions: i) plot each feature on x-axis vs y_error, ii) plot y_predicted on x-axis vs y_error, iii) histogram of errors.</li>
</ul>
</li>
<li>An overspecified model can be used to predict the label, but should not be used to ascribe the effect of a single feature on the label.</li>
<li><a href="http://cecas.clemson.edu/~ahoover/ece854/lecture-notes/lecture-normeqs.pdf">Linear algebra solution</a><a href="/theme/images/1*i0ylsCBDeVY5rFlGa9AYWg.png.png"><img src="/theme/images/1*i0ylsCBDeVY5rFlGa9AYWg.png.png" alt="Normal equation" style="width: 100%" loading="lazy"></a></li>
</ul>
</li>
<li>Naive Bayes: uses naive conditional independence assumption of features. <a href="https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes">scikit</a></li>
<li>PCA: transform data using k vectors that minimize the perpendicular distance to points. PCA can also be thought of as an <a href="https://online.stat.psu.edu/stat505/lesson/11/11.2">eigenvalue/eigenvector decomposition</a>. <a href="https://scikit-learn.org/stable/modules/decomposition.html#pca">scikit</a>. <a href="https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf">Intuition paper</a></li>
<li>Pearson’s correlation coefficient. <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">wiki</a>. <a href="/theme/images/1*qtdPV-XQhTYACKS7beLDpg.jpeg.png"><img src="/theme/images/1*qtdPV-XQhTYACKS7beLDpg.jpeg.png" alt="Correlation formula" style="width: 100%" loading="lazy"></a></li>
<li>Random Forests: each tree is built using a sample of rows (with replacement) from the training set. Less prone to overfitting than a single decision tree. <a href="https://scikit-learn.org/stable/modules/ensemble.html#random-forests">scikit</a></li>
<li>RNN: <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">Karpathy tutorial</a></li>
<li>Sorting <a href="https://lamfo-unb.github.io/2019/04/21/Sorting-algorithms">tutorial</a>. <a href="/theme/images/rO1H18bCodMa.png"><img src="/theme/images/rO1H18bCodMa.png" alt="Sorting algorithms comparison" style="width: 100%" loading="lazy"></a></li>
<li>Stochastic gradient descent <a href="https://realpython.com/gradient-descent-algorithm-python/#basic-gradient-descent-algorithm">tutorial</a>. Calculus solution: <a href="/theme/images/1*_6C1R-IamnPtIo0jLOoblw.png.png"><img src="/theme/images/1*_6C1R-IamnPtIo0jLOoblw.png.png" alt="Stochastic gradient descent cost function" style="width: 100%" loading="lazy"></a></li>
<li>SVD: Singular Value Decomposition <a href="https://towardsdatascience.com/svd-8c2f72e264f"> intuition with PCA use case</a></li>
<li>SVM: Effective in high dimensional spaces (or when number of dimensions > number of examples). SVMs do not directly provide probability estimates. <a href="https://scikit-learn.org/stable/modules/svm.html#svm-classification">scikit</a></li>
<li>Transformers <a href="https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/">tutorial</a><a href="/theme/images/1232021073114943.png"><img src="/theme/images/1232021073114943.png" alt="Original transformer architecture" style="width: 100%" loading="lazy"></a></li>
</ul>
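<p>To make the normal-equation item above concrete, here is a small NumPy sketch (synthetic data, not taken from the linked notes) that solves for the ordinary least squares coefficients directly:</p>
<pre>
# Normal equation: beta = (X^T X)^{-1} X^T y, solved with a linear solver
# rather than an explicit matrix inverse for numerical stability.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 samples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

X1 = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)  # normal equation
print(beta)                                  # approx [0.0, 2.0, -1.0, 0.5]
</pre>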
<h4>Bayes <span id="bayes"></span></h4>
<ul>
<li><a href="https://www.nature.com/articles/s43586-020-00001-2">Nature article overview</a></li>
<li><a href="https://towardsdatascience.com/bayesian-a-b-testing-in-pymc3-54dceb87af74">Bayesian A/B Testing in PyMC3</a></li>
<li><a href="https://towardsdatascience.com/bayesian-inference-intuition-and-example-148fd8fb95d6">Inference — Intuition and Example (Beta & Binomial)</a></li>
</ul>
<h4>Explainability <span id="explainability"></span></h4>
<ul>
<li>Books: <a href="https://christophm.github.io/interpretable-ml-book/">Interpretable Machine Learning</a></li>
<li>Tutorials: <a href="https://www.twosigma.com/articles/interpretability-methods-in-machine-learning-a-brief-survey/">twosigma: a brief survey</a></li>
<li>EthicalML tools <a href="https://github.com/EthicalML/awesome-production-machine-learning#explaining-black-box-models-and-datasets">EthicalML github</a></li>
<li>Partial dependence plots (PDP): x-axis = value of a single feature, y-axis = average model prediction, marginalizing over the other features. <a href="https://scikit-learn.org/stable/modules/partial_dependence.html#partial-dependence-plots">scikit</a></li>
<li>Individual conditional expectation (ICE): x-axis = value of a single feature, y-axis = model prediction for an individual sample (one line per sample). <a href="https://scikit-learn.org/stable/modules/partial_dependence.html#individual-conditional-expectation-ice-plot">scikit</a></li>
<li>Permutation feature importance: randomly shuffle each feature and measure the impact on a model metric such as F1 (a scikit-learn sketch follows this list). <a href="https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-feature-importance">scikit</a></li>
<li>Global surrogate: train an easily interpretable model (such as linear regression) on the predictions made by a black-box model.</li>
<li>Local surrogate: LIME (Local Interpretable Model-agnostic Explanations). Fit a simple interpretable model to perturbed copies of an individual example to approximate the black-box model's behavior around that one prediction.</li>
<li>Shapley Value (SHAP): the contribution of each feature is measured by adding and removing it across all feature subsets. The Shapley value for one feature is the weighted sum of all its contributions.</li>
</ul>
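<p>A small scikit-learn sketch of the permutation feature importance item above (toy data; the model and metric are arbitrary choices):</p>
<pre>
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# shuffle each feature 10 times and measure the average drop in F1
result = permutation_importance(model, X_test, y_test, n_repeats=10, scoring="f1", random_state=0)
print(result.importances_mean)
</pre>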
<h4>MLOps <span id="mlops"></span></h4>
<ul>
<li><a href="https://github.com/bentoml/BentoML">BentoML</a>: open platform that simplifies ML model deployment by saving models in a standard format, defining a web service with pre/post processing, deploying the web service in a container</li>
<li>Data <a href="https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/">a16z</a>
<a href="/theme/images/1*LYBSxf0MPcERPzkEJlk9Cw.png.png"><img src="/theme/images/1*LYBSxf0MPcERPzkEJlk9Cw.png.png" alt="A Unified Data Infra" style="width: 100%" loading="lazy"></a></li>
<li>ML Blueprint <a href="https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/">a16z</a>
<a href="/theme/images/1*MqMX4k5IupAK9T9vKs5h8g.png.png"><img src="/theme/images/1*MqMX4k5IupAK9T9vKs5h8g.png.png" alt="AI and ML Blueprint" style="width: 100%" loading="lazy"></a></li>
<li>Lifecycle <a href="https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/">AWS blog</a> <a href="/theme/images/mvaymymdlhxpalecniyphkibwaqhmboz.jpg"><img src="/theme/images/mvaymymdlhxpalecniyphkibwaqhmboz.jpg" alt="ML lifecycle" style="width: 100%" loading="lazy"></a></li>
<li>EthicalML/awesome-production-machine-learning <a href="https://github.com/EthicalML/awesome-production-machine-learning">EthicalML github</a></li>
<li>Pipeline tools <a href="https://github.com/EthicalML/awesome-production-machine-learning#data-pipeline-etl-frameworks">EthicalML/data-pipeline-etl-frameworks</a></li>
<li>MLOps Google <a href="https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning#mlops_level_2_cicd_pipeline_automation">Google</a>
<a href="/theme/images/20220115104222.png"><img src="/theme/images/20220115104222.png" alt="A Unified Data Infra" style="width: 100%" loading="lazy"></a></li>
</ul>
<h4>Model evaluation <span id="model_evaluation"></span></h4>
<ul>
<li>Classification:<ul>
<li>Recall: <a href="https://en.wikipedia.org/wiki/Precision_and_recall#Recall">wiki</a></li>
<li>Receiver operating characteristic (ROC): relates the true positive rate (y-axis) to the false positive rate (x-axis), where TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A scikit-learn sketch follows at the end of this section. <a href="https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc">scikit</a></li>
</ul>
</li>
<li>Regression<ul>
<li>R2: measures the strength of a linear relationship; can be near 0 for nonlinear relationships. Training R2 never decreases as more features are added. <a href="https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score">scikit</a></li>
</ul>
</li>
<li>Learning curves <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#plotting-learning-curves">scikit tutorial</a> <a href="/theme/images/1*fz1sqw361u7Y_D1G-aDEmw.png.png"><img src="/theme/images/1*fz1sqw361u7Y_D1G-aDEmw.png.png" alt="Learning Curve example" style="width: 100%" loading="lazy"></a></li>
<li>Overfitting and regularization<ul>
<li>Overfitting (high variance) options: more data, increase regularization, or decrease model complexity. <a href="https://rmartinshort.jimdofree.com/2019/02/17/overfitting-bias-variance-and-leaning-curves/">tutorial</a></li>
<li>Underfitting (high bias) options: decrease regularization, increase model complexity</li>
<li>Lasso regression: linear model regularization technique with tendency to prefer solutions with fewer non-zero coefficients. <a href="https://scikit-learn.org/stable/modules/linear_model.html#lasso">scikit tutorial</a>. <a href="/theme/images/1*bvk1Esh-TGPCIub2ggNzQg.png.png"><img src="/theme/images/1*bvk1Esh-TGPCIub2ggNzQg.png.png" alt="Lasso equation" style="width: 100%" loading="lazy"></a></li>
<li>Ridge regression: imposes a penalty on the size of the coefficients
<a href="/theme/images/1*fekJIBmDHMoU2zQ6wVmQkA.png.png"><img src="/theme/images/1*fekJIBmDHMoU2zQ6wVmQkA.png.png" alt="Ridge Regression" style="width: 100%" loading="lazy"></a><a href="https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification">scikit</a></li>
<li>Validation curve: <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html#plotting-validation-curves">scikit</a><a href="/theme/images/1*HVM4sFhGDTNE40xr5aVCiQ.png.png"><img src="/theme/images/1*HVM4sFhGDTNE40xr5aVCiQ.png.png" alt="validation curve example" style="width: 100%" loading="lazy"></a></li>
</ul>
</li>
</ul>
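<p>A short scikit-learn sketch of the ROC item above (toy data and model; any classifier that outputs scores or probabilities works):</p>
<pre>
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
print("AUC:", roc_auc_score(y_test, scores))
</pre>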
<h4>Preprocessing <span id="preprocessing"></span></h4>
<ul>
<li><a href="https://scikit-learn.org/stable/modules/preprocessing.html">scikit</a></li>
<li>Analysis<ol>
<li>Remove duplicates</li>
<li>SOCS of each feature: Shape (skew), Outliers, Center, Spread</li>
<li>Feature correlation</li>
</ol>
</li>
<li>Production pipeline<ol>
<li>Outliers: remove or apply non-linear transformations</li>
<li>Missing values; imbalanced classes<ul>
<li>SMOTE: generate a new minority-class point on the vector between a minority-class point and one of its nearest neighbors, placed a random fraction (between 0 and 1) of the way from the original point. The algorithm is parameterized with k_neighbors; see the sketch after this list. <a href="https://www.kaggle.com/residentmario/oversampling-with-smote-and-adasyn">tutorial</a></li>
</ul>
</li>
<li>Standardization</li>
<li>Discretization</li>
<li>Encoding categorical features</li>
<li>Generating polynomial features</li>
<li>Dimensionality reduction</li>
</ol>
</li>
</ul>
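<p>A minimal SMOTE sketch using the <i>imbalanced-learn</i> package (an assumption on my part; the notes above link a Kaggle tutorial instead):</p>
<pre>
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# 5% minority class
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# k_neighbors controls which minority neighbors new points are interpolated toward
smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print("after:", Counter(y_res))
</pre>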
<h4>Reinforcement Learning <span id="reinforcement_learning"></span></h4>
<ul>
<li><a href="/theme/images/1*TMQs5IMfL3k9OZwy1cck_A.png"><img src="/theme/images/1*TMQs5IMfL3k9OZwy1cck_A.png" alt="Reinforcement learning" style="width: 100%" loading="lazy"></a></li>
</ul>
<h4>SQL <span id="sql"></span></h4>
<ul>
<li>window functions, row_number() and partition(): <a href="https://docs.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql?view=sql-server-ver15#d-using-row_number-with-partition">tutorial</a></li>
<li>COALESCE(): evaluates the arguments in order and returns the current value of the first expression that initially doesn’t evaluate to NULL. <a href="https://docs.microsoft.com/en-us/sql/t-sql/language-elements/coalesce-transact-sql?view=sql-server-ver15">tutorial</a></li>
</ul>
<h4>Statistics <span id="statistics"></span></h4>
<ul>
<li><a href="https://www.statology.org/tutorials/">Statology tutorial</a></li>
<li>Means<ul>
<li>Arithmetic: <a href="https://mathworld.wolfram.com/ArithmeticMean.html">wolfram</a></li>
<li>Geometric: used in finance to calculate average growth rates and is referred to as the compounded annual growth rate. <a href="https://mathworld.wolfram.com/GeometricMean.html">wolfram</a></li>
<li>Harmonic: used in finance to average multiples such as the price-earnings ratio because it gives equal weight to each data point; a weighted arithmetic mean of such ratios would give greater weight to high data points because the prices in price-earnings ratios are not normalized while the earnings are. <a href="https://mathworld.wolfram.com/HarmonicMean.html">wolfram</a></li>
</ul>
</li>
<li>Probability distributions <a href="https://www.statology.org/statistics-socs/">Description acronym SOCS</a>: shape, outliers, center, spread. <a href="https://medium.com/@srowen/common-probability-distributions-347e6b945ce4">Comparison article</a>. <a href="/theme/images/Bf8a4LtHWOrJ.png"><img src="/theme/images/Bf8a4LtHWOrJ.png" alt="Common probability distributions" style="width: 100%" loading="lazy"></a><ul>
<li>Beta: probability distribution on probabilities bounded [0, 1]. <a href="https://towardsdatascience.com/beta-distribution-intuition-examples-and-derivation-cf00f4db57af">tutorial</a></li>
<li>Binomial: probability of obtaining k successes in n binomial experiments with probability p. <a href="https://www.statology.org/binomial-distribution/">tutorial</a></li>
<li>Normal: empirical rule is sometimes called the 68-95-99.7 rule</li>
<li>Poisson: the probability of obtaining k successes during a given time interval. <a href="https://www.statology.org/poisson-distribution/">Statology tutorial</a>. <a href="https://builtin.com/data-science/poisson-process">tutorial 2</a>.<a href="https://timeseriesreasoning.com/contents/zero-inflated-poisson-regression-model/">Zero Inflated Poisson Regression Model</a></li>
</ul>
</li>
<li>Sample variance: divided by n-1 to achieve an unbiased estimator, because 1 degree of freedom is used to estimate the sample mean. <a href="https://online.stat.psu.edu/stat500/lesson/1/1.5/1.5.3#paragraph--3051">tutorial</a></li>
<li>Tests <a href="/theme/images/1*ShYx679GlV5WVL8ukd2j2w.png"><img src="/theme/images/1*ShYx679GlV5WVL8ukd2j2w.png" alt="Selecting statistical test. Source: Statistical Rethinking 2. Free Chapter 1" style="width: 100%" loading="lazy"></a><ul>
<li>ANOVA: Analysis of variance compares the means of three or more independent groups to determine if there is a statistically significant difference between the corresponding population means. <a href="https://www.statology.org/one-way-anova/">Statology tutorial</a></li>
<li>F-statistic: determines whether to reject a reduced (R) model in favor of a full (F) model. Reject the reduced model if F is large, or equivalently if its associated p-value is small. <a href="https://online.stat.psu.edu/stat501/lesson/6/6.2#paragraph--785">tutorial</a><a href="/theme/images/1*7Vz6m3tqtLAvxF_Xqe2JAQ.png.png"><img src="/theme/images/1*7Vz6m3tqtLAvxF_Xqe2JAQ.png.png" alt="F-statistic" style="width: 100%" loading="lazy"></a></li>
<li>Linear regression coefficient CI: <a href="https://online.stat.psu.edu/stat501/node/644">tutorial</a><a href="/theme/images/1*hQ5pabjmSByQSC5O4r_uRw.png.png"><img src="/theme/images/1*hQ5pabjmSByQSC5O4r_uRw.png.png" alt="t-interval for slope parameter beta_1" style="width: 100%" loading="lazy"></a></li>
<li>T-test: <a href="https://online.stat.psu.edu/stat555/node/36/">tutorial</a><a href="/theme/images/1*R1ysZ-ofSr5wXwE0_emXiQ.png.png"><img src="/theme/images/1*R1ysZ-ofSr5wXwE0_emXiQ.png.png" alt="T-test formula" style="width: 100%" loading="lazy"></a></li>
</ul>
</li>
</ul>
<h2>Machine Learning Docker Template (2021-12-18)</h2>
<h3>Contents</h3>
<ul>
<li><a href="#summary">Summary</a></li>
<li><a href="#code">Code</a></li>
</ul>
<h3>Summary <span id="summary"></span></h3>
<p>The purpose of this post is to propose a template for machine learning projects that strives to follow these principles:</p>
<ol>
<li>All data scientists can quickly set up an identical development environment based on Docker that encourages good software engineering practices.</li>
<li>Dependency management is handled during the environment's startup by <a href="https://docs.conda.io/en/latest/miniconda.html">Miniconda</a> and requires minimal manual changes.</li>
<li>Notebooks are encouraged for exploration. However, for production purposes notebooks must be version controlled, parametrized, and run using <a href="https://github.com/nteract/papermill">Papermill</a> (a short Papermill sketch follows this list).</li>
</ol>
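<p>A short Papermill sketch of point 3; the notebook paths and parameters below are hypothetical:</p>
<pre>
import papermill as pm

pm.execute_notebook(
    "notebooks/train_model.ipynb",      # version-controlled input notebook
    "output/train_model_run.ipynb",     # executed copy stored as an artifact
    parameters={"learning_rate": 0.01, "n_estimators": 200},
)
</pre>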
<h3>Code <span id="code"></span></h3>
<p>The template is available on github <a href="https://github.com/adamnovotnycom/machine-learning-docker-template">adamnovotnycom/machine-learning-docker-template</a>. The general template structure looks as follows:
<a href="/theme/images/20220129115215.png"><img src="/theme/images/20220129115215.png" alt="ML Template" style="width: 100%" loading="lazy"></a></p>
<ol>
<li>
<p><b>Dockerfile</b> defines the development environment and uses Miniconda as base image
<pre>
FROM continuumio/miniconda3
...
RUN conda env create -f conda.yml
RUN echo "source activate dev" > ~/.bashrc
...
</pre></p>
</li>
<li>
<p><b>conda.yaml</b> is used for dependency management and includes standard data science packages.</p>
</li>
<li><b>ml_docker_template</b> package should include all production code that can be installed and run by an external system. As a result, the code can be developed locally but also easily runs on an external machine when additional compute power is needed for model training or when additional permissions are required for deployment.</li>
</ol>
<h2>Keras LSTM Forecasting Using Synthetic Data (2021-11-13)</h2>
<h3>Contents</h3>
<ul>
<li><a href="#summary">Summary</a></li>
<li><a href="#notebook">Notebook</a></li>
</ul>
<h3>Summary <span id="summary"></span></h3>
<p>Keras LSTM can be a powerful tool for forecasting. Below is a simple template notebook showing how to set up a data science forecasting experiment.</p>
<h4>Dataset</h4>
<p>A synthetic dataset was generated using a scikit-learn regression generator <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html#sklearn.datasets.make_friedman1" target="_blank">make_friedman1</a>. The dataset is nonlinear, with noise, and some features are manually scaled to make the deep learning task more challenging. Time series dependence is created by making each label a weighted average of the <i>make_friedman1</i> generated values and previous labels. For details see notebook function <i>generate_data()</i>.</p>
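<p>For reference, a sketch of the generator call (the lag-based label construction lives in the notebook's <i>generate_data()</i> and is omitted here; the parameter values are illustrative):</p>
<pre>
import pandas as pd
from sklearn.datasets import make_friedman1

# only the first 5 of the 10 features are informative by construction
X, y = make_friedman1(n_samples=2000, n_features=10, noise=1.0, random_state=0)
df = pd.DataFrame(X, columns=[f"x_{i}" for i in range(X.shape[1])])
df["label"] = y
print(df.head())
</pre>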
<p>The image below shows correlations between the generated features and the <i>future_label</i> we are trying to forecast. Features x_0 - x_4 are the only informative features, as can be verified from the bottom row showing meaningful but not very strong correlations:
<a href="/theme/images/lstm_oin235no.png"><img src="/theme/images/lstm_oin235no.png" alt="Feature correlations with future_label" style="width: 100%" loading="lazy"></a></p>
<h4>Model training</h4>
<p>The model is a simple NN with a single hidden layer defined as
<i>keras.layers.LSTM(32)</i>. The generated dataset is split into training, validation, and test sets, each honoring the time series nature of the data. The validation set is used to stop training early and prevent overfitting. However, this is not a concern for our synthetic dataset, as the following chart shows. The validation curve never starts increasing as training epochs continue:
<a href="/theme/images/lstm_synthetic_data_9827345.png"><img src="/theme/images/lstm_synthetic_data_9827345.png" alt="Validation loss" style="width: 100%" loading="lazy"></a></p>
<h4>Model evaluation</h4>
<p>Comparing predictions and actual labels for the validation set shows strong performance even though there are clear optimizations that can be made near extreme values:
<a href="/theme/images/lstm_synthetic_data_val_q4598.png"><img src="/theme/images/lstm_synthetic_data_val_q4598.png" alt="Validation loss" style="width: 100%" loading="lazy"></a></p>
<p>However, the validation set was already used during training for early stopping. This is why we set aside a test dataset the model has never seen during training. The test dataset is the only true evaluation of the expected performance of the model and in this case it confirms that the model performs well for the synthetic dataset:
<a href="/theme/images/lstm_synthetic_data_test_234897f.png"><img src="/theme/images/lstm_synthetic_data_test_234897f.png" alt="Validation loss" style="width: 100%" loading="lazy"></a></p>
<h4>Notebook <span id="notebook"></span></h4>
<ul>
<li><a href="/blog/lstm-forecast-synthetic-data.html#notebook">embedded in blog post</a></li>
<li><a href="/notebooks/lstm_synthetic_data.html" target="_blank">as html</a></li>
<li><a href="https://gist.github.com/adamnovotnycom/36af4c4400a7f970982685472661eba1" target="_blank">as Github Gist</a></li>
</ul>
<p><iframe
title="Keras LSTM Forecasting Using Synthetic Data notebook"
width="100%"
height="17000px"
src="/notebooks/lstm_synthetic_data.html">
</iframe>
</p>
<h2>Scikit-learn Pipeline with Feature Engineering (2021-08-30)</h2>
<h4>Contents</h4>
<ol>
<li><a href="#summary">Summary</a></li>
<li><a href="#notebook">Notebook</a></li>
</ol>
<h4 id="summary">Summary</h4>
<p>In general, a machine learning pipeline should have the following characteristics:</p>
<p><a href="/theme/images/1*8PUAA9DjMv6CMsPWhbayIQ.png.png"><img src="/theme/images/1*8PUAA9DjMv6CMsPWhbayIQ.png.png" alt="scikit-learn logo" style="width: 50%" loading="lazy"></a></p>
<ul>
<li>To ensure data consistency, the pipeline should include every step (such as feature engineering) required to train and score training and testing datasets, and score real time requests. The pipeline does not need to include one-off steps such as removing duplicates.</li>
<li>Numerical features are transformed using scikit-learn classes. SimpleImputer is used to fill missing values and StandardScaler for scaling.</li>
<li>Categorical columns are similarly transformed. OneHotEncoder is applied to the columns containing categorical values. Importantly, I like to define the categories argument to prevent the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality" target="_blank">Curse of dimensionality</a> that might occur when too many categories are present.</li>
<li>An example custom feature engineering class DailyTrendFeature is included in the pipeline for illustration.</li>
<li>The pipeline allows for parallel preprocessing subject to the limits of the computing environment. For example, the preprocessing of categorical and numerical features can take place in parallel because the transformation steps are independent of each other. This is accomplished using scikit-learn's <pre>FeatureUnion(n_jobs=-1, ...)</pre> class that combines other pipeline steps. A condensed sketch follows this list.</li>
<li><a href="https://gist.github.com/adamnovotnycom/a09294f179d8e483d5411eb5c8c4e00f" target="_blank">Notebook as Github Gist</a></li>
</ul>
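<p>A condensed sketch of such a preprocessing pipeline. It uses scikit-learn's ColumnTransformer, a close cousin of the FeatureUnion approach in the notebook; the column names are made up and the custom DailyTrendFeature step is omitted:</p>
<pre>
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["price", "minimum_nights"]        # hypothetical columns
categorical_cols = ["neighbourhood"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# fixing categories (or handle_unknown) up front guards against the curse of dimensionality
categorical_pipe = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# independent branches can be processed in parallel, as with FeatureUnion(n_jobs=-1, ...)
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
], n_jobs=-1)
</pre>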
<h4 id="notebook">Notebook</h4>
<p><iframe
title="Scikit-learn Pipeline with Feature Engineering notebook"
width="100%"
height="10000px"
src="/notebooks/sklearn_pipe.html">
</iframe>
</p>
<h2>Global Temperature Forecast Using Prophet and CO2 (2021-05-16)</h2>
<p>
In this article I will leverage the global <a href="https://adamnovotny.com/blog/berkeley-earth-global-temperature-data2.html">temperature dataset I discussed previously</a> to make a temperature forecast using <a href="https://facebook.github.io/prophet/">Facebook Prophet</a> for the next 50 years. Note: the temperature dataset serves ONLY as a vehicle to learn how to do forecasting using Prophet. In general, climate and other complex sciences cannot be solved using a simple tool such as Prophet.</p>
<p> All code can be found in this <a href="https://gist.github.com/adamnovotnycom/8752aa0732576eac32de4e0b9fbda601">gist</a>.
</p>
<section>
<h4>Data</h4>
<p>
To review, the temperature dataset covers monthly data since 1850 including 95% confidence intervals (high CI - blue, low CI - red):
</p>
<a href="/theme/images/bx08l40tssl2p9wv.png">
<img style="width: 100%" loading="lazy" alt="temperature dataset" src="/theme/images/bx08l40tssl2p9wv.png">
</a>
<p>
In addition, I will use the CO2 emissions data from <a href="https://ourworldindata.org/grapher/annual-co2-emissions-per-country?tab=chart&time=1924..latest&country=~OWID_WRL">ourworldindata.org</a>:
</p>
<a href="/theme/images/4ob68f4dlgcu1r4z.png">
<img style="width: 100%" loading="lazy" alt="CO2 emissions dataset" src="/theme/images/4ob68f4dlgcu1r4z.png">
</a>
</section>
<section>
<h4>Forecast</h4>
<p>
I will only highlight here how the Prophet API works (specifically when we want to include an additional regressor such as CO2). First, we need to format the training dataset such that the label column is named <i>y</i> and the date column <i>ds</i>
</p>
<a href="/theme/images/4na1arghpydyjqu6.png">
<img style="width: 100%" loading="lazy" alt="Prophet training dataset" src="/theme/images/4na1arghpydyjqu6.png">
</a>
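<p>A small pandas sketch of that reshaping step (the source column names here are assumptions; <i>co2_monthly_bn_tons</i> is the regressor used below):</p>
<pre>
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("1850-01-01", periods=3, freq="MS"),
    "temperature_C": [13.2, 13.4, 13.9],
    "co2_monthly_bn_tons": [0.02, 0.02, 0.02],
})

# Prophet expects the date column to be named "ds" and the label "y";
# extra regressor columns keep their own names
prophet_train_set = df.rename(columns={"date": "ds", "temperature_C": "y"})
print(prophet_train_set.head())
</pre>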
<p>
Next, we train the Prophet model and add the custom regressor (CO2):
</p>
<pre>
from prophet import Prophet  # on older installs: from fbprophet import Prophet

m = Prophet()
m.add_regressor("co2_monthly_bn_tons",
prior_scale=0.5,
mode="multiplicative",
standardize=True)
m.fit(prophet_train_set)
</pre>
<p>
Then we need to create a forecast dataset that includes the dates to be forecasted and assumptions for the custom regressor. In the temperature forecasting dataset, I created timestamps for the next 50 years. Last 3 rows of the forecast dataset ("prophet_forecast_set"):
</p>
<a href="/theme/images/h12zg3yiwmdk3pa1.png">
<img style="width: 100%" loading="lazy" alt="Prophet forecast dataset" src="/theme/images/h12zg3yiwmdk3pa1.png">
</a>
<p>
In order to create the dataset above, I had to make an assumption about CO2 growth. I assumed that monthly growth over the next 50 years will continue at the same pace as it has between 2000-2020:
</p>
<a href="/theme/images/9p5vaqrkoh3u5bgr.png">
<img style="width: 100%" loading="lazy" alt="CO2 growth assumptions" src="/theme/images/9p5vaqrkoh3u5bgr.png">
</a>
<p>
In reality, the value of the temperature forecast comes from the data scientist's background knowledge of the field. In this example, in order for the temperature forecast to be valuable, we have to be able to forecast CO2 emissions (and other regressors) with high confidence.
</p>
<p>
Performing the actual forecast using Prophet is very simple:
</p>
<pre>
forecast_prophet = m.predict(prophet_forecast_set)
forecast_prophet.head(5)
</pre>
<p>
Prophet generates valuable confidence intervals for its forecast. These confidence bars are more valuable than the point forecast itself. In the chart below, the point forecast in 2070 is 16.1C, but the interval ranges widely, from 15.2C to nearly 17C.
</p>
<a href="/theme/images/66bdwdd86os7jq45.png">
<img style="width: 100%" loading="lazy" alt="Temperature forecast" src="/theme/images/66bdwdd86os7jq45.png">
</a>
</section>
<section>
<h4>Validation</h4>
<p>
The step that many people doing forecasts "conveniently" skip is validation. In other words, if we had approached the problem the same way in the past, how incorrect would we turn out to be today?
</p>
<p>
Let's assume that we are standing in 1970, and we apply the exact same methodology as above to forecast the next 50 years (so we are forecasting 1970-2020). What would the forecasting graphs look like compared to the reality we've already experienced? First, our hypothetical CO2 assumption would match reality reasonably nicely:
</p>
<a href="/theme/images/4npp2fdw5x5cinm3.png">
<img style="width: 100%" loading="lazy" alt="Hypothetical CO2 forecast since 1970" src="/theme/images/4npp2fdw5x5cinm3.png">
</a>
<p>
However, our temperature point forecast would underestimate reality. The actual temperatures still fall within the forecast's confidence intervals, nearly perfectly aligning with the upper bound, but the behavior of the point forecast doesn't appear to reflect the upward slope we've experienced historically:
</p>
<a href="/theme/images/b74bpz4kt579ecxe.png">
<img style="width: 100%" loading="lazy" alt="Hypothetical temperature forecast since 1970" src="/theme/images/b74bpz4kt579ecxe.png">
</a>
<p>
This is an example of why confidence intervals are more important than point estimates. Also, it reflects how important it is to be intellectually honest when forecasting and performing historical validation. The takeaway here might be that we are missing additional regressors to be able to properly forecast the temperature physical process.
</p>
</section>
<h2>Berkeley Earth Global Temperature Data (2021-05-14)</h2>
<p><a href="http://berkeleyearth.org/data/">Berkeley Earth</a> publishes a unique dataset with global temperature measurements. Below is a guide to downloading the data and starting to analyze it using Python. All code can be found in this <a href="https://gist.github.com/adamnovotnycom/e844fbfdbcc563123cbbfcd96604bb7b">gist</a>.</p>
<p><a href="/theme/images/pztgdtmbiuigjqxlyuejjkjmprukqdkbjqvcbdc.png"><img style="width: 100%" loading="lazy" alt="Berkeley Earth air temperature measurements above sea ice" src="/theme/images/pztgdtmbiuigjqxlyuejjkjmprukqdkbjqvcbdc.png"></a></p>
<p>Download the .txt file from the <a href="http://berkeleyearth.org/data/">Berkeley Earth</a> data website, section "Land + Ocean (1850 — Recent)", and read it using the following Python command:</p>
<pre>
colspecs = [(2, 6), (10, 12), (14, 22), (24, 29)]
df = pd.read_fwf(
"/content/drive/My Drive/Colab Notebooks/berkeley_earth/data/Land_and_Ocean_complete.txt",
colspecs=colspecs,
header=85
)
df.columns = ["year", "month", "anomaly_C", "confidence_95_C"]
df.head(12)
</pre>
<p>colspecs defines the column index ranges, so (2, 6) represents the year column in the source text file.</p>
<p><a href="/theme/images/0dpug746gz4r84cl.png"><img style="width: 100%" loading="lazy" alt="/theme/images/" src="/theme/images/0dpug746gz4r84cl.png"></a></p>
<p>The data documentation explains that <i>anomaly_C</i> is the recorded temperature anomaly in Celsius relative to the estimated Jan 1951-Dec 1980 global mean temperature of 14.108 +/- 0.02. The chart below shows the absolute air temperatures along with 95% uncertainty intervals (in green) recorded during the 2000s.</p>
<h2>Dynamic HTML with Python, AWS Lambda, and Containers (2021-03-27)</h2>
<p>This article is an extension of my previous article describing a similar <a href="https://adamnovotny.com/blog/serving-dynamic-web-pages-using-python-and-aws-lambda.html" target="_blank">deployment process using native AWS Lambda tools</a>. However, Amazon has since started <a href="https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/" target="_blank">supporting container images</a> and updated its pricing policy to <a href="https://aws.amazon.com/blogs/aws/new-for-aws-lambda-1ms-billing-granularity-adds-cost-savings/" target="_blank">1ms granularity</a>. Both are major developments that improve tooling and make small deployments cost effective.</p>
<p><a href="/theme/images/1*WSpeFmskKx0xiwx-WRRJ6A.jpeg.png"><img src="/theme/images/1*WSpeFmskKx0xiwx-WRRJ6A.jpeg.png" alt="Deploying AWS Lambda using a container" style="width: 100%" loading="lazy"></a></p>
<p>My <a href="https://adamnovotny.com/blog/serving-dynamic-web-pages-using-python-and-aws-lambda.html" target="_blank">previous</a> article focused on the logic of the code and didn’t address how to actually deploy the function because that was well covered by AWS in its many tutorials. Here I explore the new container deployment options while keeping all business logic untouched. Please review the AWS tutorial on deploying <a href="https://docs.aws.amazon.com/lambda/latest/dg/python-image.html" target="_blank">generic Python Lambda code using containers</a>, which I leveraged below.</p>
<h4>1. Dockerfile</h4>
<pre>FROM public.ecr.aws/lambda/python:3.8
RUN mkdir -p /mnt/app
ADD app.py /mnt/app
ADD index.html /mnt/app
WORKDIR /mnt/app
RUN pip install --upgrade pip
RUN pip install Jinja2==2.11.*
CMD ["/mnt/app/app.handler"]</pre>
<p>I am using the AWS base image because it is packaged with a very nice mini server that simulates function responses when developing locally. This is extremely useful because we can call the function with 100s of arguments and verify that it behaves as expected before it is deployed.</p>
<h4>App code</h4>
<p>From the Dockerfile, we can see that all application code is contained in two files:</p>
<p>1) app.py:</p>
<pre>import os
from jinja2 import Environment, FileSystemLoader</pre>
<pre>def lambda_handler(event, context):
env = Environment(loader=FileSystemLoader(os.path.join(os.path.dirname(__file__), "."), encoding="utf8"))
my_name_from_query = False
if event["queryStringParameters"] and "my_name" in event["queryStringParameters"]:
my_name_from_query = event["queryStringParameters"]["my_name"]
template = env.get_template("index.html")
html = template.render(
my_name=my_name_from_query
)
return {
"statusCode": 200,
"body": html,
"headers": {
"Content-Type": "text/html",
}
}</pre>
<p>2) index.html:</p>
<p><a href="/theme/images/1*aJfoRhYbqPXONxww3qyYeA.png.png"><img src="/theme/images/1*aJfoRhYbqPXONxww3qyYeA.png.png" alt="index.html" style="width: 100%" loading="lazy"></a></p>
<p>app.py simply parses one argument named “my_name” from the Lambda query string and passes it to the html template as a variable named “my_name”. Jinja2 then renders the variable into the final template.</p>
<h4>Calling and testing the app locally</h4>
<p>Testing the app locally is very simple thanks to the new container packaging. Simply run docker-compose -f docker-compose.yml up, where the docker-compose.yml file is defined as:</p>
<pre>version: '3'
services:
cont_name:
container_name: cont_name
image: cont_name_img
build:
context: .
dockerfile: Dockerfile
volumes:
- .:/mnt/app
ports:
- "9000:8080"
stdin_open: true
tty: true
restart: always</pre>
<p>This stands up the function locally on a simple AWS-provided server. We can send requests and monitor responses using Python code such as:</p>
<pre>import requests
r = requests.get(
"http://localhost:9000/2015-03-31/functions/function/invocations",
data=open("event.json", "rb")
)
print(r.json())</pre>
<p>where “event.json” is any .json file we wish to send to the lambda function as arguments. In the example case above, we would send something like:</p>
<pre>{
"queryStringParameters": {
"my_name": "Adam"
}
}</pre>
<h4>Cost</h4>
<p>The simple AWS base server returns responses such as the one below. This is where we can see the significant impact of the new 1ms pricing update. The billed duration of running this example code is about 9ms, which is very small considering that we are returning a full html template to browsers. However, previously AWS would charge for the full 100ms because that was the minimum charge defined. Now, this function could cost nearly 90% less!</p>
<p><a href="/theme/images/1*5Xq3l1IxQmCWQxIkSg4wLw.png.png"><img src="/theme/images/1*5Xq3l1IxQmCWQxIkSg4wLw.png.png" alt="Lambda duration" style="width: 100%" loading="lazy"></a></p>Google Colab and Auto-sklearn with Profiling2021-03-20T00:00:00-05:002021-03-20T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2021-03-20:/blog/google-colab-and-auto-sklearn-with-profiling.html<p>This article is a follow up to my previous tutorial on how to <a href="https://adamnovotny.com/blog/google-colab-and-automl-auto-sklearn-setup.html" target="_blank">setup Google Colab and auto-sklean</a>. Here, I will go into more detail that shows auto-sklearn performance on an artificially created dataset. The full notebook gist can be found <a href="https://gist.github.com/adamnovotnycom/ffe8e3961fe0207c64a1b9a074883e51" target="_blank">here</a>.</p>
<p>First, I generated a regression dataset using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html" target="_blank">scikit …</a></p><p>This article is a follow up to my previous tutorial on how to <a href="https://adamnovotny.com/blog/google-colab-and-automl-auto-sklearn-setup.html" target="_blank">setup Google Colab and auto-sklean</a>. Here, I will go into more detail that shows auto-sklearn performance on an artificially created dataset. The full notebook gist can be found <a href="https://gist.github.com/adamnovotnycom/ffe8e3961fe0207c64a1b9a074883e51" target="_blank">here</a>.</p>
<p>First, I generated a regression dataset using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html" target="_blank">scikit learn</a>.</p>
<pre>X, y, coeff = make_regression(
n_samples=1000,
n_features=100,
n_informative=5,
noise=0,
shuffle=False,
coef=True
)</pre>
<p><a href="/theme/images/1*Nv5JrZA6e7M9-K_gPxsspg.jpeg.png"><img src="/theme/images/1*Nv5JrZA6e7M9-K_gPxsspg.jpeg.png" alt="Subset of 100 generated features" style="width: 100%" loading="lazy"></a></p>
<p>This generates a dataset with 100 numerical features where the first 5 features are informative (these are labeled as “feat_0” to “feat_4”). The rest (“feat_5” to “feat_99”) are random noise. We can see this in the scatter matrix above where only the first 5 features show a correlation with the label.</p>
<p>We know that this is a simple regression problem which could be solved using a linear regression perfectly. However, knowing what to expect helps us to verify the performance of auto-sklearn which trains its ensemble model using the following steps:</p>
<pre>import autosklearn.regression

automl = autosklearn.regression.AutoSklearnRegressor(
time_left_for_this_task=300,
n_jobs=-1
)
automl.fit(
X_train_transformed,
df_train["label"]
)</pre>
<p>I also created random categorical features which are then one-hot-encoded into a feature set “X_train_transformed“. Running the AutoSklearnRegressor for 5 minutes (time_left_for_this_task=300) produced the following expected results:</p>
<pre>predictions = automl.predict(X_train_transformed)
r2_score(df_train["label"], predictions)
>> 0.999
predictions = automl.predict(X_test_transformed)
r2_score(df_test["label"], predictions)
>> 0.999</pre>
<p>A separate pip package <a href="https://github.com/VIDA-NYU/PipelineVis" target="_blank">PipelineProfiler</a> helps us visualize the steps auto-sklearn took to achieve the result:</p>
<p><a href="/theme/images/1*9ZWW9HeGqTjkan4qtd4mDQ.jpeg.png"><img src="/theme/images/1*9ZWW9HeGqTjkan4qtd4mDQ.jpeg.png" alt="PipelineProfiler output" style="width: 100%" loading="lazy"></a></p>
<p>Above we can see the attempts auto-sklearn made to generate the best ensemble of models within the 5 minute constraint I set. The best model found was <a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html" target="_blank">Liblinear SVM</a>, which produced an R2 of nearly 1.0. As a result, this toy ensemble model gives a weight of 1.0 to just one algorithm. Libsvm Svr and Gradient boosting scored between 0.9–0.96.</p>
<h2>Google Colab and AutoML: Auto-sklearn Setup (2020-12-04)</h2>
<p>Auto ML is fast becoming a popular solution to build minimum viable models for new projects. A popular library for Python is <a href="https://automl.github.io/auto-sklearn/master/#" target="_blank">Auto-sklearn</a>, which leverages the most popular Python ML library <a href="http://sklearn.org" target="_blank">scikit-learn</a>. Auto-sklearn runs a smart search over scikit-learn models and parameters to find the best performing ensemble of models.</p>
<p><a href="/theme/images/1*n6-MAHisW5-xLrEUndHe5g.png.png"><img src="/theme/images/1*n6-MAHisW5-xLrEUndHe5g.png.png" alt="Logos of Google Drive + Colab + Scikit-learn + Auto-sklearn" style="width: 100%" loading="lazy"></a></p>
<p>This tutorial describes how to set up Auto-sklearn on <a href="https://colab.research.google.com/" target="_blank">Google Colab</a>. The complete <a href="https://gist.github.com/adamnovotnycom/1df7ef10649d8241c389c96becb7fe37" target="_blank">notebook gist</a> includes a toy project that uses an <a href="https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data" target="_blank">old Airbnb dataset</a> from Kaggle.</p>
<p>The key first step is to install Linux dependencies alongside Auto-sklearn:</p>
<pre>!sudo apt-get install build-essential swig
!pip install auto-sklearn==0.11.1</pre>
<p>After running these commands in Colab, restart the Colab runtime and run all commands again.</p>
<p>The Airbnb dataset can be used for a regression project where price is the label. I selected a few numerical and categorical features randomly so the dataset used for modeling has the following characteristics:</p>
<p><a href="/theme/images/1*-lXTkg7Y9W-XMdNPR5KPkA.png.png"><img src="/theme/images/1*-lXTkg7Y9W-XMdNPR5KPkA.png.png" alt="Airbnb dataset description" style="width: 100%" loading="lazy"></a></p>
<p>A more sophisticated ML project would require a detailed feature selection process and data analysis at this stage. For example, does the maximum value of 1,250 for minimum_nights make sense? In this case, I am simply showing the Auto-sklearn setup so I will skip these time consuming steps.</p>
<p>Next, all numerical features are <a href="https://en.wikipedia.org/wiki/Standard_score" target="_blank">standardized</a> and missing values filled. Scikit-learn (and therefore Auto-sklearn) cannot handle string categories so categorical features are <a href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/" target="_blank">one hot encoded</a>. Also, infrequently appearing categories are combined into a single bucket to combat the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality" target="_blank">Curse of dimensionality</a>. In this case, any neighborhood that appears less than 0.5% of the time is renamed to “neighborhood_other”. Before transformations, the first 5 rows of the training dataset have the following items:</p>
<p><a href="/theme/images/1*5zbUTS8k6rTqTYUtpATzlw.png.png"><img src="/theme/images/1*5zbUTS8k6rTqTYUtpATzlw.png.png" alt="Training dataset before transformations" style="width: 100%" loading="lazy"></a></p>
<p>After transformations, the first few columns of the 5 rows look like this:</p>
<p><a href="/theme/images/1*AySz4rydwvMNfOnt4v-UpA.png.png"><img src="/theme/images/1*AySz4rydwvMNfOnt4v-UpA.png.png" alt="Training dataset after transformations" style="width: 100%" loading="lazy"></a></p>
<p>I am finally ready to explore Auto-sklearn using a few simple commands that fit a new model:</p>
<pre>import autosklearn.regression
automl = autosklearn.regression.AutoSklearnRegressor(
time_left_for_this_task=120,
per_run_time_limit=30,
n_jobs=1
)
automl.fit(
X_train_transformed,
y_train
)</pre>
<p>Finally, here is how the model performs on a test dataset:</p>
<pre>import sklearn.metrics
predictions = automl.predict(X_test_transformed)
sklearn.metrics.r2_score(y_test, predictions)
output: 0.1862</pre>
<p>An alternative approach that doesn’t use Auto-sklearn would be to manually select a model and run a grid search to find best parameters. A typical, well-performing algorithm is RandomForestRegressor so I might try the following:</p>
<pre>from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
model = RandomForestRegressor(max_depth=3, random_state=0)
parameters = {
"max_depth": (2, 3, 5)
}
grid = GridSearchCV(model, parameters, cv=5, scoring="r2")
grid.fit(X_train_transformed, y_train.values.ravel())</pre>
<p>For comparison, the performance of this model would be:</p>
<pre>predictions = grid.predict(X_test_transformed)
sklearn.metrics.r2_score(y_test, predictions)
output: 0.0982</pre>
<p>Impressively, the default Auto-sklearn <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination" target="_blank">R2</a> performance of 0.186 is nearly twice as good as simplistic scikit-learn-only performance of 0.098. These are not intended to be absolute benchmarks because I performed no customization but the relative performance is worth noting. The results suggest that Auto-sklearn can set a very reasonable lower performance bound that no model deployed in production should underperform.</p>
<p>More about me: <a href="https://adamnovotny.com" target="_blank">adamnovotny.com</a></p>
<h2>Google Paper: 24/7 by 2030 (2020-10-31)</h2>
<p>Google released a <a href="https://www.gstatic.com/gumdrop/sustainability/247-carbon-free-energy.pdf" target="_blank">white paper</a> describing how the company intends to generate all of its electricity needs from renewable energy sources by 2030. Previously, Google committed to reducing emissions by buying offsets or generating renewable energy off-cycle. This new commitment goes further: “Google intends to match its operational electricity use with nearby carbon-free energy sources in every hour of every year”</p>
<p><a href="/theme/images/1*HOTcFQxbxwukF2iOdh3lkw.png.png"><img src="/theme/images/1*HOTcFQxbxwukF2iOdh3lkw.png.png" alt="Google’s energy journey" style="width: 100%" loading="lazy"></a></p>
<p>Everybody interested should read it — it’s short.</p>
<p>Google cooperated with <a href="https://www.watttime.org/" target="_blank">Watttime</a> to generate the dataset that measures the carbon emissions intensity in regions where Google’s data centers are located. Watttime has a very interesting <a href="https://www.watttime.org/api-documentation/#introduction" target="_blank">API</a> providing carbon intensity in real time. I collected a random set of data points over 24 hours for a selected number of regions where Google data centers are <a href="https://www.google.com/about/datacenters/locations/" target="_blank">located</a> in the US. All code is available in this <a href="https://gist.github.com/excitedAtom/0d980a908a35732e9e55e8b1e8f27985" target="_blank">Github gist</a>.</p>
<p><a href="/theme/images/1*U11j-6mlV0pwUsNslceXaQ.png.png"><img src="/theme/images/1*U11j-6mlV0pwUsNslceXaQ.png.png" alt="Marginal Operating Emissions Rates (MOER) of Select Google Data Centers" style="width: 100%" loading="lazy"></a></p>
<p>Marginal Operating Emissions Rates (MOER): 0 represents no emissions (clean energy generation), 100 represents the highest emissions. In other words, in regions where MOER is high, Google has a lot of work to do to replace the electricity it uses with clean sources (or to store clean energy). In the chart above, <a href="https://www.google.com/about/datacenters/locations/midlothian/" target="_blank">Midlothian, TX</a> appears to be one of those challenging locations. On the other hand, Mayes County, OK is a location that Google appears to be satisfied with: “Our highest clean energy percentage is in Oklahoma (Southwest Power Pool), where our purchases of wind power helped drive carbon-free energy performance at our data center from 41% to 96%”</p>
<p>I firmly believe that renewable energy will be widely adopted only if it is at least as cheap as alternatives. So exploring the data from a financial perspective is critical: From 2009 to 2019, costs for wind and solar power declined by 70% and 89% (<a href="https://www.lazard.com/media/451086/lazards-levelized-cost-of-energy-version-130-vf.pdf" target="_blank">see page 8</a>).</p>Serving Dynamic Web Pages using Python and AWS Lambda2020-07-25T00:00:00-05:002020-07-25T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2020-07-25:/blog/serving-dynamic-web-pages-using-python-and-aws-lambda.html<p>While AWS Lambda functions are typically used to build API endpoints, at their core Lambda functions can return almost anything. This includes returning html markup with dynamic content.</p>
<p><a href="/theme/images/1*9bPdHLV7ghV1RuNYOGkTvA.png.png"><img src="/theme/images/1*9bPdHLV7ghV1RuNYOGkTvA.png.png" alt="AWS Lambda + Python + Jinja" style="width: 100%" loading="lazy"></a></p>
<p>I will not go into details describing how to deploy AWS Lambda functions. Please see the official <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-python.html" target="_blank">documentation</a>. I will however describe …</p><p>While AWS Lambda functions are typically used to build API endpoints, at their core Lambda functions can return almost anything. This includes returning html markup with dynamic content.</p>
<p><a href="/theme/images/1*9bPdHLV7ghV1RuNYOGkTvA.png.png"><img src="/theme/images/1*9bPdHLV7ghV1RuNYOGkTvA.png.png" alt="AWS Lambda + Python + Jinja" style="width: 100%" loading="lazy"></a></p>
<p>I will not go into details describing how to deploy AWS Lambda functions. Please see the official <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-python.html" target="_blank">documentation</a>. I will however describe how to return dynamic html content instead of a typical <a href="https://en.wikipedia.org/wiki/JSON" target="_blank">JSON</a>.</p>
<h4>Step 0 — Optional</h4>
<p>If you prefer to develop and test lambda functions locally (as I do), you can use Docker to simulate the AWS lambda function environment. A sample Dockerfile I use is below.</p>
<pre># Base image approximating the AWS Lambda runtime environment
FROM amazonlinux:latest
# Copy the project into the image
RUN mkdir -p /mnt/app
ADD . /mnt/app
WORKDIR /mnt/app
# Build tools needed to compile Python dependencies
RUN yum update -y
RUN yum install gcc -y
RUN yum install gcc-c++ -y
RUN yum install findutils -y
RUN yum install zip -y
# Python version matching the Lambda runtime used in this article
RUN amazon-linux-extras install python3=3.6.2
RUN pip3 install --upgrade pip
# Install dependencies into the folder structure expected by a Lambda layer
RUN pip3 install -r requirements.txt -t aws_layer/python</pre>
<p>The requirements.txt includes just one package for simplicity. It is the common <a href="https://jinja.palletsprojects.com/en/2.11.x/" target="_blank">templating library for Python, Jinja2</a>.</p>
<pre>Jinja2==2.11.1</pre>
<p>You can test your Lambda function by simply calling it with sample parameters:</p>
<pre>import lambda_function

event = {
    "queryStringParameters": {
        "param1": "value1"
    },
    "path": "/api",
    "requestContext": {
        "param2": "value2"
    }
}
res = lambda_function.lambda_handler(event=event, context={})
assert 200 == int(res["statusCode"])</pre>
<h4>Step 1 — Write html template</h4>
<p>In this step, we write the html template the Lambda function will return. A good default is the new <a href="https://v5.getbootstrap.com/docs/5.0/getting-started/introduction/" target="_blank">Bootstrap 5</a> CSS framework where the recommended starting markup looks something like this:</p>
<p><a href="/theme/images/1*b3ZXkCsw8BwLt2Fx_eu6PQ.png.png"><img src="/theme/images/1*b3ZXkCsw8BwLt2Fx_eu6PQ.png.png" alt="Sample HTML page" style="width: 100%" loading="lazy"></a></p>
<p>Saving this file in folder “templates” and naming it index.html, we are ready to write the Lambda function.</p>
<h4>Step 2 — Write Lambda function to serve your html page</h4>
<p>In the example below, the lambda function expects URL parameters and parses those. A request with a custom URL would look something like this: example.com/?my_name=somename. See step 10 in <a href="https://adamnovotny.com/blog/serverless-web-apps-with-firebase-and-aws-lambda.html" target="_blank">this tutorial</a> to add custom URLs to your API Gateway-triggered Lambda functions.</p>
<pre>import os
import sys
from jinja2 import Environment, FileSystemLoader</pre>
<pre>def lambda_handler(event, context):
    env = Environment(loader=FileSystemLoader(os.path.join(os.path.dirname(__file__), "templates"), encoding="utf8"))
    my_name_query = False
    if event["queryStringParameters"] and "my_name" in event["queryStringParameters"]:
        my_name_query = event["queryStringParameters"]["my_name"]
    template = env.get_template("index.html")
    html = template.render(
        my_name=my_name_query
    )
    return response(html)</pre>
<pre>def response(myhtml):
    return {
        "statusCode": 200,
        "body": myhtml,
        "headers": {
            "Content-Type": "text/html",
        }
    }</pre>
<ul><li>jinja2 loads your previously created index.html using class “FileSystemLoader” and we store it as variable “env”</li></ul>
<ul><li>the “my_name” URL query parameter is parsed as explained above and stored as the Python variable my_name_query</li></ul>
<ul><li>the jinja2 render function then passes my_name_query to the template and returns the html page</li></ul>
<p>Also published on <a href="https://adamnovotny.com" target="_blank">adamnovotny.com</a></p>Custom VPN using PiVPN and public cloud2018-12-30T00:00:00-05:002018-12-30T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-12-30:/blog/custom-vpn-using-pivpn-and-public-cloud.html<p>Motivation: Many public Wi-Fi networks block certain internet ports and protocols. For example, a public library might only allow ports 80 and 443 and the TCP protocol. Leaving aside the logic of such decisions by network owners, they prevent users from taking advantage of many commercial VPN products that rely …</p><p>Motivation: Many public Wi-Fi networks block certain internet ports and protocols. For example, a public library might only allow ports 80 and 443 and the TCP protocol. Leaving aside the logic of such decisions by network owners, they prevent users from taking advantage of many commercial VPN products that rely on other ports. The goal of this article is to create a custom VPN solution to improve privacy even on such restricted public networks.
<a href="/theme/images/1*hI_JbE-q2OQpXW1tEB46Cg.png.png"><img src="/theme/images/1*hI_JbE-q2OQpXW1tEB46Cg.png.png" alt="AWS, Google Cloud, Microsoft Azure" style="width: 100%" loading="lazy"></a>
<a href="/theme/images/1*7C1odbX4Kk_yxToAaKnlwA.png.png"><img src="/theme/images/1*7C1odbX4Kk_yxToAaKnlwA.png.png" alt="PiVPN" style="width: 100%" loading="lazy"></a></p>
<p>All steps are outlined in more detail in <a href="https://github.com/adam5ny/vpn-gcp-pivpn" target="_blank">this Github repo</a>. The tutorial is written for Python 3 and <a href="https://cloud.google.com/compute/" target="_blank">Google Cloud Compute</a>. However, all public clouds can be used including AWS or Azure.</p>
<h4>Create public cloud compute instance</h4>
<p>Login to <a href="https://console.cloud.google.com/" target="_blank">GCP console</a>. Create an Ubuntu machine and make sure to allow https traffic. Then locate your public IP which is where your traffic will be routed. In GCP, you can find it using the following steps (as of Dec 2018): VPC network > External IP addresses > switch the type of your instance IP from “Ephemeral” to “Static”. This will be your public IP.</p>
<h4>Create PiVPN instance</h4>
<p>Login to your compute instance and download PiVPN using the following command:</p>
<pre>curl -L <a href="https://install.pivpn.io" target="_blank">https://install.pivpn.io</a> | bash</pre>
<p>Follow all setup steps using default values except for port and protocol. Select port 443 and protocol TCP. Select reboot at the end of the installation.</p>
<h4>Create VPN credentials</h4>
<pre>pivpn add</pre>
<p>Enter your custom username and password. Download credentials to your computer from your newly created cloud compute instance. Credentials are typically located at ~/ovpns on your Ubuntu instance.</p>
<h4>Download a VPN client for your platform</h4>
<p>For MacOS you may use <a href="https://tunnelblick.net/" target="_blank">Tunnelblick</a>. Then drag credentials (.ovpn) from the previous step to the Tunnelblick app icon. Click on the Tunnelblick icon to connect to your VPN with your custom username and password.</p>
<h4>Final notes:</h4>
<ul><li>use port 80 instead of 443 above if necessary.</li></ul>
<ul><li>to minimize cost, remember to shut down your compute instance when you are not using the VPN. The typical cost is < $20 for 100GB of traffic and 24/7 usage which is in line with respectable third-party VPN providers. However, your cost may be significantly lower if you shut down unused instances and use pre-emptible instances (GCP-specific).</li></ul>Serverless web apps with Firebase and AWS Lambda2018-09-09T00:00:00-05:002018-09-09T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-09-09:/blog/serverless-web-apps-with-firebase-and-aws-lambda.html<p>Serverless has become a popular solution for small to medium-sized projects. The downside is a technology stack lock-in which forces developers to use technologies that might not be optimal for their projects. For example, people using <a href="https://firebase.google.com/" target="_blank">Google’s Firebase</a> to host their static resources have to write custom endpoint <a href="https://firebase.google.com/docs/functions/" target="_blank">functions …</a></p><p>Serverless has become a popular solution for small to medium-sized projects. The downside is a technology stack lock-in which forces developers to use technologies that might not be optimal for their projects. For example, people using <a href="https://firebase.google.com/" target="_blank">Google’s Firebase</a> to host their static resources have to write custom endpoint <a href="https://firebase.google.com/docs/functions/" target="_blank">functions</a> in JavaScript or TypeScript (as of August 2018). Developers typically use custom backend functions to hide business logic or proprietary data operations from users because anything that runs in the browser’s front end as JavaScript is ultimately an open book from the user’s perspective.</p>
<p><a href="/theme/images/1*6QfJob8HhDsYsGjzYSb3eA.jpeg.png"><img src="/theme/images/1*6QfJob8HhDsYsGjzYSb3eA.jpeg.png" alt="Firebase + AWS Lambda" style="width: 100%" loading="lazy"></a></p>
<p>One simple solution is to combine Firebase with custom functions using a different platform. I will outline the steps to create a Firebase-hosted web app, setup DNS for subdomain, and create AWS Lambda functions to serve custom business logic as APIs. This is just an example setup and all major cloud players provide solutions that can be combined in other ways such as using <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteHosting.html" target="_blank">AWS S3 to host statics resources</a> and <a href="https://cloud.google.com/functions/?utm_source=google&utm_medium=cpc&utm_campaign=emea-emea-all-en-dr-bkws-all-all-trial-e-gcp-1003963&utm_content=text-ad-none-any-DEV_c-CRE_253480695966-ADGP_Hybrid%20%7C%20AW%20SEM%20%7C%20BKWS%20~%20EXA_M:1_EMEA_EN_General_Cloud%20Functions_ETL%20Warehouse-KWID_43700019207153401-kwd-34012173938-userloc_1003837&utm_term=KW_function%20google-ST_function%20google&ds_rl=1245734&gclid=EAIaIQobChMInM_0gaH43AIVpp3tCh1CTAHNEAAYASAAEgJ1iPD_BwE&dclid=COGAr4Oh-NwCFQ6ZdwodUTIBBQ" target="_blank">Google’s Cloud Function</a> to serve business logic API.</p>
<p>At the end we will have:</p>
<ul><li>www.example.com (<a href="https://en.wikipedia.org/wiki/Single-page_application" target="_blank">single page app</a> served by Firebase)</li></ul>
<ul><li>api.example.com (AWS lambda function serving custom business logic used by www.example.com)</li></ul>
<p>I will not go into specific details on each platform because their UIs constantly change. Instead, I will highlight the sequence of steps I typically take to setup the services quickly.</p>
<h4>Firebase hosting</h4>
<ul><li>1) Deploy a static web app to <a href="https://firebase.google.com/products/hosting/" target="_blank">Firebase</a> by following <a href="https://firebase.google.com/docs/hosting/deploying" target="_blank">this part</a> of the Firebase documentation. The end result will be a public web app. Its URL will look something like this: my-project-name.firebaseapp.com</li></ul>
<p><a href="/theme/images/1*gH7ELyB6td5o2A1bY4drGw.png.png"><img src="/theme/images/1*gH7ELyB6td5o2A1bY4drGw.png.png" alt="Firebase hosting setup" style="width: 100%" loading="lazy"></a></p>
<ul><li>2) Let’s assume we purchased the custom domain example.com. We now need to update the DNS records so that example.com and www.example.com point to our static web app.</li></ul>
<ul><li>3) Go to your Firebase project dashboard and in the hosting section initiate the steps to connect to a custom domain. You will need to verify ownership of the domain by adding a DNS TXT record to your registrar’s DNS settings. As always, the <a href="https://firebase.google.com/docs/hosting/custom-domain" target="_blank">documentation</a> is useful.</li></ul>
<p><a href="/theme/images/1*Kd7lW2alk8UA4cL8kkvxxA.png.png"><img src="/theme/images/1*Kd7lW2alk8UA4cL8kkvxxA.png.png" alt="Connect custom domain to Firebase app" style="width: 100%" loading="lazy"></a></p>
<ul><li>4) Go to your domain registrar’s DNS settings, and create a DNS A record for subdomain www pointing to the IP address of the Firebase servers obtained in the previous step. After SSL certificates are automatically provisioned by Firebase, users can go to https://www.example.com to locate your Firebase app.</li></ul>
<ul><li>5) We also need to make sure that users entering just example.com are also pointed to https://www.example.com. To accomplish this, return to your registrar’s DNS settings and set up subdomain forwarding. The exact steps vary for each registrar but the end result will be example.com -> https://www.example.com. If possible, set the redirect as permanent 301, forward path, and enable SSL.</li></ul>
<h4>AWS Lambda</h4>
<p>At this point we have a web app deployed and using our custom URL. The app however uses the subdomain api.example.com to obtain proprietary data. In Angular, the code requesting data from the subdomain may look something like this:</p>
<pre>const headers = {
  headers: new HttpHeaders({
    'Content-Type': 'application/json',
    'x-api-key': 'some-api-key'
  })
};
this.http.get('https://api.example.com/get-data', headers)
  .subscribe((data: string) => {
    const dataJson = JSON.parse(data);
    // some data operations
  });</pre>
<p>If our backend is relatively simple (doesn’t require large third party packages) and runs fast, the easiest solution is to deploy cloud functions at one of the largest providers. AWS limits Lambda deployments to 50MB and the default timeout is 3 seconds which are reasonable guidelines to determine whether your custom API backend is suitable for serverless functions.</p>
<ul><li>6) We need to create a Lambda function. I like to test my lambda functions locally and then deploy them as zip files to AWS. For Python, follow <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html" target="_blank">this</a> tutorial. Lambda supports all major languages and similar tutorials exist for at least Node.js, C#, Go, Java.</li></ul>
<p><a href="/theme/images/1*ktkhPraczTuDeRUkbkhUUA.png.png"><img src="/theme/images/1*ktkhPraczTuDeRUkbkhUUA.png.png" alt="AWS Lambda zip upload" style="width: 100%" loading="lazy"></a></p>
<ul><li>7) Next we need to make the function publicly available so we will use API Gateway to create a public endpoint. Make sure to check that an API key is required and then go to Actions and Deploy API.</li></ul>
<p><a href="/theme/images/1*HgGVxaoWfxJo9jM3WSs-3Q.png.png"><img src="/theme/images/1*HgGVxaoWfxJo9jM3WSs-3Q.png.png" alt="API Gateway endpoint creation" style="width: 100%" loading="lazy"></a></p>
<ul><li>8) Secure the endpoint with at least an API key which can be created in the API Gateway as well.</li></ul>
<p><a href="/theme/images/1*UQQ7OzNmF9iNxeWxlIXYDQ.png.png"><img src="/theme/images/1*UQQ7OzNmF9iNxeWxlIXYDQ.png.png" alt="API key generation example" style="width: 100%" loading="lazy"></a></p>
<ul><li>9) Create a Usage Plan that will limit how often your API can be used. This will prevent your Lambda function from being overused. While AWS Lambda has a very generous free tier, security is paramount for peace of mind. A Usage Plan basically connects the API key (step 8) to the endpoint deployment (step 7). At this point, you should be able to use your Lambda function by going to a URL that looks something like this https://xyz1234567.execute-api.us-east-1.amazonaws.com/stage. Remember that an API key is required as a header so tools such as <a href="https://www.getpostman.com/" target="_blank">Postman</a> are useful to customize the API requests easily; a minimal Python example follows the screenshot below.</li></ul>
<p><a href="/theme/images/1*xbehZsrAZpOs_zB8QDD6Wg.png.png"><img src="/theme/images/1*xbehZsrAZpOs_zB8QDD6Wg.png.png" alt="Usage Plan example" style="width: 100%" loading="lazy"></a></p>
<ul><li>10) Ultimately, we want to have a nice-looking URL such as api.example.com instead of the long random URL above. First, we need to create a certificate for our subdomain so that our connection supports <a href="https://en.wikipedia.org/wiki/Transport_Layer_Security" target="_blank">SSL</a> (https). Go to Certificate Manager and follow the steps to create a certificate managed by AWS:</li></ul>
<p><a href="/theme/images/1*FBijvl_qHBaCx2ue46fSzg.png.png"><img src="/theme/images/1*FBijvl_qHBaCx2ue46fSzg.png.png" alt="AWS Certificate Manager" style="width: 100%" loading="lazy"></a></p>
<ul><li>11) Now that we have a certificate available, return to API Gateway and go to Custom Domain Names and create the API name (such as api.example.com) and select the ACM certificate created in the previous step. Map it to your API deployment. This will generate a Target domain name of the form xyz1234567899.cloudfront.net.</li></ul>
<p><a href="/theme/images/1*et1Xy6T2IAhI2O475rC4Kw.png.png"><img src="/theme/images/1*et1Xy6T2IAhI2O475rC4Kw.png.png" alt="Custom Domain Name" style="width: 100%" loading="lazy"></a></p>
<ul><li>12) Return to your domain registrar’s DNS records, and create a CNAME record pointing to the target domain name above (such as xyz1234567899.cloudfront.net). Now once DNS records propagate, requesting api.example.com is going to terminate at your Lambda function and will be accessible by your Firebase frontend.</li></ul>
<p>That’s it! Now you can deploy a fully featured web app with a custom backend, URL and generous free tiers (as of August 2018). With a little bit of practice the process takes about an hour subject to DNS propagation and requires virtually no backend deployment knowledge. It scales well for most small to medium-sized apps that do not require specialized compute-intensive workloads such as Machine Learning (see my ML deployment article <a href="https://medium.com/coinmonks/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">here</a>).</p>Machine Learning Tutorial #4: Deployment2018-09-02T00:00:00-05:002018-09-02T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-09-02:/blog/machine-learning-tutorial-4-deployment.html<p><a href="/theme/images/1*T_-rIQ8yUgPba_ezxt6ogg.png.png"><img src="/theme/images/1*T_-rIQ8yUgPba_ezxt6ogg.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>In this final phase of the series, I will suggest a few options ML engineers have to deploy their code. In large organizations, this part of the project will be handled by a specialized team which is especially important when scaling is a concern. Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 …</a></p><p><a href="/theme/images/1*T_-rIQ8yUgPba_ezxt6ogg.png.png"><img src="/theme/images/1*T_-rIQ8yUgPba_ezxt6ogg.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>In this final phase of the series, I will suggest a few options ML engineers have to deploy their code. In large organizations, this part of the project will be handled by a specialized team which is especially important when scaling is a concern. Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , #4 Deployment (this article). <a href="https://github.com/adam5ny/blogs/tree/master/ml-deployment" target="_blank">Github code</a>.</p>
<h4>Stack Selection</h4>
<p>The options for deploying ML code are numerous, but I typically decide among at least three general buckets:</p>
<ul><li>Solution provided as-a-service (e.g. Microsoft Azure Machine Learning Studio)</li></ul>
<ul><li>Serverless function (e.g. <a href="https://docs.aws.amazon.com/lambda/latest/dg/python-programming-model.html" target="_blank">AWS Lambda</a>)</li></ul>
<ul><li>Custom backend code (e.g. <a href="http://flask.pocoo.org/docs/0.12/" target="_blank">Python Flask</a> served by <a href="https://devcenter.heroku.com/articles/getting-started-with-python" target="_blank">Heroku</a>)</li></ul>
<h4>As-a-service solution</h4>
<p>Platforms such as Microsoft Azure Machine Learning Studio offer the full suite of tools for the entire project including preprocessing and training. Custom API endpoints are usually easy to generate and writing code is often not necessary thanks to drag-and-drop interfaces. The solutions are often well optimized for <a href="https://en.wikipedia.org/wiki/Lazy_learning" target="_blank">lazy learners</a> where evaluation is the most expensive computational step. The downside is that it is sometimes more challenging to bring in custom code (such as the final model) without going through all the project steps on the platform.</p>
<p><a href="/theme/images/1*4F3z9NovnqtOtIRWRCJn_Q.jpeg.png"><img src="/theme/images/1*4F3z9NovnqtOtIRWRCJn_Q.jpeg.png" alt="As-a-service deployment example: Microsoft Azure" style="width: 100%" loading="lazy"></a></p>
<h4>Serverless function</h4>
<p>Serverless functions are a good solution for inexpensive computations. AWS uses a default timeout of 3 seconds for a function to complete. While timeouts can be extended, the default value is often a good general guideline when deciding about suitability. Lambda only allows 50MB of custom code to be uploaded which is generally not enough for most machine learning purposes. However, functions are well suited for fast computations such as linear regression models. Another downside is that platforms support only specific languages. In terms of Python solutions, AWS Lambda supports versions 2.7 and 3.6 only at the time of writing this article.</p>
<h4>Custom backend code</h4>
<p>Writing a custom backend code on platform such as Heroku or Amazon’s EC2 allows us to replicate fully the code we write on local machines. The code and server deployment can be fully customized for the type of ML algorithm we are deploying. The downside of such solutions is their operational complexity because we need to focus on many steps unrelated to ML such as security.</p>
<p>I will deploy the code on <a href="https://devcenter.heroku.com/articles/getting-started-with-python" target="_blank">Heroku</a> which offers a free tier for testing purposes. The lightweight <a href="http://flask.pocoo.org/" target="_blank">Flask framework</a> will drive the backend. The primary reason for this choice is that it allows us to reuse essentially all the code written in previous tutorials for the backend. We can install Flask with Python 3.6 and all machine learning libraries we used previously side by side.</p>
<p>The entire backend code to run the app is literally a few lines long with Flask:</p>
<pre>import json
import pickle
import pandas as pd
from flask import Flask, jsonify, request, make_response</pre>
<pre>app = Flask(__name__)</pre>
<pre><a href="http://twitter.com/app" target="_blank">@app</a>.route('/forecast', methods=["POST"])
def forecast_post():
"""
Args:
request.data: json pandas dataframe
example: {
"columns": ["date", "open", "high", "low", "close",
"volume"],
"index":[1, 0],
"data": [
[1532390400000, 108, 108, 107, 107, 26316],
[1532476800000, 107, 111, 107, 110, 30702]]
}
"""
if request.data:
df = pd.read_json(request.data, orient='split')
X = preprocess(df)
model = pickle.load(open("dtree_model.pkl", "rb"))
y_pred = run_model(X, model)
resp = make_response(jsonify({
"y_pred": json.dumps(y_pred.tolist())
}), 200)
return resp
else:
return make_response(jsonify({"message": "no data"}), 400)</pre>
<ul><li>pd.read_json(…): reads data from <a href="https://en.wikipedia.org/wiki/POST_(HTTP)" target="_blank">POST request</a> which is a json object corresponding to price data formatted the same way as Yahoo finance prices (our original data source)</li></ul>
<ul><li>preprocess(…): copy of our code from the <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing</a> tutorial that manipulates raw price data into features. Importantly, the scaler used must be the exact same we used in Preprocessing so it has to be saved to pickle file first during Preprocessing and loaded from pickle now</li></ul>
<ul><li>run_model(…): loads and runs our saved final model from the <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">Training</a> tutorial</li></ul>
<ul><li>make_response(…): returns forecasts</li></ul>
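<p>To sanity check the endpoint before deploying, one option is to POST the example payload from the docstring. The sketch below assumes the Flask app is running locally on port 5000; the host and port are assumptions, not part of the original tutorial:</p>
<pre>import requests

# Payload mirrors the docstring example above
payload = {
    "columns": ["date", "open", "high", "low", "close", "volume"],
    "index": [1, 0],
    "data": [
        [1532390400000, 108, 108, 107, 107, 26316],
        [1532476800000, 107, 111, 107, 110, 30702]
    ]
}
resp = requests.post("http://localhost:5000/forecast", json=payload)
print(resp.status_code)
print(resp.json())</pre>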
<h4>Heroku</h4>
<p>Deploying our prediction code to Heroku will require that we collect at least two necessary pieces of our code from previous tutorials: the final model (saved as a pickle file) and the code from the <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing</a> tutorial that transforms the original features we collected from the real world to features our model can handle.</p>
<p>I will not go into details about how to deploy a Docker app on Heroku. There are plenty of good materials including Heroku’s documentation, which is excellent. All the necessary code to run and deploy the Docker app on Heroku is also in the <a href="https://github.com/adam5ny/blogs/tree/master/ml-deployment" target="_blank">Github repo</a>. There are a few key steps to remember:</p>
<ul><li>Save Dockerfile as Dockerfile.web which is a container of all code necessary to run the app</li></ul>
<ul><li>Deploy container using command <a href="https://devcenter.heroku.com/articles/container-registry-and-runtime" target="_blank">heroku container:push</a></li></ul>
<ul><li>Release container using command <a href="https://devcenter.heroku.com/articles/container-registry-and-runtime" target="_blank">heroku container:release</a></li></ul>
<p>At this point our code is deployed which we can test using <a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjmut-U1JvdAhVKsqQKHaQUBg0QFjAAegQIBRAC&url=https%3A%2F%2Fwww.getpostman.com%2F&usg=AOvVaw1vWzpwzQOHi5ErKZnywLDR" target="_blank">Postman</a> to make a manual forecast request:</p>
<p><a href="/theme/images/1*5kvKnVEez88tZ96uTtOqjg.png.png"><img src="/theme/images/1*5kvKnVEez88tZ96uTtOqjg.png.png" alt="Postman sample request" style="width: 100%" loading="lazy"></a></p>
<p>The date is represented by Unix timestamp. The first Body window consists of inputs we provide to the endpoint in the form of prices. The second window returns forecasts from the app.</p>
<h4>Testing</h4>
<p>To test the implementation, I will reuse the code from the Evaluation step. However, instead of making predictions locally using our sklearn model, I will use the Heroku app to predict the 691 samples from Evaluation as a batch. The goal is for the predictions we made on a local machine to perfectly match those made using our deployment stack.</p>
<p>This step is critical to ensure that we can replicate our results remotely using a pre-trained model. The testing code is also available on <a href="https://github.com/adam5ny/blogs/blob/master/ml-deployment/backend/tests/test_app.py" target="_blank">Github</a>. We confirm that the performance of our Heroku app matches the performance generated locally in the Evaluation tutorial:</p>
<p><a href="/theme/images/1*Oewaabcu926MZpC-zFFpHQ.png.png"><img src="/theme/images/1*Oewaabcu926MZpC-zFFpHQ.png.png" alt="Tested deployment performance matches evaluation results" style="width: 100%" loading="lazy"></a></p>
<p>To conclude, the project is intended to provide an overview of the kind of thinking a data science project entails. The code should not be used in production and is provided solely for illustrative purposes. As always, I welcome all constructive feedback (positive or negative) on <a href="https://twitter.com/adam5ny" target="_blank">Twitter</a>.</p>
<p>Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a>, #4 Deployment (this article). <a href="https://github.com/adam5ny/blogs/tree/master/ml-deployment" target="_blank">Github code</a>.</p>Machine Learning Tutorial #3: Evaluation2018-08-19T00:00:00-05:002018-08-19T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-08-19:/blog/machine-learning-tutorial-3-evaluation.html<p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>In this third phase of the series, I will explore the Evaluation part of the ML project. I will reuse some of the code and solutions from the second Training phase. However, it is important to note that the Evaluation phase should be completely separate from training except for using …</p><p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>In this third phase of the series, I will explore the Evaluation part of the ML project. I will reuse some of the code and solutions from the second Training phase. However, it is important to note that the Evaluation phase should be completely separate from training except for using the final model produced in the Training step. Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, #3 Evaluation (this article), <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a>. <a href="https://github.com/adam5ny/blogs/tree/master/ml-evaluation" target="_blank">Github code</a>.</p>
<h4>Performance Metrics</h4>
<p>The goal of this section is to determine how our model from the Training step performs on real life data it has not learned from. First, we have to load the model we saved as the Final model:</p>
<pre>model = pickle.load(open("dtree_model.pkl", "rb"))
>>> model
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=5, min_weight_fraction_leaf=0.0, presort=False, random_state=1, splitter='best')</pre>
<p>Next, we will load the testing data we created in the Preprocessing part of this tutorial. The primary reason why I keep the Evaluation section separate from Training is precisely this step. I keep the code separate as well to ensure that no information from training leaks into evaluation. To restate, we should have not seen the data used in this section at any point until now.</p>
<pre>X = pd.read_csv("X_test.csv", header=0)
y = pd.read_csv("y_test.csv", header=0)</pre>
<p>At this stage, we may perform additional performance evaluation on top of the Training step. However, I will stick to the metrics used previously: MAE, MSE, R2.</p>
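<p>A minimal sketch of computing these metrics with scikit-learn, using the model and test set loaded above:</p>
<pre>from sklearn import metrics

y_pred = model.predict(X)
print("MAE:", metrics.mean_absolute_error(y, y_pred))
print("MSE:", metrics.mean_squared_error(y, y_pred))
print("R2:", metrics.r2_score(y, y_pred))</pre>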
<p><a href="/theme/images/1*5Cjov6KncJ3qLJ-fHi6MTg.png.png"><img src="/theme/images/1*5Cjov6KncJ3qLJ-fHi6MTg.png.png" alt="Decision tree MAE, MSE, R2" style="width: 100%" loading="lazy"></a></p>
<h4>Commentary</h4>
<p>We already knew from the previous tutorial that our model does not perform well enough in practice. However, as I mentioned before, I went ahead and used it here for illustrative purposes in order to complete the tutorial and to explain the kind of thinking involved in real life projects, where performance is not always as ideal out of the box as many toy datasets would make one think.</p>
<p>The key comparison is how well our model evaluates relative to the training phase. In the case of models ready for production, I would expect the performance in the Evaluation step to be comparable to that of the testing folds in the Training phase.</p>
<p>Comparing the last training test fold <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">here</a> (5249 datapoints used to train) and the Evaluation results above:</p>
<ul><li>MAE: final Training phase ~10^-2. Evaluation phase ~10^-2</li></ul>
<ul><li>MSE: final Training phase ~10^-4. Evaluation phase ~10^-3</li></ul>
<ul><li>R²: final Training phase ~0. Evaluation phase ~0</li></ul>
<p>The performance on a dataset the model has never seen before is reasonably similar. Nonetheless, overfitting is still something to potentially address. If we had a model ready for production from the Training phase, we would be reasonably confident at this stage that it would perform as we expect on out of sample data.</p>
<p>Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, #3 Evaluation (this article), <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>Machine Learning Tutorial #2: Training2018-08-12T00:00:00-05:002018-08-12T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-08-12:/blog/machine-learning-tutorial-2-training.html<p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>This second part of the ML Tutorial follows up on the first <a href="https://medium.com/@adam5ny/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing</a> part. All code is available in this <a href="https://github.com/adam5ny/blogs/tree/master/ml-training" target="_blank">Github repo</a>. Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, #2 Training (this article), <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>
<p>I concluded Tutorial #1 with 4 datasets: training features, testing features, training target …</p><p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>This second part of the ML Tutorial follows up on the first <a href="https://medium.com/@adam5ny/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing</a> part. All code is available in this <a href="https://github.com/adam5ny/blogs/tree/master/ml-training" target="_blank">Github repo</a>. Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, #2 Training (this article), <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>
<p>I concluded Tutorial #1 with 4 datasets: training features, testing features, training target variables, and testing target variables. Only training features and training target variables will be used in this Tutorial #2. The testing data will be used for evaluation purposes in Tutorial #3.</p>
<h4>Performance Metrics</h4>
<p>We are focused on regression algorithms so I will consider the 3 most often used performance metrics:</p>
<ul><li><a href="https://en.wikipedia.org/wiki/Mean_absolute_error" target="_blank">Mean Absolute Error</a> (MAE)</li></ul>
<ul><li><a href="https://en.wikipedia.org/wiki/Mean_squared_error" target="_blank">Mean Squared Error</a> (MSE)</li></ul>
<ul><li><a href="https://en.wikipedia.org/wiki/Coefficient_of_determination" target="_blank">R²</a></li></ul>
<p>In practice, a domain-specific decision could be made to supplement the standard metrics above. For example, investors are typically more concerned about significant downside errors rather than upside errors. As a result, a metric could be derived that overemphasizes downside errors corresponding to financial losses.</p>
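<p>As an illustration only (this metric is not used in the rest of the tutorial), such an asymmetric metric could look like the following sketch:</p>
<pre>import numpy as np

def downside_weighted_mae(y_true, y_pred, downside_weight=2.0):
    # Hypothetical asymmetric error: realized returns that come in below the
    # forecast (downside surprises) are weighted more heavily than upside misses
    errors = np.asarray(y_true) - np.asarray(y_pred)
    weights = np.where(errors < 0, downside_weight, 1.0)
    return float(np.mean(weights * np.abs(errors)))

print(downside_weighted_mae([0.01, -0.03], [0.02, 0.01]))</pre>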
<h4>Cross Validation</h4>
<p>I will return to the same topic I addressed in <a href="https://medium.com/@adam5ny/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing</a>. Due to the nature of time series data, standard randomized K-fold validation produces forward looking bias and should not be used. To illustrate the issue here, let’s assume that we split 8 years of data into 8 folds, each representing one year. The first training cycle will use folds #1–7 for training and fold #8 for testing. The next training cycle may use folds #2–8 for training and fold #1 for testing. This is of course unacceptable because we are using data from years 2–8 to forecast year 1.</p>
<p>Our cross validation must respect the temporal sequence of the data. We can use Walk Forward Validation or simply multiple Train-Test Splits. For illustration, I will use 3 Train-Test splits. For example, let’s assume we have 2000 samples sorted by timestamp from the earliest. Our 3 segments would look as follows:</p>
<p><a href="/theme/images/1*cFti5rqcbFrE5p_4My4eww.png.png"><img src="/theme/images/1*cFti5rqcbFrE5p_4My4eww.png.png" alt="Train-Test splits. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<h4>Model Selection</h4>
<p><a href="/theme/images/1*M2clZWay68ODL2jEB-J5Dw.png.png"><img src="/theme/images/1*M2clZWay68ODL2jEB-J5Dw.png.png" alt="ML Model Selection. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>In this section, I will select the models to train. The “Supervised” algorithms section (red section in the image above) is relevant because the dataset contains both features and labels (target variables). I like to follow <a href="https://en.wikipedia.org/wiki/Occam%27s_razor" target="_blank">Occam’s razor</a> when it comes to algorithm selection. In other words, start with the algorithm that exhibits the fastest times to train and the greatest interpretability. Then we can increase complexity.</p>
<p>I will explore the following algorithms in this section:</p>
<ul><li>Linear Regression: fast to learn, easy to interpret</li></ul>
<ul><li>Decision Trees: fast to learn (requires pruning), easy to interpret</li></ul>
<ul><li>Neural Networks: slow to learn, hard to interpret</li></ul>
<h4>Linear Regression</h4>
<p>Starting with linear regression is useful to see if we can “get away” with simple statistics to achieve our goal before diving into complex machine learning algorithms. House price forecasting with clearly defined features is an example where linear regression often works well and using more complex algorithms is unnecessary.</p>
<p>Training a linear regression model using sklearn is simple:</p>
<pre>from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)</pre>
<p>Initial results yielded nothing remotely promising so I took another step and transformed features further. I created polynomial and nonlinear features to account for nonlinear relationships. For example, features [a, b] become [1, a, b, a², ab, b²] in the case of degree-2 polynomial.</p>
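<p>This expansion can be generated with scikit-learn; a quick sketch of the degree-2 case:</p>
<pre>import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # features [a, b]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))  # [[1. 2. 3. 4. 6. 9.]] i.e. [1, a, b, a^2, ab, b^2]</pre>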
<p><a href="/theme/images/1*c9-D9EJoTwKsRlJuK_WQaQ.png.png"><img src="/theme/images/1*c9-D9EJoTwKsRlJuK_WQaQ.png.png" alt="Linear Regression results. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>The x-axis represents 3 cross validation segments (the 1st fold uses 1749 samples for training and 1749 for testing, the 2nd uses 3499 for training and 1749 for testing, and the last uses 5249 for training and 1749 for testing). Clearly, the results suggest that the linear model is not useful in practice. At this stage I have at least the following options:</p>
<ul><li>Ridge regression: addresses overfitting (if any)</li></ul>
<ul><li>Lasso linear: reduces model complexity</li></ul>
<p>At this point, I don’t believe that any of the options above will meaningfully impact the outcome. I will move on to other algorithms to see how they compare.</p>
<p>Before moving on, however, I need to set expectations. There is a saying in finance that successful forecasters only need to be correct 51% of the time. Financial leverage can be used to magnify results so being just a little correct produces impactful outcomes. This sets expectations because we will never find algorithms that are consistently 60% correct or better in this domain. As a result, we expect low R² values. This needs to be said because many sample projects in machine learning are designed to look good, which we can never match in real-life price forecasting.</p>
<h4>Decision Tree</h4>
<p>Training a decision tree regressor model using sklearn is equally simple:</p>
<pre>from sklearn import tree
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)</pre>
<p>The default results for the fit function above almost always <a href="https://en.wikipedia.org/wiki/Overfitting" target="_blank">overfit</a>. Decision trees have a very expressive hypothesis space so they can represent almost any function when not pruned. R² for training data can easily become perfect 1.0 while for testing data the result will be 0. We therefore need to use the max_depth argument of scikit-learn <a href="http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor" target="_blank">DecisionTreeRegressor</a> to enforce that the tree generalizes well for test data.</p>
<p>One of the biggest advantages of decision trees is their interpretability: see many useful <a href="https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176" target="_blank">visualization articles</a> using standard illustrative datasets.</p>
<h4>Neural Networks</h4>
<p>Scikit-learn makes <a href="http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor" target="_blank">simple neural network</a> training just as simple as building a decision tree:</p>
<pre>from sklearn.neural_network import MLPRegressor
model = MLPRegressor(hidden_layer_sizes=(200, 200), solver="lbfgs", activation="relu")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)</pre>
<p>Training a neural net with 2 hidden layers (of 200 units each) and polynomial features starts taking tens of seconds on an average laptop. To speed up the training process in the next section, I will step away from scikit-learn and use <a href="https://keras.io/" target="_blank">Keras</a> with TensorFlow backend.</p>
<p>Keras API is equally simple. The project even includes <a href="https://keras.io/scikit-learn-api/#wrappers-for-the-scikit-learn-api" target="_blank">wrappers for scikit-learn</a> to take advantage of scikit’s research libraries.</p>
<pre>from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
input_size = len(X_train[0])
model.add(Dense(200, activation="relu", input_dim=input_size))
model.add(Dense(200, activation="relu"))
model.add(Dense(1, activation="linear"))
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=25, verbose=1)
y_pred = model.predict(X_test)</pre>
<h4>Hyperparameter Optimization</h4>
<p>The trick to doing hyperparameter optimization is to understand that parameters should not be treated independently. Many parameters interact with each other which is why exhaustive <a href="https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search" target="_blank">grid search</a> is often performed. However, the problem with grid search is that it becomes expensive very quickly.</p>
<h4>Decision Tree</h4>
<p>Our decision tree grid search will iterate over the following inputs (a code sketch follows the list):</p>
<ul><li>splitter: strategy used to split nodes (best or random)</li></ul>
<ul><li>max depth of the tree</li></ul>
<ul><li>min samples per split: the minimum number of samples required to split an internal node</li></ul>
<ul><li>max leaf nodes: number or None (allow unlimited number of leaf nodes)</li></ul>
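<p>A sketch of what such a grid search might look like with scikit-learn; the value ranges below are illustrative assumptions, not the exact grids behind the results that follow:</p>
<pre>from sklearn import tree
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {
    "splitter": ["best", "random"],
    "max_depth": [2, 3, 5, 10],
    "min_samples_split": [2, 5, 10],
    "max_leaf_nodes": [10, 100, None]
}
grid = GridSearchCV(
    tree.DecisionTreeRegressor(random_state=1),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # keep the temporal ordering from the Cross Validation section
    scoring="r2"
)
grid.fit(X_train, y_train)
print(grid.best_params_)</pre>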
<p>Illustrative grid search results are below:</p>
<p><a href="/theme/images/1*CQHDbOr3_ZO7oWkdOYdKQw.png.png"><img src="/theme/images/1*CQHDbOr3_ZO7oWkdOYdKQw.png.png" alt="Grid Search Decision Tree — first rows" style="width: 100%" loading="lazy"></a>
<a href="/theme/images/1*DQHZ9reWIOMF6fq9eKA8IQ.png.png"><img src="/theme/images/1*DQHZ9reWIOMF6fq9eKA8IQ.png.png" alt="Grid Search Decision Tree — last rows" style="width: 100%" loading="lazy"></a></p>
<p>Performance using the best parameters:</p>
<p><a href="/theme/images/1*2qHu4Z1DiJGmx440QCyTAA.png.png"><img src="/theme/images/1*2qHu4Z1DiJGmx440QCyTAA.png.png" alt="Decision Tree results" style="width: 100%" loading="lazy"></a></p>
<p>Again, the results do not seem to be very promising. They appear to be better than linear regression (lower MAE and MSE) but R² is still too low to be useful. I would conclude, however, that the greater expressiveness of decision trees is useful and I would discard the linear regression model at this stage.</p>
<h4>Neural Networks</h4>
<p>Exploring the hyperparameters of the neural net built by Keras, we can alter at least the following parameters:</p>
<ul><li>number of hidden layers and/or units in each layer</li></ul>
<ul><li>model <a href="https://keras.io/optimizers/" target="_blank">optimizer</a> (SGD, Adam, etc)</li></ul>
<ul><li><a href="https://keras.io/activations/" target="_blank">activation function</a> in each layer (relu, tanh)</li></ul>
<ul><li>batch size: the number of samples per gradient update</li></ul>
<ul><li>epochs to train: the number of iterations over the entire training dataset</li></ul>
<p>Illustrative grid search results are below:</p>
<p><a href="/theme/images/1*c_fhFAu5NkphQXM6Do8QXQ.png.png"><img src="/theme/images/1*c_fhFAu5NkphQXM6Do8QXQ.png.png" alt="Grid Search Neural Net — first rows" style="width: 100%" loading="lazy"></a>
<a href="/theme/images/1*CZIYuZ9UrEdoVkWhOtfIWQ.png.png"><img src="/theme/images/1*CZIYuZ9UrEdoVkWhOtfIWQ.png.png" alt="Grid Search Neural Net — last rows" style="width: 100%" loading="lazy"></a></p>
<p>Using the best parameters, we obtain the following performance metrics:</p>
<p><a href="/theme/images/1*auWhs2uGbBher9adskreog.png.png"><img src="/theme/images/1*auWhs2uGbBher9adskreog.png.png" alt="Keras MAE, MSE, R2" style="width: 100%" loading="lazy"></a></p>
<p>Neural net and decision tree results are similar which is common. Both algorithms have very expressive hypothesis spaces and often produce comparable results. If I achieve comparable results, I tend to use the decision tree model for its faster training times and greater interpretability.</p>
<h4>Project Reflection</h4>
<p>At this stage it becomes clear that no model can be used in production. While the decision tree model appears to perform the best, its performance on testing data is still unreliable. At this stage, it would be time to go back and find additional features and/or data sources.</p>
<p>As I mentioned in the first <a href="https://medium.com/@adam5ny/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing Tutorial</a>, finance practitioners might spend months sourcing data and building features. Domain-specific knowledge is crucial and I would argue that financial markets exhibit at least the <a href="https://www.investopedia.com/exam-guide/cfa-level-1/securities-markets/weak-semistrong-strong-emh-efficient-market-hypothesis.asp" target="_blank">Weak-Form of Efficient Market Hypothesis</a>. This implies that future stock returns cannot be predicted from past price movements. I have used only past price movements to develop the models above so practitioners would notice already in the first tutorial that results would not be promising.</p>
<p>For the sake of completing this tutorial, I will go ahead and save the decision tree model and use it for illustrative purposes in the next sections of this tutorial (as if it were the Final production model):</p>
<pre>pickle.dump(model, open("dtree_model.pkl", "wb"))</pre>
<p>Important: there are <a href="https://www.cs.uic.edu/~s/musings/pickle/" target="_blank">known security vulnerabilities</a> in the Python pickle library. To stay on the safe side, the key takeaway is to never unpickle data you did not create.</p>
<h4>Tools</h4>
<p>Tooling is a common question but often not critical until the project is composed of tens of thousands of examples and at least hundreds of features. I typically start with scikit-learn and move elsewhere when performance becomes the bottleneck. <a href="https://www.tensorflow.org/" target="_blank">TensorFlow</a>, for example, is not just a deep learning framework but also contains other algorithms such as <a href="https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor" target="_blank">LinearRegressor</a>. We could train Linear Regression above with TensorFlow and GPUs if scikit-learn does not perform well enough.</p>
<p>Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, #2 Training (this article), <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>Machine Learning Tutorial #1: Preprocessing2018-08-05T00:00:00-05:002018-08-05T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-08-05:/blog/machine-learning-tutorial-1-preprocessing.html<p>In this machine learning tutorial, I will explore 4 steps that define a typical machine learning project: Preprocessing, Learning, Evaluation, and Prediction (deployment). In this first part, I will complete the Preprocessing step. Other tutorials in this series: #1 Preprocessing (this article), <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>
<p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>I will …</p><p>In this machine learning tutorial, I will explore 4 steps that define a typical machine learning project: Preprocessing, Learning, Evaluation, and Prediction (deployment). In this first part, I will complete the Preprocessing step. Other tutorials in this series: #1 Preprocessing (this article), <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>
<p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>I will use stock price data as the main dataset. There are a few reasons why this is a good choice for the tutorial:</p>
<ul><li>The dataset is public by definition and can be easily downloaded from multiple sources so anyone can replicate the work.</li></ul>
<ul><li>Not all features are immediately available from the source and need to be extracted using domain knowledge, resembling real life.</li></ul>
<ul><li>The outcome of the project is highly uncertain which again simulates real life. Billions of dollars are thrown at the stock price prediction problem every year and the vast majority of projects fail. This tutorial is therefore not about creating a magical money-printing machine; it is about replicating the experience a machine learning engineer might have with a project.</li></ul>
<p>All code is located at the following <a href="https://github.com/adam5ny/blogs/tree/master/ml-preprocessing" target="_blank">Github repo</a>. The file “preprocessing.py” drives the analysis. Python 3.6 is recommended and the file includes directions to setup all necessary dependencies.</p>
<p>First we need to download the dataset. I will somewhat arbitrarily choose the Microsoft stock data (source: <a href="https://finance.yahoo.com/quote/MSFT/history?p=MSFT" target="_blank">Yahoo Finance</a>). I will use the entire available history which at the time of writing includes 3/13/1986 — 7/30/2018. The share price performed as follows during this period:</p>
<p><a href="/theme/images/1*lR8eaHKYLjtKsZY_J19pog.png.png"><img src="/theme/images/1*lR8eaHKYLjtKsZY_J19pog.png.png" alt="MSFT stock price. Source https://finance.yahoo.com/chart/MSFT" style="width: 100%" loading="lazy"></a></p>
<p>The price movement is interesting because it exhibits at least two modes of behavior:</p>
<ul><li>the steep rise until the year 2000 when tech stocks crashed</li></ul>
<ul><li>the sideways movement since 2000</li></ul>
<p>This makes for a number of interesting machine learning complexities such as the sampling of training and testing data.</p>
<h4>Data Cleaning</h4>
<p>After some simple manipulations and loading the CSV data into a pandas DataFrame, we have the following dataset, where open, high, low, and close represent prices on each date and volume the total number of shares traded.</p>
<p><a href="/theme/images/1*psQ_9EoBHpiN78QgreAVOQ.png.png"><img src="/theme/images/1*psQ_9EoBHpiN78QgreAVOQ.png.png" alt="Raw dataset includes columns: date, prices (open, high, low, close), trading volume" style="width: 100%" loading="lazy"></a>
<a href="/theme/images/1*pF4V5GC6b2vfC-koQkAynw.png.png"><img src="/theme/images/1*pF4V5GC6b2vfC-koQkAynw.png.png" alt="Raw dataset includes columns: date, prices (open, high, low, close), trading volume" style="width: 100%" loading="lazy"></a></p>
<p>There are no missing values, which I confirmed by running the following command:</p>
<pre>missing_values_count = df.isnull().sum()</pre>
<p><a href="/theme/images/1*RxQtAFDbviXDbYcxU8j02Q.png.png"><img src="/theme/images/1*RxQtAFDbviXDbYcxU8j02Q.png.png" alt="No missing values in dataset" style="width: 100%" loading="lazy"></a></p>
<p>Outliers are the next topic I need to address. The key point here is that our dataset contains prices, but prices are not the metric I will attempt to forecast because they are measured in absolute terms and are therefore hard to compare across time and across assets. In the tables above, the first price available is ~$0.07 while the last is $105.37.</p>
<p>Instead, I will attempt to forecast daily returns. For example, at the end of the second trading day the return was +3.6% (0.073673/0.071132). I will therefore create a return column and use it to analyze possible outliers.</p>
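<p>A minimal sketch, assuming the column names shown above, of how the return column and the outlier tables below could be produced with pandas (the repo's exact definition of the return and the CSV filename may differ):</p>
<pre>import pandas as pd

# hypothetical filename for the Yahoo Finance download
df = pd.read_csv("MSFT.csv")

# daily return: percent change of the closing price
# (equivalent to close / close.shift(1) - 1)
df["return"] = df["close"].pct_change()

# inspect candidate outliers at both tails of the distribution
print(df["return"].nsmallest(5))
print(df["return"].nlargest(5))</pre>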
<p>The 5 smallest daily returns present in the dataset are the following:</p>
<p><a href="/theme/images/1*-FluO_dSIB7Gc8Rlvgry_w.png.png"><img src="/theme/images/1*-FluO_dSIB7Gc8Rlvgry_w.png.png" alt="5 smallest daily returns" style="width: 100%" loading="lazy"></a></p>
<p>And 5 largest daily returns:</p>
<p><a href="/theme/images/1*CvJuQYGojLfLlxpnm9Ut1w.png.png"><img src="/theme/images/1*CvJuQYGojLfLlxpnm9Ut1w.png.png" alt="5 largest daily returns" style="width: 100%" loading="lazy"></a></p>
<p>The most negative return is -30% (index 405) and the largest is 20% (index 3692). Normally, a further domain-specific analysis of the outliers would be necessary here. I will skip it because this tutorial outlines the process for illustrative purposes only. Generally, the data appears to make sense given that the market crashes of 1987 and 2000, both associated with extreme volatility, fall within this period.</p>
<p>The same analysis would be required for the open, high, low, and volume columns; a few illustrative sanity checks are sketched below. Admittedly, data cleaning was somewhat academic here because Yahoo Finance is a widely used and reliable source. It is still a useful exercise for understanding the data.</p>
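<p>For illustration, checks of this kind might look as follows (a sketch only; the column names are assumed to match the dataset above):</p>
<pre># each day's high should bound the other prices from above, the low from below
assert (df["high"] >= df[["open", "close", "low"]].max(axis=1)).all()
assert (df["low"] <= df[["open", "close", "high"]].min(axis=1)).all()

# traded volume should never be negative
assert (df["volume"] >= 0).all()

# summary statistics as a quick plausibility check
print(df[["open", "high", "low", "close", "volume"]].describe())</pre>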
<h4>Target Variable Selection</h4>
<p>We need to define what our ML algorithms will attempt to forecast. Specifically, we will forecast next day’s return. The timing of returns is important here so we are not mistakenly forecasting today’s or yesterday’s return. The formula to define tomorrow’s return as our target variable is as follows:</p>
<pre>df["y"] = df["return"].shift(-1)</pre>
<h4>Feature Extraction</h4>
<p>Now I will turn to some simple transformations of the prices, returns, and volume to <a href="https://en.wikipedia.org/wiki/Feature_extraction" target="_blank">extract features</a> that ML algorithms can consume. Finance practitioners have developed hundreds of such features, but I will only show a few. Hedge funds spend the vast majority of their time on this step because ML algorithms are generally only as useful as the data available, aka “garbage in, garbage out”.</p>
<p>One feature we might consider is how today’s closing price relates to that of 5 trading days ago (one calendar week). I call this feature “5d_momentum”:</p>
<pre>df["5d_momentum"] = df["close"] / df["close"].shift(5)</pre>
<p><a href="/theme/images/1*4dWC4F1sqjmpW5dohmF-Mg.png.png"><img src="/theme/images/1*4dWC4F1sqjmpW5dohmF-Mg.png.png" alt="New 5d_momentum feature" style="width: 100%" loading="lazy"></a></p>
<p>One typical trend following feature is <a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:moving_average_convergence_divergence_macd" target="_blank">MACD</a> (Moving Average Convergence/Divergence Oscillator). The strengths of pandas shine here because MACD can be created in only 4 lines of code. The chart of the MACD indicator is below. On the lower graph, a typical buy signal would be the blue “macd_line” crossing above the orange line representing a 9-day exponential moving average of the “macd_line”. The inverse would represent a sell signal.</p>
<p><a href="/theme/images/1*cA6MrDLu1Fuwd4pIDoS0fQ.png.png"><img src="/theme/images/1*cA6MrDLu1Fuwd4pIDoS0fQ.png.png" alt="MACD of stock price" style="width: 100%" loading="lazy"></a></p>
<p>The python code “generate_features.py” located in the Github repo mentioned above includes additional features we might consider. For example:</p>
<ul><li><a href="https://www.investopedia.com/articles/active-trading/052014/how-use-moving-average-buy-stocks.asp" target="_blank">Trend: Moving Average</a></li></ul>
<p><a href="/theme/images/1*2ip_ErJJ73742mxoNoGknA.png.png"><img src="/theme/images/1*2ip_ErJJ73742mxoNoGknA.png.png" alt="MSFT Moving Average 50 day — 200 day" style="width: 100%" loading="lazy"></a></p>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:parabolic_sar" target="_blank">Trend: Parabolic SAR</a></li></ul>
<p><a href="/theme/images/1*uRA6nhA4QpXoqhl6dfECrg.png.png"><img src="/theme/images/1*uRA6nhA4QpXoqhl6dfECrg.png.png" alt="MSFT SAR" style="width: 100%" loading="lazy"></a></p>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:stochastic_oscillator_fast_slow_and_full" target="_blank">Momentum: Stochastic Oscillator</a></li></ul>
<p><a href="/theme/images/1*Qt0JrOJuvdUBelJ_ddGO1g.png.png"><img src="/theme/images/1*Qt0JrOJuvdUBelJ_ddGO1g.png.png" alt="MSFT Stochastic Oscillator" style="width: 100%" loading="lazy"></a></p>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:commodity_channel_index_cci" target="_blank">Momentum: Commodity Channel Index (CCI)</a></li></ul>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:relative_strength_index_rsi" target="_blank">Momentum: Relative Strength Index (RSI)</a></li></ul>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:bollinger_bands" target="_blank">Volatility: Bollinger Bands</a></li></ul>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:average_true_range_atr" target="_blank">Volatility: Average True Range</a></li></ul>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:on_balance_volume_obv" target="_blank">Volume: On Balance Volume (OBV)</a></li></ul>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:chaikin_oscillator" target="_blank">Volume: Chaikin Oscillator</a></li></ul>
<p>At the end of the feature extraction process, we have the following features:</p>
<pre>['return', 'close_to_open', 'close_to_high', 'close_to_low', 'macd_diff', 'ma_50_200', 'sar', 'stochastic_oscillator', 'cci', 'rsi', '5d_volatility', '21d_volatility', '60d_volatility', 'bollinger', 'atr', 'on_balance_volume', 'chaikin_oscillator']</pre>
<h4>Sampling</h4>
<p>We need to split the data into training and testing buckets. I cannot stress enough that the testing dataset should never be used in the Learning step. It will be used only in the Evaluation step so that performance metrics are completely independent of training and represent an unbiased estimate of actual performance.</p>
<p>Normally, we could randomize the sampling of the testing data, but time series data is often not well suited for randomized sampling. The reason is that it would bias the learning process. For example, randomization could produce a situation where a data point from 1/1/2005 is used in the Learning step to later forecast a return from 1/1/2003.</p>
<p>I will therefore choose a much simpler approach and use the first 7000 samples as the training dataset for Learning and the remaining 962 as the testing dataset for Evaluation.</p>
<p>Both datasets will be saved as CSV files, so we conclude this part of the ML tutorial by storing 4 files (MSFT_X_learn.csv, MSFT_y_learn.csv, MSFT_X_test.csv, MSFT_y_test.csv). These will be consumed by the next steps of this tutorial.</p>
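<p>A sketch of the chronological split and the CSV export described above (the "y" target column and the construction of the feature column list are assumptions based on the steps in this article):</p>
<pre># keep the feature columns built earlier; "y" holds tomorrow's return
feature_cols = [c for c in df.columns if c != "y"]
X_train, y_train = df[feature_cols].iloc[:7000], df["y"].iloc[:7000]
X_test, y_test = df[feature_cols].iloc[7000:], df["y"].iloc[7000:]

X_train.to_csv("MSFT_X_learn.csv", index=False)
y_train.to_csv("MSFT_y_learn.csv", index=False)
X_test.to_csv("MSFT_X_test.csv", index=False)
y_test.to_csv("MSFT_y_test.csv", index=False)</pre>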
<h4>Scaling</h4>
<p>Feature scaling is used to reduce the time to Learn. This typically applies to <a href="https://en.wikipedia.org/wiki/Feature_scaling#Application" target="_blank">stochastic gradient descent and SVM</a>.</p>
<p>The open source<a href="http://scikit-learn.org/stable/index.html" target="_blank"> sklearn</a> package will be used for most additional ML application so I will start using it here to <a href="http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling" target="_blank">scale all features</a> to have zero mean and unit variance:</p>
<pre>from sklearn import preprocessing
scaler_model = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler_model.transform(X_train)
X_test_scaled = scaler_model.transform(X_test)</pre>
<p>It is important that data sampling takes place before features are modified to avoid any training to testing data leakage.</p>
<h4>Dimensionality Reduction</h4>
<p>At this stage, our dataset has 17 features. The number of features has a significant impact on the speed of learning. We could use a number of techniques to reduce the number of features so that only the most “useful” ones remain.</p>
<p>Many hedge funds would be working with hundreds of features at this stage, so dimensionality reduction would be critical. In our case, we only have 17 illustrative features, so I will keep them all in the dataset until I explore the learning times of different algorithms.</p>
<p>Out of curiosity, however, I will perform Principal Component Analysis <a href="http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html" target="_blank">(PCA)</a> to get an idea of how many features we could create from our dataset without losing meaningful explanatory power.</p>
<pre>from sklearn.decomposition import PCA
sk_model = PCA(n_components=10)
sk_model.fit_transform(features_ndarray)
print(sk_model.explained_variance_ratio_.cumsum())
[0.30661571 0.48477408 0.61031358 0.71853895 0.78043556 0.83205298
0.8764804 0.91533986 0.94022672 0.96216244]</pre>
<p>The first 8 components explain 91.5% of the data variance. The downside of PCA is that the new features live in a lower-dimensional space, so they no longer correspond to real-life concepts. For example, the first original feature could be the “macd_line” I derived above. After PCA, the first component explains 31% of the variance, but we no longer have any logical description of what it represents in real life.</p>
<p>For now, I will keep all 17 original features, but note that if the learning time of the algorithms is too slow, PCA will be helpful.</p>
<p>Other tutorials in this series: #1 Preprocessing (this article), <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>Linear programming in Python: CVXOPT and game theory2017-08-16T00:00:00-05:002017-08-16T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2017-08-16:/blog/linear-programming-in-python-cvxopt-and-game-theory.html<p>CVXOPT is an excellent Python package for linear programming. However, when I was getting started with it, I spent way too much time getting it to work with simple game theory example problems. This tutorial aims to shorten the startup time for everyone trying to use CVXOPT for more advanced …</p><p>CVXOPT is an excellent Python package for linear programming. However, when I was getting started with it, I spent way too much time getting it to work with simple game theory example problems. This tutorial aims to shorten the startup time for everyone trying to use CVXOPT for more advanced problems.</p>
<p>All code is available <a href="http://github.com/adam5ny/blogs/tree/master/cvxopt" target="_blank">here</a>.</p>
<p>Installation of dependencies:</p>
<ul><li>Using Docker is the fastest way to run the code: in only 5 commands you can replicate my environment.</li></ul>
<ul><li>Alternatively, the code has the following dependencies: Python (3.5.3), numpy (1.12.1), cvxopt (1.1.9), and the glpk optimizer (you can use the default optimizer instead; glpk is better for some more advanced problems).</li></ul>
<p>Please review <a href="http://cvxopt.org/examples/tutorial/lp.html" target="_blank">how CVXOPT solves simple maximization problems</a>. While this article focuses on game theory problems, it is critical to understand how CVXOPT defines optimization problems in general.</p>
<p>The first problem we will solve is a <a href="http://en.wikipedia.org/wiki/Minimax#Example" target="_blank">2-player zero-sum game</a>.</p>
<p>The constraints matrix A is defined as</p>
<pre>A = [[3, -2, 2], [-1, 0, 4], [-4, -3, 1]]</pre>
<p>Next, we define a maxmin helper function:</p>
<pre>import numpy as np
from cvxopt import matrix, solvers

def maxmin(A, solver="glpk"):
    num_vars = len(A)
    # minimize matrix c
    c = [-1] + [0 for i in range(num_vars)]
    c = np.array(c, dtype="float")
    c = matrix(c)
    # constraints G*x <= h
    G = np.matrix(A, dtype="float").T  # reformat: each variable is in a row
    G *= -1  # minimization constraint
    G = np.vstack([G, np.eye(num_vars) * -1])  # > 0 constraint for all vars
    new_col = [1 for i in range(num_vars)] + [0 for i in range(num_vars)]
    G = np.insert(G, 0, new_col, axis=1)  # insert utility column
    G = matrix(G)
    h = ([0 for i in range(num_vars)] +
         [0 for i in range(num_vars)])
    h = np.array(h, dtype="float")
    h = matrix(h)
    # constraints A*x = b
    A = [0] + [1 for i in range(num_vars)]
    A = np.matrix(A, dtype="float")
    A = matrix(A)
    b = np.matrix(1, dtype="float")
    b = matrix(b)
    sol = solvers.lp(c=c, G=G, h=h, A=A, b=b, solver=solver)
    return sol</pre>
<p>Last, we use the maxmin helper function to solve our example problem:</p>
<pre>sol = maxmin(A=A, solver="glpk")
probs = sol[“x”]
print(probs)
[ 1.67e-01]
[ 8.33e-01]
[ 0.00e+00]</pre>
<p>In other words, player A chooses action 1 with probability 1/6 and action 2 with probability 5/6.</p>
<p>Next we will solve a Correlated Equilibrium problem called Game of Chicken as defined on page 3 of <a href="http://www.cs.rutgers.edu/~mlittman/topics/nips02/nips02/greenwald.ps" target="_blank">this document</a>. The constraints matrix A is defined as</p>
<pre>A = [[6, 6], [2, 7], [7, 2], [0, 0]]</pre>
<p>Next, we define two helper functions, ce and build_ce_constraints:</p>
<pre>def ce(A, solver=None):
    num_vars = len(A)
    # maximize matrix c
    c = [sum(i) for i in A]  # sum of payoffs for both players
    c = np.array(c, dtype="float")
    c = matrix(c)
    c *= -1  # cvxopt minimizes so *-1 to maximize
    # constraints G*x <= h
    G = build_ce_constraints(A=A)
    G = np.vstack([G, np.eye(num_vars) * -1])  # > 0 constraint for all vars
    h_size = len(G)
    G = matrix(G)
    h = [0 for i in range(h_size)]
    h = np.array(h, dtype="float")
    h = matrix(h)
    # constraints A*x = b
    A = [1 for i in range(num_vars)]
    A = np.matrix(A, dtype="float")
    A = matrix(A)
    b = np.matrix(1, dtype="float")
    b = matrix(b)
    sol = solvers.lp(c=c, G=G, h=h, A=A, b=b, solver=solver)
    return sol</pre>
<pre>def build_ce_constraints(A):
    num_vars = int(len(A) ** (1/2))
    G = []
    # row player
    for i in range(num_vars):  # action row i
        for j in range(num_vars):  # action row j
            if i != j:
                constraints = [0 for i in A]
                base_idx = i * num_vars
                comp_idx = j * num_vars
                for k in range(num_vars):
                    constraints[base_idx+k] = (- A[base_idx+k][0]
                                               + A[comp_idx+k][0])
                G += [constraints]
    # column player
    for i in range(num_vars):  # action column i
        for j in range(num_vars):  # action column j
            if i != j:
                constraints = [0 for i in A]
                for k in range(num_vars):
                    constraints[i + (k * num_vars)] = (
                        - A[i + (k * num_vars)][1]
                        + A[j + (k * num_vars)][1])
                G += [constraints]
    return np.matrix(G, dtype="float")</pre>
<p>Using the helper functions, we solve the Game of Chicken:</p>
<pre>sol = ce(A=A, solver="glpk")
probs = sol["x"]
print(probs)
[ 5.00e-01]
[ 2.50e-01]
[ 2.50e-01]
[ 0.00e+00]</pre>
<p>In other words, the optimal strategy is for both players to select actions [6, 6] 50% of the time, actions [2, 7] 25% of the time, and actions [7, 2] also 25% of the time.</p>
<p>Hopefully this overview helps in getting you started with linear programming and game theory in Python.</p>
<p>Credits: <a href="http://cvxopt.org/examples/tutorial/lp.html" target="_blank">cvxopt.org/examples/tutorial/lp.html</a><a href="https://www.cs.duke.edu/courses/fall12/cps270/lpandgames.pdf" target="_blank">, cs.duke.edu/courses/fall12/cps270/lpandgames.pdf</a><a href="https://en.wikipedia.org/wiki/Minimax#Example" target="_blank">, en.wikipedia.org/wiki/Minimax#Example</a><a href="https://www3.ul.ie/ramsey/Lectures/Operations_Research_2/gametheory4.pdf" target="_blank">, https://www3.ul.ie/ramsey/Lectures/Operations_Research_2/gametheory4.pdf</a><a href="https://www.cs.rutgers.edu/~mlittman/topics/nips02/nips02/greenwald.ps" target="_blank">, cs.rutgers.edu/~mlittman/topics/nips02/nips02/greenwald.ps</a><a href="https://www.cs.duke.edu/courses/fall16/compsci570/LPandGames.pdf" target="_blank">, cs.duke.edu/courses/fall16/compsci570/LPandGames.pdf</a></p>