<p>Notes by Adam Novotny, <a href="https://adamnovotny.com">adamnovotny.com</a></p>
<h2>Deploying Language Models With Gradio On Hugging Face (2023-10-14)</h2>
<p>Machine learning models (including language models) can be deployed easily using the generous <a href="https://huggingface.co/pricing#spaces">free tier on Hugging Face</a> and <a href="https://www.gradio.app/guides/quickstart">Gradio</a>, a Python-based open-source UI library, by following these steps.</p>
<p>See the live deployed app and source code <a href="https://huggingface.co/spaces/AdamNovotnyCom/llama2-gradio-huggingface">here</a>.</p>
<ol>
<li>
<p>For local development, create the <a href="https://huggingface.co/spaces/AdamNovotnyCom/llama2-gradio-huggingface/blob/main/Dockerfile_dev">following Dockerfile</a>. It differs from the production Dockerfile in how secrets are loaded and in its use of <pre>CMD ["gradio", "app.py"]</pre> which reruns (and reloads) the source files every time a change is detected.</p>
</li>
<li>
<p><a href="https://huggingface.co/spaces/AdamNovotnyCom/llama2-gradio-huggingface/blob/main/docker-compose.yml">docker-compose</a> will launch the development Dockerfile using command <pre>export HF_TOKEN=paste_HF_token && docker-compose -f docker-compose.yml up gradiohf</pre> where HF_TOKEN is an optional personal token provided by Hugging Face to ensure that license restrictions are being followed for certain models (such as Llama 2).</p>
</li>
<li>
<p>Develop your Gradio <a href="https://huggingface.co/spaces/AdamNovotnyCom/llama2-gradio-huggingface/blob/main/app.py">app.py</a>. This deployed example represents the smallest possible version that selects a language model based on the environment variable <strong>os.environ.get("MODEL")</strong>. The selection includes Llama 2, which requires a paid Spaces plan to run on Hugging Face (with no code changes!). The live example runs a small <em>toy</em> model, <a href="https://huggingface.co/google/flan-t5-small">google/flan-t5-small</a>, that easily runs on the free tier.</p>
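<p>For orientation, a minimal sketch of such an app.py might look like the following. This is not the deployed file itself; it assumes the Hugging Face <i>transformers</i> pipeline API and reuses the MODEL environment variable pattern described above:</p>
<pre>
import os

import gradio as gr
from transformers import pipeline

# "google/flan-t5-small" is the toy model mentioned above; override via the MODEL env var
model_name = os.environ.get("MODEL", "google/flan-t5-small")
generator = pipeline("text2text-generation", model=model_name)

def respond(prompt):
    # return only the generated text of the first candidate
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

demo = gr.Interface(fn=respond, inputs="text", outputs="text")

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
</pre>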
</li>
<li>
<p>View your Gradio app running locally in a browser at <pre>http://0.0.0.0:7860</pre></p>
</li>
<li>
<p>Create production <a href="https://huggingface.co/spaces/AdamNovotnyCom/llama2-gradio-huggingface/blob/main/Dockerfile">Dockerfile</a> and deploy on Hugging Face Spaces using this <a href="https://huggingface.co/docs/hub/spaces-sdks-docker">great documentation</a>.</p>
</li>
</ol>
<h4>Example of Gradio UI deployed on Hugging Face</h4>
<p><a href="/theme/images/deploying-language-models-with-gradio-on-huggingface-overview.png"><img src="/theme/images/deploying-language-models-with-gradio-on-huggingface-overview.png" alt="Normal equation" style="width: 100%" loading="lazy"></a></p>Machine Learning Notes2022-05-07T00:00:00-05:002022-05-07T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2022-05-07:/blog/machine-learning-notes.html<p><iframe
title="ML notes notebook"
width="100%"
height="5000px"
src="/notebooks/ml_notes.html">
</iframe>
</p>
<h4>Contents</h4>
<ul>
<li><a href="#algorithms">Algorithms</a></li>
<li><a href="#bayes">Bayes</a></li>
<li><a href="#explainability">Explainability</a></li>
<li><a href="#mlops">MLOps</a></li>
<li><a href="#model_evaluation">Model Evaluation</a></li>
<li><a href="#preprocessing">Preprocessing</a></li>
<li><a href="#reinforcement_learning">Reinforcement Learning</a></li>
<li><a href="#sql">SQL</a></li>
<li><a href="#statistics">Statistics</a></li>
</ul>
<h4>Algorithms <span id="algorithms"></span></h4>
<ul>
<li>K-means: aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion. Use the <a href="https://www.scikit-yb.org/en/latest/api/cluster/elbow.html">“elbow” method</a> to identify the right number of means. <a href="https://scikit-learn.org/stable/modules/clustering.html#k-means">scikit tutorial</a></li>
<li>KNN: Simple, flexible, naturally handles multiple classes. Slow at scale, sensitive …</li></ul><p><iframe
title="ML notes notebook"
width="100%"
height="5000px"
src="/notebooks/ml_notes.html">
</iframe>
</p>
<h4>Contents</h4>
<ul>
<li><a href="#algorithms">Algorithms</a></li>
<li><a href="#bayes">Bayes</a></li>
<li><a href="#explainability">Explainability</a></li>
<li><a href="#mlops">MLOps</a></li>
<li><a href="#model_evaluation">Model Evaluation</a></li>
<li><a href="#preprocessing">Preprocessing</a></li>
<li><a href="#reinforcement_learning">Reinforcement Learning</a></li>
<li><a href="#sql">SQL</a></li>
<li><a href="#statistics">Statistics</a></li>
</ul>
<h4>Algorithms <span id="algorithms"></span></h4>
<ul>
<li>K-means: aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion. Use the <a href="https://www.scikit-yb.org/en/latest/api/cluster/elbow.html">“elbow” method</a> to identify the right number of clusters (k). <a href="https://scikit-learn.org/stable/modules/clustering.html#k-means">scikit tutorial</a></li>
<li>KNN: Simple, flexible, naturally handles multiple classes. Slow at scale, sensitive to feature scaling and irrelevant features. <a href="https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification">scikit tutorial</a></li>
<li>Linear Discriminant Analysis (LDA): A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix. <a href="https://scikit-learn.org/stable/modules/lda_qda.html#mathematical-formulation-of-the-lda-and-qda-classifiers">scikit tutorial</a></li>
<li>Linear regression<ul>
<li>assumptions (LINE) <a href="https://online.stat.psu.edu/stat500/lesson/9/9.2/9.2.3#paragraph--3265">source</a><ul>
<li>Linearity</li>
<li>Independence of errors</li>
<li>Normality of errors</li>
<li>Equal variances</li>
<li>Tests of assumptions: i) plot each feature on x-axis vs y_error, ii) plot y_predicted on x-axis vs y_error, iii) histogram of errors.</li>
</ul>
</li>
<li>An overspecified model can be used to predict the label, but should not be used to ascribe the effect of a single feature on the label.</li>
<li><a href="http://cecas.clemson.edu/~ahoover/ece854/lecture-notes/lecture-normeqs.pdf">Linear algebra solution</a><a href="/theme/images/1*i0ylsCBDeVY5rFlGa9AYWg.png.png"><img src="/theme/images/1*i0ylsCBDeVY5rFlGa9AYWg.png.png" alt="Normal equation" style="width: 100%" loading="lazy"></a></li>
</ul>
</li>
<li>Naive Bayes: uses naive conditional independence assumption of features. <a href="https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes">scikit</a></li>
<li>PCA: transform data using k vectors that minimize the perpendicular distance to points. PCA can also be thought of as an <a href="https://online.stat.psu.edu/stat505/lesson/11/11.2">eigenvalue/eigenvector decomposition</a>. <a href="https://scikit-learn.org/stable/modules/decomposition.html#pca">scikit</a>. <a href="https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf">Intuition paper</a></li>
<li>Pearson’s correlation coefficient. <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">wiki</a>. <a href="/theme/images/1*qtdPV-XQhTYACKS7beLDpg.jpeg.png"><img src="/theme/images/1*qtdPV-XQhTYACKS7beLDpg.jpeg.png" alt="Correlation formula" style="width: 100%" loading="lazy"></a></li>
<li>Random Forests: each tree is built using a sample of rows (with replacement) from the training set. Less prone to overfitting than a single decision tree. <a href="https://scikit-learn.org/stable/modules/ensemble.html#random-forests">scikit</a></li>
<li>RNN: <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">Karpathy tutorial</a></li>
<li>Sorting <a href="https://lamfo-unb.github.io/2019/04/21/Sorting-algorithms">tutorial</a>. <a href="/theme/images/rO1H18bCodMa.png"><img src="/theme/images/rO1H18bCodMa.png" alt="Sorting algorithms comparison" style="width: 100%" loading="lazy"></a></li>
<li>Stochastic gradient descent <a href="https://realpython.com/gradient-descent-algorithm-python/#basic-gradient-descent-algorithm">tutorial</a>. Calculus solution: <a href="/theme/images/1*_6C1R-IamnPtIo0jLOoblw.png.png"><img src="/theme/images/1*_6C1R-IamnPtIo0jLOoblw.png.png" alt="Stochastic gradient descent cost function" style="width: 100%" loading="lazy"></a></li>
<li>SVD: Singular Value Decomposition <a href="https://towardsdatascience.com/svd-8c2f72e264f"> intuition with PCA use case</a></li>
<li>SVM: Effective in high dimensional spaces (or when number of dimensions > number of examples). SVMs do not directly provide probability estimates. <a href="https://scikit-learn.org/stable/modules/svm.html#svm-classification">scikit</a></li>
<li>Transformers <a href="https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/">tutorial</a><a href="/theme/images/1232021073114943.png"><img src="/theme/images/1232021073114943.png" alt="Original transformer architecture" style="width: 100%" loading="lazy"></a></li>
</ul>
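<p>To make the normal-equation item above concrete, here is a small NumPy sketch (synthetic data, not taken from the linked notes) that solves for the ordinary least squares coefficients directly:</p>
<pre>
# Normal equation: beta = (X^T X)^{-1} X^T y, solved with a linear solver
# rather than an explicit matrix inverse for numerical stability.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 samples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

X1 = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)  # normal equation
print(beta)                                  # approx [0.0, 2.0, -1.0, 0.5]
</pre>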
<h4>Bayes <span id="bayes"></span></h4>
<ul>
<li><a href="https://www.nature.com/articles/s43586-020-00001-2">Nature article overview</a></li>
<li><a href="https://towardsdatascience.com/bayesian-a-b-testing-in-pymc3-54dceb87af74">Bayesian A/B Testing in PyMC3</a></li>
<li><a href="https://towardsdatascience.com/bayesian-inference-intuition-and-example-148fd8fb95d6">Inference — Intuition and Example (Beta & Binomial)</a></li>
</ul>
<h4>Explainability <span id="explainability"></span></h4>
<ul>
<li>Books: <a href="https://christophm.github.io/interpretable-ml-book/">Interpretable Machine Learning</a></li>
<li>Tutorials: <a href="https://www.twosigma.com/articles/interpretability-methods-in-machine-learning-a-brief-survey/">twosigma: a brief survey</a></li>
<li>EthicalML tools <a href="https://github.com/EthicalML/awesome-production-machine-learning#explaining-black-box-models-and-datasets">EthicalML github</a></li>
<li>Partial dependence plots (PDP): x-axis = value of a single feature, y-axis = average model prediction, marginalizing over the other features. <a href="https://scikit-learn.org/stable/modules/partial_dependence.html#partial-dependence-plots">scikit</a></li>
<li>Individual conditional expectation (ICE): x-axis = value of a single feature, y-axis = model prediction for an individual sample (one line per sample). <a href="https://scikit-learn.org/stable/modules/partial_dependence.html#individual-conditional-expectation-ice-plot">scikit</a></li>
<li>Permutation feature importance: randomly shuffle each feature and measure the impact on a model metric such as F1 (a scikit-learn sketch follows this list). <a href="https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-feature-importance">scikit</a></li>
<li>Global surrogate: train an easily interpretable model (such as linear regression) on the predictions made by a black-box model.</li>
<li>Local surrogate: LIME (Local Interpretable Model-agnostic Explanations). Fit a simple interpretable model to perturbed copies of an individual example to approximate the black-box model's behavior around that one prediction.</li>
<li>Shapley Value (SHAP): the contribution of each feature is measured by adding and removing it across all feature subsets. The Shapley value for one feature is the weighted sum of all its contributions.</li>
</ul>
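<p>A small scikit-learn sketch of the permutation feature importance item above (toy data; the model and metric are arbitrary choices):</p>
<pre>
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# shuffle each feature 10 times and measure the average drop in F1
result = permutation_importance(model, X_test, y_test, n_repeats=10, scoring="f1", random_state=0)
print(result.importances_mean)
</pre>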
<h4>MLOps <span id="mlops"></span></h4>
<ul>
<li><a href="https://github.com/bentoml/BentoML">BentoML</a>: open platform that simplifies ML model deployment by saving models in a standard format, defining a web service with pre/post processing, deploying the web service in a container</li>
<li>Data <a href="https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/">a16z</a>
<a href="/theme/images/1*LYBSxf0MPcERPzkEJlk9Cw.png.png"><img src="/theme/images/1*LYBSxf0MPcERPzkEJlk9Cw.png.png" alt="A Unified Data Infra" style="width: 100%" loading="lazy"></a></li>
<li>ML Blueprint <a href="https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/">a16z</a>
<a href="/theme/images/1*MqMX4k5IupAK9T9vKs5h8g.png.png"><img src="/theme/images/1*MqMX4k5IupAK9T9vKs5h8g.png.png" alt="AI and ML Blueprint" style="width: 100%" loading="lazy"></a></li>
<li>Lifecycle <a href="https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/">AWS blog</a> <a href="/theme/images/mvaymymdlhxpalecniyphkibwaqhmboz.jpg"><img src="/theme/images/mvaymymdlhxpalecniyphkibwaqhmboz.jpg" alt="ML lifecycle" style="width: 100%" loading="lazy"></a></li>
<li>EthicalML/awesome-production-machine-learning <a href="https://github.com/EthicalML/awesome-production-machine-learning">EthicalML github</a></li>
<li>Pipeline tools <a href="https://github.com/EthicalML/awesome-production-machine-learning#data-pipeline-etl-frameworks">EthicalML/data-pipeline-etl-frameworks</a></li>
<li>MLOps Google <a href="https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning#mlops_level_2_cicd_pipeline_automation">Google</a>
<a href="/theme/images/20220115104222.png"><img src="/theme/images/20220115104222.png" alt="A Unified Data Infra" style="width: 100%" loading="lazy"></a></li>
</ul>
<h4>Model evaluation <span id="model_evaluation"></span></h4>
<ul>
<li>Classification:<ul>
<li>Recall: <a href="https://en.wikipedia.org/wiki/Precision_and_recall#Recall">wiki</a></li>
<li>Receiver operating characteristic (ROC): relates the true positive rate (y-axis) to the false positive rate (x-axis), where TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A scikit-learn sketch follows at the end of this section. <a href="https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc">scikit</a></li>
</ul>
</li>
<li>Regression<ul>
<li>R2: measures the strength of a linear relationship; can be near 0 for nonlinear relationships. Training R2 never decreases as more features are added. <a href="https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score">scikit</a></li>
</ul>
</li>
<li>Learning curves <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#plotting-learning-curves">scikit tutorial</a> <a href="/theme/images/1*fz1sqw361u7Y_D1G-aDEmw.png.png"><img src="/theme/images/1*fz1sqw361u7Y_D1G-aDEmw.png.png" alt="Learning Curve example" style="width: 100%" loading="lazy"></a></li>
<li>Overfitting and regularization<ul>
<li>Overfitting (high variance) options: more data, increase regularization, or decrease model complexity. <a href="https://rmartinshort.jimdofree.com/2019/02/17/overfitting-bias-variance-and-leaning-curves/">tutorial</a></li>
<li>Underfitting (high bias) options: decrease regularization, increase model complexity</li>
<li>Lasso regression: linear model regularization technique with tendency to prefer solutions with fewer non-zero coefficients. <a href="https://scikit-learn.org/stable/modules/linear_model.html#lasso">scikit tutorial</a>. <a href="/theme/images/1*bvk1Esh-TGPCIub2ggNzQg.png.png"><img src="/theme/images/1*bvk1Esh-TGPCIub2ggNzQg.png.png" alt="Lasso equation" style="width: 100%" loading="lazy"></a></li>
<li>Ridge regression: imposes a penalty on the size of the coefficients
<a href="/theme/images/1*fekJIBmDHMoU2zQ6wVmQkA.png.png"><img src="/theme/images/1*fekJIBmDHMoU2zQ6wVmQkA.png.png" alt="Ridge Regression" style="width: 100%" loading="lazy"></a><a href="https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification">scikit</a></li>
<li>Validation curve: <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html#plotting-validation-curves">scikit</a><a href="/theme/images/1*HVM4sFhGDTNE40xr5aVCiQ.png.png"><img src="/theme/images/1*HVM4sFhGDTNE40xr5aVCiQ.png.png" alt="validation curve example" style="width: 100%" loading="lazy"></a></li>
</ul>
</li>
</ul>
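<p>A short scikit-learn sketch of the ROC item above (toy data and model; any classifier that outputs scores or probabilities works):</p>
<pre>
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
print("AUC:", roc_auc_score(y_test, scores))
</pre>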
<h4>Preprocessing <span id="preprocessing"></span></h4>
<ul>
<li><a href="https://scikit-learn.org/stable/modules/preprocessing.html">scikit</a></li>
<li>Analysis<ol>
<li>Remove duplicates</li>
<li>SOCS of each feature: Shape (skew), Outliers, Center, Spread</li>
<li>Feature correlation</li>
</ol>
</li>
<li>Production pipeline<ol>
<li>Outliers: remove or apply non-linear transformations</li>
<li>Missing values; imbalanced classes<ul>
<li>SMOTE: generate a new minority-class point on the vector between a minority-class point and one of its nearest neighbors, placed a random fraction (between 0 and 1) of the way from the original point. The algorithm is parameterized with k_neighbors; see the sketch after this list. <a href="https://www.kaggle.com/residentmario/oversampling-with-smote-and-adasyn">tutorial</a></li>
</ul>
</li>
<li>Standardization</li>
<li>Discretization</li>
<li>Encoding categorical features</li>
<li>Generating polynomial features</li>
<li>Dimensionality reduction</li>
</ol>
</li>
</ul>
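<p>A minimal SMOTE sketch using the <i>imbalanced-learn</i> package (an assumption on my part; the notes above link a Kaggle tutorial instead):</p>
<pre>
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# 5% minority class
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# k_neighbors controls which minority neighbors new points are interpolated toward
smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print("after:", Counter(y_res))
</pre>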
<h4>Reinforcement Learning <span id="reinforcement_learning"></span></h4>
<ul>
<li><a href="/theme/images/1*TMQs5IMfL3k9OZwy1cck_A.png"><img src="/theme/images/1*TMQs5IMfL3k9OZwy1cck_A.png" alt="Reinforcement learning" style="width: 100%" loading="lazy"></a></li>
</ul>
<h4>SQL <span id="sql"></span></h4>
<ul>
<li>window functions, row_number() and partition(): <a href="https://docs.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql?view=sql-server-ver15#d-using-row_number-with-partition">tutorial</a></li>
<li>COALESCE(): evaluates the arguments in order and returns the current value of the first expression that initially doesn’t evaluate to NULL. <a href="https://docs.microsoft.com/en-us/sql/t-sql/language-elements/coalesce-transact-sql?view=sql-server-ver15">tutorial</a></li>
</ul>
<h4>Statistics <span id="statistics"></span></h4>
<ul>
<li><a href="https://www.statology.org/tutorials/">Statology tutorial</a></li>
<li>Means<ul>
<li>Arithmetic: <a href="https://mathworld.wolfram.com/ArithmeticMean.html">wolfram</a></li>
<li>Geometric: used in finance to calculate average growth rates and is referred to as the compounded annual growth rate. <a href="https://mathworld.wolfram.com/GeometricMean.html">wolfram</a></li>
<li>Harmonic: used in finance to average multiples such as the price-earnings ratio because it gives equal weight to each data point; a weighted arithmetic mean of such ratios would give greater weight to high data points because the prices in price-earnings ratios are not normalized while the earnings are. <a href="https://mathworld.wolfram.com/HarmonicMean.html">wolfram</a></li>
</ul>
</li>
<li>Probability distributions <a href="https://www.statology.org/statistics-socs/">Description acronym SOCS</a>: shape, outliers, center, spread. <a href="https://medium.com/@srowen/common-probability-distributions-347e6b945ce4">Comparison article</a>. <a href="/theme/images/Bf8a4LtHWOrJ.png"><img src="/theme/images/Bf8a4LtHWOrJ.png" alt="Common probability distributions" style="width: 100%" loading="lazy"></a><ul>
<li>Beta: probability distribution on probabilities bounded [0, 1]. <a href="https://towardsdatascience.com/beta-distribution-intuition-examples-and-derivation-cf00f4db57af">tutorial</a></li>
<li>Binomial: probability of obtaining k successes in n binomial experiments with probability p. <a href="https://www.statology.org/binomial-distribution/">tutorial</a></li>
<li>Normal: empirical rule is sometimes called the 68-95-99.7 rule</li>
<li>Poisson: the probability of obtaining k successes during a given time interval. <a href="https://www.statology.org/poisson-distribution/">Statology tutorial</a>. <a href="https://builtin.com/data-science/poisson-process">tutorial 2</a>.<a href="https://timeseriesreasoning.com/contents/zero-inflated-poisson-regression-model/">Zero Inflated Poisson Regression Model</a></li>
</ul>
</li>
<li>Sample variance: divided by n-1 to achieve an unbiased estimator, because 1 degree of freedom is used to estimate the sample mean. <a href="https://online.stat.psu.edu/stat500/lesson/1/1.5/1.5.3#paragraph--3051">tutorial</a></li>
<li>Tests <a href="/theme/images/1*ShYx679GlV5WVL8ukd2j2w.png"><img src="/theme/images/1*ShYx679GlV5WVL8ukd2j2w.png" alt="Selecting statistical test. Source: Statistical Rethinking 2. Free Chapter 1" style="width: 100%" loading="lazy"></a><ul>
<li>ANOVA: Analysis of variance compares the means of three or more independent groups to determine if there is a statistically significant difference between the corresponding population means. <a href="https://www.statology.org/one-way-anova/">Statology tutorial</a></li>
<li>F-statistic: determines whether to reject a reduced (R) model in favor of a full (F) model. Reject the reduced model if F is large, or equivalently if its associated p-value is small. <a href="https://online.stat.psu.edu/stat501/lesson/6/6.2#paragraph--785">tutorial</a><a href="/theme/images/1*7Vz6m3tqtLAvxF_Xqe2JAQ.png.png"><img src="/theme/images/1*7Vz6m3tqtLAvxF_Xqe2JAQ.png.png" alt="F-statistic" style="width: 100%" loading="lazy"></a></li>
<li>Linear regression coefficient CI: <a href="https://online.stat.psu.edu/stat501/node/644">tutorial</a><a href="/theme/images/1*hQ5pabjmSByQSC5O4r_uRw.png.png"><img src="/theme/images/1*hQ5pabjmSByQSC5O4r_uRw.png.png" alt="t-interval for slope parameter beta_1" style="width: 100%" loading="lazy"></a></li>
<li>T-test: <a href="https://online.stat.psu.edu/stat555/node/36/">tutorial</a><a href="/theme/images/1*R1ysZ-ofSr5wXwE0_emXiQ.png.png"><img src="/theme/images/1*R1ysZ-ofSr5wXwE0_emXiQ.png.png" alt="T-test formula" style="width: 100%" loading="lazy"></a></li>
</ul>
</li>
</ul>
<h2>Machine Learning Docker Template (2021-12-18)</h2>
<h3>Contents</h3>
<ul>
<li><a href="#summary">Summary</a></li>
<li><a href="#code">Code</a></li>
</ul>
<h3>Summary <span id="summary"></span></h3>
<p>The purpose of this post is to propose a template for machine learning projects that strives to follow these principles:</p>
<ol>
<li>All data scientists can quickly set up an identical development environment based on Docker that encourages good software engineering practices.</li>
<li>Dependency management is handled during the environment's startup by <a href="https://docs.conda.io/en/latest/miniconda.html">Miniconda</a> and requires minimal manual changes.</li>
<li>Notebooks are encouraged for exploration. However, for production purposes notebooks must be version controlled, parametrized, and run using <a href="https://github.com/nteract/papermill">Papermill</a> (a short Papermill sketch follows this list).</li>
</ol>
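<p>A short Papermill sketch of point 3; the notebook paths and parameters below are hypothetical:</p>
<pre>
import papermill as pm

pm.execute_notebook(
    "notebooks/train_model.ipynb",      # version-controlled input notebook
    "output/train_model_run.ipynb",     # executed copy stored as an artifact
    parameters={"learning_rate": 0.01, "n_estimators": 200},
)
</pre>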
<h3>Code <span id="code"></span></h3>
<p>The template is available on github <a href="https://github.com/adamnovotnycom/machine-learning-docker-template">adamnovotnycom/machine-learning-docker-template</a>. The general template structure looks as follows:
<a href="/theme/images/20220129115215.png"><img src="/theme/images/20220129115215.png" alt="ML Template" style="width: 100%" loading="lazy"></a></p>
<ol>
<li>
<p><b>Dockerfile</b> defines the development environment and uses Miniconda as base image
<pre>
FROM continuumio/miniconda3
...
RUN conda env create -f conda.yml
RUN echo "source activate dev" > ~/.bashrc
...
</pre></p>
</li>
<li>
<p><b>conda.yaml</b> is used for dependency management and includes standard data science packages.</p>
</li>
<li><b>ml_docker_template</b> package should include all production code that can be installed and run by an external system. As a result, the code can be developed locally but also easily runs on an external machine when additional compute power is needed for model training or when additional permissions are required for deployment.</li>
</ol>
<h2>Keras LSTM Forecasting Using Synthetic Data (2021-11-13)</h2>
<h3>Contents</h3>
<ul>
<li><a href="#summary">Summary</a></li>
<li><a href="#notebook">Notebook</a></li>
</ul>
<h3>Summary <span id="summary"></span></h3>
<p>Keras LSTM can be a powerful tool for forecasting. Below is a simple template notebook showing how to set up a data science forecasting experiment.</p>
<h4>Dataset</h4>
<p>A synthetic dataset was generated using a scikit-learn regression generator <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html#sklearn.datasets.make_friedman1" target="_blank">make_friedman1</a>. The dataset is nonlinear, with noise, and some features are manually scaled to make the deep learning task more challenging. Time series dependence is created by making each label a weighted average of the <i>make_friedman1</i> generated values and previous labels. For details see notebook function <i>generate_data()</i>.</p>
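<p>For reference, a sketch of the generator call (the lag-based label construction lives in the notebook's <i>generate_data()</i> and is omitted here; the parameter values are illustrative):</p>
<pre>
import pandas as pd
from sklearn.datasets import make_friedman1

# only the first 5 of the 10 features are informative by construction
X, y = make_friedman1(n_samples=2000, n_features=10, noise=1.0, random_state=0)
df = pd.DataFrame(X, columns=[f"x_{i}" for i in range(X.shape[1])])
df["label"] = y
print(df.head())
</pre>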
<p>The image below shows correlations between the generated features and the <i>future_label</i> we are trying to forecast. Features x_0 - x_4 are the only informative features, as can be verified from the bottom row showing meaningful but not very strong correlations:
<a href="/theme/images/lstm_oin235no.png"><img src="/theme/images/lstm_oin235no.png" alt="Feature correlations with future_label" style="width: 100%" loading="lazy"></a></p>
<h4>Model training</h4>
<p>The model is a simple NN with a single hidden layer defined as
<i>keras.layers.LSTM(32)</i>. The generated dataset is split into training, validation, and test sets, each honoring the time series nature of the data. The validation set is used to stop training early and prevent overfitting. However, this is not a concern for our synthetic dataset, as the following chart shows. The validation curve never starts increasing as training epochs continue:
<a href="/theme/images/lstm_synthetic_data_9827345.png"><img src="/theme/images/lstm_synthetic_data_9827345.png" alt="Validation loss" style="width: 100%" loading="lazy"></a></p>
<h4>Model evaluation</h4>
<p>Comparing predictions and actual labels for the validation set shows strong performance even though there are clear optimizations that can be made near extreme values:
<a href="/theme/images/lstm_synthetic_data_val_q4598.png"><img src="/theme/images/lstm_synthetic_data_val_q4598.png" alt="Validation loss" style="width: 100%" loading="lazy"></a></p>
<p>However, the validation set was already used during training for early stopping. This is why we set aside a test dataset the model has never seen during training. The test dataset is the only true evaluation of the expected performance of the model and in this case it confirms that the model performs well for the synthetic dataset:
<a href="/theme/images/lstm_synthetic_data_test_234897f.png"><img src="/theme/images/lstm_synthetic_data_test_234897f.png" alt="Validation loss" style="width: 100%" loading="lazy"></a></p>
<h4>Notebook <span id="notebook"></span></h4>
<ul>
<li><a href="/blog/lstm-forecast-synthetic-data.html#notebook">embedded in blog post</a></li>
<li><a href="/notebooks/lstm_synthetic_data.html" target="_blank">as html</a></li>
<li><a href="https://gist.github.com/adamnovotnycom/36af4c4400a7f970982685472661eba1" target="_blank">as Github Gist</a></li>
</ul>
<p><iframe
title="Keras LSTM Forecasting Using Synthetic Data notebook"
width="100%"
height="17000px"
src="/notebooks/lstm_synthetic_data.html">
</iframe>
</p>
<h2>Scikit-learn Pipeline with Feature Engineering (2021-08-30)</h2>
<h4>Contents</h4>
<ol>
<li><a href="#summary">Summary</a></li>
<li><a href="#notebook">Notebook</a></li>
</ol>
<h4 id="summary">Summary</h4>
<p>In general, a machine learning pipeline should have the following characteristics:</p>
<p><a href="/theme/images/1*8PUAA9DjMv6CMsPWhbayIQ.png.png"><img src="/theme/images/1*8PUAA9DjMv6CMsPWhbayIQ.png.png" alt="scikit-learn logo" style="width: 50%" loading="lazy"></a></p>
<ul>
<li>To ensure data consistency, the pipeline should include every step (such as feature engineering) required to train and score training and testing datasets, and score real time requests. The pipeline does not need to include one-off steps such as removing duplicates.</li>
<li>Numerical features are transformed using scikit-learn classes. SimpleImputer is used to fill missing values and StandardScaler for scaling.</li>
<li>Categorical columns are similarly transformed. OneHotEncoder is applied to the columns containing categorical values. Importantly, I like to define the categories argument to prevent the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality" target="_blank">Curse of dimensionality</a> that might occur when too many categories are present.</li>
<li>An example custom feature engineering class DailyTrendFeature is included in the pipeline for illustration.</li>
<li>The pipeline allows for parallel preprocessing subject to the limits of the computing environment. For example, the preprocessing of categorical and numerical features can take place in parallel because the transformation steps are independent of each other. This is accomplished using scikit-learn's <pre>FeatureUnion(n_jobs=-1, ...)</pre> class that combines other pipeline steps. A condensed sketch follows this list.</li>
<li><a href="https://gist.github.com/adamnovotnycom/a09294f179d8e483d5411eb5c8c4e00f" target="_blank">Notebook as Github Gist</a></li>
</ul>
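<p>A condensed sketch of such a preprocessing pipeline. It uses scikit-learn's ColumnTransformer, a close cousin of the FeatureUnion approach in the notebook; the column names are made up and the custom DailyTrendFeature step is omitted:</p>
<pre>
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["price", "minimum_nights"]        # hypothetical columns
categorical_cols = ["neighbourhood"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# fixing categories (or handle_unknown) up front guards against the curse of dimensionality
categorical_pipe = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# independent branches can be processed in parallel, as with FeatureUnion(n_jobs=-1, ...)
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
], n_jobs=-1)
</pre>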
<h4 id="notebook">Notebook</h4>
<p><iframe
title="Scikit-learn Pipeline with Feature Engineering notebook"
width="100%"
height="10000px"
src="/notebooks/sklearn_pipe.html">
</iframe>
</p>
<h2>Global Temperature Forecast Using Prophet and CO2 (2021-05-16)</h2>
<p>
In this article I will leverage the global <a href="https://adamnovotny.com/blog/berkeley-earth-global-temperature-data2.html">temperature dataset I discussed previously</a> to make a temperature forecast using <a href="https://facebook.github.io/prophet/">Facebook Prophet</a> for the next 50 years. Note: the temperature dataset serves ONLY as a vehicle to learn how to do forecasting using Prophet. In general, climate and other complex sciences cannot be solved using a simple tool such as Prophet.</p>
<p> All code can be found in this <a href="https://gist.github.com/adamnovotnycom/8752aa0732576eac32de4e0b9fbda601">gist</a>.
</p>
<section>
<h4>Data</h4>
<p>
To review, the temperature dataset covers monthly data since 1850 including 95% confidence intervals (high CI - blue, low CI - red):
</p>
<a href="/theme/images/bx08l40tssl2p9wv.png">
<img style="width: 100%" loading="lazy" alt="temperature dataset" src="/theme/images/bx08l40tssl2p9wv.png">
</a>
<p>
In addition, I will use the CO2 emissions data from <a href="https://ourworldindata.org/grapher/annual-co2-emissions-per-country?tab=chart&time=1924..latest&country=~OWID_WRL">ourworldindata.org</a>:
</p>
<a href="/theme/images/4ob68f4dlgcu1r4z.png">
<img style="width: 100%" loading="lazy" alt="CO2 emissions dataset" src="/theme/images/4ob68f4dlgcu1r4z.png">
</a>
</section>
<section>
<h4>Forecast</h4>
<p>
I will only highlight here how the Prophet API works (specifically when we want to include an additional regressor such as CO2). First, we need to format the training dataset such that the label column is named <i>y</i> and the date column <i>ds</i>
</p>
<a href="/theme/images/4na1arghpydyjqu6.png">
<img style="width: 100%" loading="lazy" alt="Prophet training dataset" src="/theme/images/4na1arghpydyjqu6.png">
</a>
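<p>A small pandas sketch of that reshaping step (the source column names here are assumptions; <i>co2_monthly_bn_tons</i> is the regressor used below):</p>
<pre>
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("1850-01-01", periods=3, freq="MS"),
    "temperature_C": [13.2, 13.4, 13.9],
    "co2_monthly_bn_tons": [0.02, 0.02, 0.02],
})

# Prophet expects the date column to be named "ds" and the label "y";
# extra regressor columns keep their own names
prophet_train_set = df.rename(columns={"date": "ds", "temperature_C": "y"})
print(prophet_train_set.head())
</pre>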
<p>
Next, we train the Prophet model and add the custom regressor (CO2):
</p>
<pre>
from prophet import Prophet  # on older installs: from fbprophet import Prophet

m = Prophet()
m.add_regressor("co2_monthly_bn_tons",
prior_scale=0.5,
mode="multiplicative",
standardize=True)
m.fit(prophet_train_set)
</pre>
<p>
Then we need to create a forecast dataset that includes the dates to be forecasted and assumptions for the custom regressor. In the temperature forecasting dataset, I created timestamps for the next 50 years. Last 3 rows of the forecast dataset ("prophet_forecast_set"):
</p>
<a href="/theme/images/h12zg3yiwmdk3pa1.png">
<img style="width: 100%" loading="lazy" alt="Prophet forecast dataset" src="/theme/images/h12zg3yiwmdk3pa1.png">
</a>
<p>
In order to create the dataset above, I had to make an assumption about CO2 growth. I assumed that monthly growth over the next 50 years will continue at the same pace as it has between 2000-2020:
</p>
<a href="/theme/images/9p5vaqrkoh3u5bgr.png">
<img style="width: 100%" loading="lazy" alt="CO2 growth assumptions" src="/theme/images/9p5vaqrkoh3u5bgr.png">
</a>
<p>
In reality, the value of the temperature forecast comes from the data scientist's background knowledge of the field. In this example, in order for the temperature forecast to be valuable, we have to be able to forecast CO2 emissions (and other regressors) with high confidence.
</p>
<p>
Performing the actual forecast using Prophet is very simple:
</p>
<pre>
forecast_prophet = m.predict(prophet_forecast_set)
forecast_prophet.head(5)
</pre>
<p>
Prophet generates valuable confidence intervals for its forecast. These confidence bars are more valuable than the point forecast itself. In the chart below, the point forecast in 2070 is 16.1C, but the interval ranges widely, from 15.2C to nearly 17C.
</p>
<a href="/theme/images/66bdwdd86os7jq45.png">
<img style="width: 100%" loading="lazy" alt="Temperature forecast" src="/theme/images/66bdwdd86os7jq45.png">
</a>
</section>
<section>
<h4>Validation</h4>
<p>
The step that many people doing forecasts "conveniently" skip is validation. In other words, if we had approached the problem the same way in the past, how incorrect would we turn out to be today?
</p>
<p>
Let's assume that we are standing in 1970, and we apply the exact same methodology as above to forecast the next 50 years (so we are forecasting 1970-2020). What would the forecasting graphs look like compared to the reality we've already experienced? First, our hypothetical CO2 assumption would match reality reasonably nicely:
</p>
<a href="/theme/images/4npp2fdw5x5cinm3.png">
<img style="width: 100%" loading="lazy" alt="Hypothetical CO2 forecast since 1970" src="/theme/images/4npp2fdw5x5cinm3.png">
</a>
<p>
However, our temperature point forecast would underestimate reality. The actual temperatures still fall within the forecast's confidence intervals, nearly perfectly aligning with the upper bound, but the behavior of the point forecast doesn't appear to reflect the upward slope we've experienced historically:
</p>
<a href="/theme/images/b74bpz4kt579ecxe.png">
<img style="width: 100%" loading="lazy" alt="Hypothetical temperature forecast since 1970" src="/theme/images/b74bpz4kt579ecxe.png">
</a>
<p>
This is an example of why confidence intervals are more important than point estimates. Also, it reflects how important it is to be intellectually honest when forecasting and performing historical validation. The takeaway here might be that we are missing additional regressors to be able to properly forecast the temperature physical process.
</p>
</section>
<h2>Berkeley Earth Global Temperature Data (2021-05-14)</h2>
<p><a href="http://berkeleyearth.org/data/">Berkeley Earth</a> publishes a unique dataset with global temperature measurements. Below is a guide to downloading the data and starting to analyze it using Python. All code can be found in this <a href="https://gist.github.com/adamnovotnycom/e844fbfdbcc563123cbbfcd96604bb7b">gist</a>.</p>
<p><a href="/theme/images/pztgdtmbiuigjqxlyuejjkjmprukqdkbjqvcbdc.png"><img style="width: 100%" loading="lazy" alt="Berkeley Earth air temperature measurements above sea ice" src="/theme/images/pztgdtmbiuigjqxlyuejjkjmprukqdkbjqvcbdc.png"></a></p>
<p>Download the .txt file from the <a href="http://berkeleyearth.org/data/">Berkeley Earth</a> data website, section "Land + Ocean (1850 — Recent)", and read it using the following Python command:</p>
<pre>
colspecs = [(2, 6), (10, 12), (14, 22), (24, 29)]
df = pd.read_fwf(
"/content/drive/My Drive/Colab Notebooks/berkeley_earth/data/Land_and_Ocean_complete.txt",
colspecs=colspecs,
header=85
)
df.columns = ["year", "month", "anomaly_C", "confidence_95_C"]
df.head(12)
</pre>
<p>colspecs defines the column index ranges, so (2, 6) represents the year column in the source text file.</p>
<p><a href="/theme/images/0dpug746gz4r84cl.png"><img style="width: 100%" loading="lazy" alt="/theme/images/" src="/theme/images/0dpug746gz4r84cl.png"></a></p>
<p>The data documentation explains that <i>anomaly_C</i> is the recorded temperature anomaly in Celsius relative to the estimated Jan 1951-Dec 1980 global mean temperature of 14.108 +/- 0.02. The chart below shows the absolute air temperatures along with 95% uncertainty intervals (in green) recorded during the 2000s.</p>
<h2>Dynamic HTML with Python, AWS Lambda, and Containers (2021-03-27)</h2>
<p>This article is an extension of my previous article describing a similar <a href="https://adamnovotny.com/blog/serving-dynamic-web-pages-using-python-and-aws-lambda.html" target="_blank">deployment process using native AWS Lambda tools</a>. However, Amazon has since started <a href="https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/" target="_blank">supporting container images</a> and updated its pricing policy to <a href="https://aws.amazon.com/blogs/aws/new-for-aws-lambda-1ms-billing-granularity-adds-cost-savings/" target="_blank">1ms granularity</a>. Both are major developments that improve tooling and make small deployments cost effective.</p>
<p><a href="/theme/images/1*WSpeFmskKx0xiwx-WRRJ6A.jpeg.png"><img src="/theme/images/1*WSpeFmskKx0xiwx-WRRJ6A.jpeg.png" alt="Deploying AWS Lambda using a container" style="width: 100%" loading="lazy"></a></p>
<p>My <a href="https://adamnovotny.com/blog/serving-dynamic-web-pages-using-python-and-aws-lambda.html" target="_blank">previous</a> article focused on the logic of the code and didn’t address how to actually deploy the function because that was well covered by AWS in its many tutorials. Here I explore the new container deployment options while keeping all business logic untouched. Please review the AWS tutorial on deploying <a href="https://docs.aws.amazon.com/lambda/latest/dg/python-image.html" target="_blank">generic Python Lambda code using containers</a>, which I leveraged below.</p>
<h4>1. Dockerfile</h4>
<pre>FROM public.ecr.aws/lambda/python:3.8
RUN mkdir -p /mnt/app
ADD app.py /mnt/app
ADD index.html /mnt/app
WORKDIR /mnt/app
RUN pip install --upgrade pip
RUN pip install Jinja2==2.11.*
CMD ["/mnt/app/app.handler"]</pre>
<p>I am using the AWS base image because it is packaged with a very nice mini server that simulates function responses when developing locally. This is extremely useful because we can call the function with 100s of arguments and verify that it behaves as expected before it is deployed.</p>
<h4>App code</h4>
<p>From the Dockerfile, we can see that all application code is contained in two files:</p>
<p>1) app.py:</p>
<pre>import os
from jinja2 import Environment, FileSystemLoader</pre>
<pre>def lambda_handler(event, context):
env = Environment(loader=FileSystemLoader(os.path.join(os.path.dirname(__file__), "."), encoding="utf8"))
my_name_from_query = False
if event["queryStringParameters"] and "my_name" in event["queryStringParameters"]:
my_name_from_query = event["queryStringParameters"]["my_name"]
template = env.get_template("index.html")
html = template.render(
my_name=my_name_from_query
)
return {
"statusCode": 200,
"body": html,
"headers": {
"Content-Type": "text/html",
}
}</pre>
<p>2) index.html:</p>
<p><a href="/theme/images/1*aJfoRhYbqPXONxww3qyYeA.png.png"><img src="/theme/images/1*aJfoRhYbqPXONxww3qyYeA.png.png" alt="index.html" style="width: 100%" loading="lazy"></a></p>
<p>app.py simply parses one argument named “my_name” from the Lambda query string and passes it to the html template as a variable named “my_name”. Jinja2 then renders the variable into the final template.</p>
<h4>Calling and testing the app locally</h4>
<p>Testing the app locally is very simple thanks to the new container packaging. Simply run docker-compose -f docker-compose.yml up, where the docker-compose.yml file is defined as:</p>
<pre>version: '3'
services:
cont_name:
container_name: cont_name
image: cont_name_img
build:
context: .
dockerfile: Dockerfile
volumes:
- .:/mnt/app
ports:
- "9000:8080"
stdin_open: true
tty: true
restart: always</pre>
<p>This stands up the function locally on a simple AWS-provided server. We can send requests and monitor responses using Python code such as:</p>
<pre>import requests
r = requests.get(
"http://localhost:9000/2015-03-31/functions/function/invocations",
data=open("event.json", "rb")
)
print(r.json())</pre>
<p>where “event.json” is any .json file we wish to send to the lambda function as arguments. In the example case above, we would send something like:</p>
<pre>{
"queryStringParameters": {
"my_name": "Adam"
}
}</pre>
<h4>Cost</h4>
<p>The simple AWS base server returns responses such as the one below. This is where we can see the significant impact of the new 1ms pricing update. The billed duration of running this example code is about 9ms, which is very small considering that we are returning a full html template to browsers. However, previously AWS would charge for the full 100ms because that was the minimum charge defined. Now, this function could cost nearly 90% less!</p>
<p><a href="/theme/images/1*5Xq3l1IxQmCWQxIkSg4wLw.png.png"><img src="/theme/images/1*5Xq3l1IxQmCWQxIkSg4wLw.png.png" alt="Lambda duration" style="width: 100%" loading="lazy"></a></p>Google Colab and Auto-sklearn with Profiling2021-03-20T00:00:00-05:002021-03-20T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2021-03-20:/blog/google-colab-and-auto-sklearn-with-profiling.html<p>This article is a follow up to my previous tutorial on how to <a href="https://adamnovotny.com/blog/google-colab-and-automl-auto-sklearn-setup.html" target="_blank">setup Google Colab and auto-sklean</a>. Here, I will go into more detail that shows auto-sklearn performance on an artificially created dataset. The full notebook gist can be found <a href="https://gist.github.com/adamnovotnycom/ffe8e3961fe0207c64a1b9a074883e51" target="_blank">here</a>.</p>
<p>First, I generated a regression dataset using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html" target="_blank">scikit …</a></p><p>This article is a follow up to my previous tutorial on how to <a href="https://adamnovotny.com/blog/google-colab-and-automl-auto-sklearn-setup.html" target="_blank">setup Google Colab and auto-sklean</a>. Here, I will go into more detail that shows auto-sklearn performance on an artificially created dataset. The full notebook gist can be found <a href="https://gist.github.com/adamnovotnycom/ffe8e3961fe0207c64a1b9a074883e51" target="_blank">here</a>.</p>
<p>First, I generated a regression dataset using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html" target="_blank">scikit learn</a>.</p>
<pre>X, y, coeff = make_regression(
n_samples=1000,
n_features=100,
n_informative=5,
noise=0,
shuffle=False,
coef=True
)</pre>
<p><a href="/theme/images/1*Nv5JrZA6e7M9-K_gPxsspg.jpeg.png"><img src="/theme/images/1*Nv5JrZA6e7M9-K_gPxsspg.jpeg.png" alt="Subset of 100 generated features" style="width: 100%" loading="lazy"></a></p>
<p>This generates a dataset with 100 numerical features where the first 5 features are informative (these are labeled as “feat_0” to “feat_4”). The rest (“feat_5” to “feat_99”) are random noise. We can see this in the scatter matrix above where only the first 5 features show a correlation with the label.</p>
<p>We know that this is a simple regression problem which could be solved using a linear regression perfectly. However, knowing what to expect helps us to verify the performance of auto-sklearn which trains its ensemble model using the following steps:</p>
<pre>import autosklearn.regression

automl = autosklearn.regression.AutoSklearnRegressor(
time_left_for_this_task=300,
n_jobs=-1
)
automl.fit(
X_train_transformed,
df_train["label"]
)</pre>
<p>I also created random categorical features which are then one-hot-encoded into a feature set “X_train_transformed“. Running the AutoSklearnRegressor for 5 minutes (time_left_for_this_task=300) produced the following expected results:</p>
<pre>predictions = automl.predict(X_train_transformed)
r2_score(df_train["label"], predictions)
>> 0.999
predictions = automl.predict(X_test_transformed)
r2_score(df_test["label"], predictions)
>> 0.999</pre>
<p>A separate pip package <a href="https://github.com/VIDA-NYU/PipelineVis" target="_blank">PipelineProfiler</a> helps us visualize the steps auto-sklearn took to achieve the result:</p>
<p><a href="/theme/images/1*9ZWW9HeGqTjkan4qtd4mDQ.jpeg.png"><img src="/theme/images/1*9ZWW9HeGqTjkan4qtd4mDQ.jpeg.png" alt="PipelineProfiler output" style="width: 100%" loading="lazy"></a></p>
<p>Above we can see the attempts auto-sklearn made to generate the best ensemble of models within the 5 minute constraint I set. The best model found was <a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html" target="_blank">Liblinear SVM</a>, which produced an R2 of nearly 1.0. As a result, this toy ensemble model gives a weight of 1.0 to just one algorithm. Libsvm Svr and Gradient boosting scored between 0.9–0.96.</p>
<h2>Google Colab and AutoML: Auto-sklearn Setup (2020-12-04)</h2>
<p>Auto ML is fast becoming a popular solution to build minimum viable models for new projects. A popular library for Python is <a href="https://automl.github.io/auto-sklearn/master/#" target="_blank">Auto-sklearn</a>, which leverages the most popular Python ML library <a href="http://sklearn.org" target="_blank">scikit-learn</a>. Auto-sklearn runs a smart search over scikit-learn models and parameters to find the best performing ensemble of models.</p>
<p><a href="/theme/images/1*n6-MAHisW5-xLrEUndHe5g.png.png"><img src="/theme/images/1*n6-MAHisW5-xLrEUndHe5g.png.png" alt="Logos of Google Drive + Colab + Scikit-learn + Auto-sklearn" style="width: 100%" loading="lazy"></a></p>
<p>This tutorial describes how to set up Auto-sklearn on <a href="https://colab.research.google.com/" target="_blank">Google Colab</a>. The complete <a href="https://gist.github.com/adamnovotnycom/1df7ef10649d8241c389c96becb7fe37" target="_blank">notebook gist</a> includes a toy project that uses an <a href="https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data" target="_blank">old Airbnb dataset</a> from Kaggle.</p>
<p>The key first step is to install Linux dependencies alongside Auto-sklearn:</p>
<pre>!sudo apt-get install build-essential swig
!pip install auto-sklearn==0.11.1</pre>
<p>After running these commands in Colab, restart the Colab runtime and run all commands again.</p>
<p>The Airbnb dataset can be used for a regression project where price is the label. I selected a few numerical and categorical features randomly so the dataset used for modeling has the following characteristics:</p>
<p><a href="/theme/images/1*-lXTkg7Y9W-XMdNPR5KPkA.png.png"><img src="/theme/images/1*-lXTkg7Y9W-XMdNPR5KPkA.png.png" alt="Airbnb dataset description" style="width: 100%" loading="lazy"></a></p>
<p>A more sophisticated ML project would require a detailed feature selection process and data analysis at this stage. For example, does the maximum value of 1,250 for minimum_nights make sense? In this case, I am simply showing the Auto-sklearn setup so I will skip these time consuming steps.</p>
<p>Next, all numerical features are <a href="https://en.wikipedia.org/wiki/Standard_score" target="_blank">standardized</a> and missing values filled. Scikit-learn (and therefore Auto-sklearn) cannot handle string categories so categorical features are <a href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/" target="_blank">one hot encoded</a>. Also, infrequently appearing categories are combined into a single bucket to combat the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality" target="_blank">Curse of dimensionality</a>. In this case, any neighborhood that appears less than 0.5% of the time is renamed to “neighborhood_other”. Before transformations, the first 5 rows of the training dataset have the following items:</p>
<p><a href="/theme/images/1*5zbUTS8k6rTqTYUtpATzlw.png.png"><img src="/theme/images/1*5zbUTS8k6rTqTYUtpATzlw.png.png" alt="Training dataset before transformations" style="width: 100%" loading="lazy"></a></p>
<p>After transformations, the first few columns of the 5 rows look like this:</p>
<p><a href="/theme/images/1*AySz4rydwvMNfOnt4v-UpA.png.png"><img src="/theme/images/1*AySz4rydwvMNfOnt4v-UpA.png.png" alt="Training dataset after transformations" style="width: 100%" loading="lazy"></a></p>
<p>I am finally ready to explore Auto-sklearn using a few simple commands that fit a new model:</p>
<pre>import autosklearn.regression
automl = autosklearn.regression.AutoSklearnRegressor(
time_left_for_this_task=120,
per_run_time_limit=30,
n_jobs=1
)
automl.fit(
X_train_transformed,
y_train
)</pre>
<p>Finally, here is how the model performs on a test dataset:</p>
<pre>import sklearn.metrics
predictions = automl.predict(X_test_transformed)
sklearn.metrics.r2_score(y_test, predictions)
output: 0.1862</pre>
<p>An alternative approach that doesn’t use Auto-sklearn would be to manually select a model and run a grid search to find best parameters. A typical, well-performing algorithm is RandomForestRegressor so I might try the following:</p>
<pre>from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
model = RandomForestRegressor(max_depth=3, random_state=0)
parameters = {
"max_depth": (2, 3, 5)
}
grid = GridSearchCV(model, parameters, cv=5, scoring="r2")
grid.fit(X_train_transformed, y_train.values.ravel())</pre>
<p>For comparison, the performance of this model would be:</p>
<pre>predictions = grid.predict(X_test_transformed)
sklearn.metrics.r2_score(y_test, predictions)
output: 0.0982</pre>
<p>Impressively, the default Auto-sklearn <a href="https://en.wikipedia.org/wiki/Coefficient_of_determination" target="_blank">R2</a> performance of 0.186 is nearly twice as good as simplistic scikit-learn-only performance of 0.098. These are not intended to be absolute benchmarks because I performed no customization but the relative performance is worth noting. The results suggest that Auto-sklearn can set a very reasonable lower performance bound that no model deployed in production should underperform.</p>
<p>More about me: <a href="https://adamnovotny.com" target="_blank">adamnovotny.com</a></p>
<h2>Google Paper: 24/7 by 2030 (2020-10-31)</h2>
<p>Google released a <a href="https://www.gstatic.com/gumdrop/sustainability/247-carbon-free-energy.pdf" target="_blank">white paper</a> describing how the company intends to generate all of its electricity needs from renewable energy sources by 2030. Previously, Google committed to reducing emissions by buying offsets or generating renewable energy off-cycle. This new commitment goes further: “Google intends to match its operational electricity use with nearby carbon-free energy sources in every hour of every year”</p>
<p><a href="/theme/images/1*HOTcFQxbxwukF2iOdh3lkw.png.png"><img src="/theme/images/1*HOTcFQxbxwukF2iOdh3lkw.png.png" alt="Google’s energy journey" style="width: 100%" loading="lazy"></a></p>
<p>Everybody interested should read it — it’s short.</p>
<p>Google cooperated with <a href="https://www.watttime.org/" target="_blank">Watttime</a> to generate the dataset that measures the carbon emissions intensity in regions where Google’s data centers are located. Watttime has a very interesting <a href="https://www.watttime.org/api-documentation/#introduction" target="_blank">API</a> providing carbon intensity in real time. I collected a random set of data points over 24 hours for a selected number of regions where Google data centers are <a href="https://www.google.com/about/datacenters/locations/" target="_blank">located</a> in the US. All code is available in this <a href="https://gist.github.com/excitedAtom/0d980a908a35732e9e55e8b1e8f27985" target="_blank">Github gist</a>.</p>
<p><a href="/theme/images/1*U11j-6mlV0pwUsNslceXaQ.png.png"><img src="/theme/images/1*U11j-6mlV0pwUsNslceXaQ.png.png" alt="Marginal Operating Emissions Rates (MOER) of Select Google Data Centers" style="width: 100%" loading="lazy"></a></p>
<p>Marginal Operating Emissions Rates (MOER): 0 represents no emissions (clean energy generation), 100 represents the highest emissions. In other words, in regions where MOER is high, Google has a lot of work to do to replace the electricity it uses with clean sources (or to store clean energy). In the chart above, <a href="https://www.google.com/about/datacenters/locations/midlothian/" target="_blank">Midlothian, TX</a> appears to be one of those challenging locations. On the other hand, Mayes County, OK is a location that Google appears to be satisfied with: “Our highest clean energy percentage is in Oklahoma (Southwest Power Pool), where our purchases of wind power helped drive carbon-free energy performance at our data center from 41% to 96%”</p>
<p>I firmly believe that renewable energy will be widely adopted only if it is at least as cheap as alternatives. So exploring the data from a financial perspective is critical: From 2009 to 2019, costs for wind and solar power declined by 70% and 89% (<a href="https://www.lazard.com/media/451086/lazards-levelized-cost-of-energy-version-130-vf.pdf" target="_blank">see page 8</a>).</p>Serving Dynamic Web Pages using Python and AWS Lambda2020-07-25T00:00:00-05:002020-07-25T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2020-07-25:/blog/serving-dynamic-web-pages-using-python-and-aws-lambda.html<p>While AWS Lambda functions are typically used to build API endpoints, at their core Lambda functions can return almost anything. This includes returning html markup with dynamic content.</p>
<p><a href="/theme/images/1*9bPdHLV7ghV1RuNYOGkTvA.png.png"><img src="/theme/images/1*9bPdHLV7ghV1RuNYOGkTvA.png.png" alt="AWS Lambda + Python + Jinja" style="width: 100%" loading="lazy"></a></p>
<p>I will not go into details describing how to deploy AWS Lambda functions. Please see the official <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-python.html" target="_blank">documentation</a>. I will however describe …</p><p>While AWS Lambda functions are typically used to build API endpoints, at their core Lambda functions can return almost anything. This includes returning html markup with dynamic content.</p>
<p><a href="/theme/images/1*9bPdHLV7ghV1RuNYOGkTvA.png.png"><img src="/theme/images/1*9bPdHLV7ghV1RuNYOGkTvA.png.png" alt="AWS Lambda + Python + Jinja" style="width: 100%" loading="lazy"></a></p>
<p>I will not go into details describing how to deploy AWS Lambda functions. Please see the official <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-python.html" target="_blank">documentation</a>. I will however describe how to return dynamic html content instead of a typical <a href="https://en.wikipedia.org/wiki/JSON" target="_blank">JSON</a>.</p>
<h4>Step 0 — Optional</h4>
<p>If you prefer to develop and test lambda functions locally (as I do), you can use Docker to simulate the AWS lambda function environment. A sample Dockerfile I use is below.</p>
<pre># Base image approximating the AWS Lambda runtime environment
FROM amazonlinux:latest
# Copy the project into the image
RUN mkdir -p /mnt/app
ADD . /mnt/app
WORKDIR /mnt/app
# Build tools needed to compile Python dependencies
RUN yum update -y
RUN yum install gcc -y
RUN yum install gcc-c++ -y
RUN yum install findutils -y
RUN yum install zip -y
# Python version matching the Lambda runtime used in this article
RUN amazon-linux-extras install python3=3.6.2
RUN pip3 install --upgrade pip
# Install dependencies into the folder structure expected by a Lambda layer
RUN pip3 install -r requirements.txt -t aws_layer/python</pre>
<p>The requirements.txt includes just one package for simplicity. It is the common <a href="https://jinja.palletsprojects.com/en/2.11.x/" target="_blank">templating library for Python, Jinja2</a>.</p>
<pre>Jinja2==2.11.1</pre>
<p>You can test your Lambda function by simply calling it with sample parameters:</p>
<pre>import lambda_function

event = {
    "queryStringParameters": {
        "param1": "value1"
    },
    "path": "/api",
    "requestContext": {
        "param2": "value2"
    }
}
res = lambda_function.lambda_handler(event=event, context={})
assert 200 == int(res["statusCode"])</pre>
<h4>Step 1 — Write html template</h4>
<p>In this step, we write the html template the Lambda function will return. A good default is the new <a href="https://v5.getbootstrap.com/docs/5.0/getting-started/introduction/" target="_blank">Bootstrap 5</a> CSS framework where the recommended starting markup looks something like this:</p>
<p><a href="/theme/images/1*b3ZXkCsw8BwLt2Fx_eu6PQ.png.png"><img src="/theme/images/1*b3ZXkCsw8BwLt2Fx_eu6PQ.png.png" alt="Sample HTML page" style="width: 100%" loading="lazy"></a></p>
<p>Saving this file in folder “templates” and naming it index.html, we are ready to write the Lambda function.</p>
<h4>Step 2 — Write Lambda function to serve your html page</h4>
<p>In the example below, the lambda function expects URL parameters and parses those. A request with a custom URL would look something like this: example.com/?my_name=somename. See step 10 in <a href="https://adamnovotny.com/blog/serverless-web-apps-with-firebase-and-aws-lambda.html" target="_blank">this tutorial</a> to add custom URLs to your API Gateway-triggered Lambda functions.</p>
<pre>import os
import sys
from jinja2 import Environment, FileSystemLoader</pre>
<pre>def lambda_handler(event, context):
    env = Environment(loader=FileSystemLoader(os.path.join(os.path.dirname(__file__), "templates"), encoding="utf8"))
    my_name_query = False
    if event["queryStringParameters"] and "my_name" in event["queryStringParameters"]:
        my_name_query = event["queryStringParameters"]["my_name"]
    template = env.get_template("index.html")
    html = template.render(
        my_name=my_name_query
    )
    return response(html)</pre>
<pre>def response(myhtml):
    return {
        "statusCode": 200,
        "body": myhtml,
        "headers": {
            "Content-Type": "text/html",
        }
    }</pre>
<ul><li>jinja2 loads your previously created index.html using class “FileSystemLoader” and we store it as variable “env”</li></ul>
<ul><li>the “my_name” URL query parameter is parsed as explained above and stored as the Python variable my_name_query</li></ul>
<ul><li>the jinja2 render function then passes my_name_query to the template and returns the html page</li></ul>
<p>Also published on <a href="https://adamnovotny.com" target="_blank">adamnovotny.com</a></p>Custom VPN using PiVPN and public cloud2018-12-30T00:00:00-05:002018-12-30T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-12-30:/blog/custom-vpn-using-pivpn-and-public-cloud.html<p>Motivation: Many public Wi-Fi networks block certain internet ports and protocols. For example, a public library might only allow ports 80 and 443 and the TCP protocol. Leaving aside the logic of such decisions by network owners, they prevent users from taking advantage of many commercial VPN products that rely …</p><p>Motivation: Many public Wi-Fi networks block certain internet ports and protocols. For example, a public library might only allow ports 80 and 443 and the TCP protocol. Leaving aside the logic of such decisions by network owners, they prevent users from taking advantage of many commercial VPN products that rely on other ports. The goal of this article is to create a custom VPN solution to improve privacy even on such restricted public networks.
<a href="/theme/images/1*hI_JbE-q2OQpXW1tEB46Cg.png.png"><img src="/theme/images/1*hI_JbE-q2OQpXW1tEB46Cg.png.png" alt="AWS, Google Cloud, Microsoft Azure" style="width: 100%" loading="lazy"></a>
<a href="/theme/images/1*7C1odbX4Kk_yxToAaKnlwA.png.png"><img src="/theme/images/1*7C1odbX4Kk_yxToAaKnlwA.png.png" alt="PiVPN" style="width: 100%" loading="lazy"></a></p>
<p>All steps are outlined in more detail in <a href="https://github.com/adam5ny/vpn-gcp-pivpn" target="_blank">this Github repo</a>. The tutorial is written for Python 3 and <a href="https://cloud.google.com/compute/" target="_blank">Google Cloud Compute</a>. However, all public clouds can be used including AWS or Azure.</p>
<h4>Create public cloud compute instance</h4>
<p>Login to <a href="https://console.cloud.google.com/" target="_blank">GCP console</a>. Create an Ubuntu machine and make sure to allow https traffic. Then locate your public IP which is where your traffic will be routed. In GCP, you can find it using the following steps (as of Dec 2018): VPC network > External IP addresses > switch the type of your instance IP from “Ephemeral” to “Static”. This will be your public IP.</p>
<h4>Create PiVPN instance</h4>
<p>Login to your compute instance and download PiVPN using the following command:</p>
<pre>curl -L <a href="https://install.pivpn.io" target="_blank">https://install.pivpn.io</a> | bash</pre>
<p>Follow all setup steps using default values except for port and protocol. Select port 443 and protocol TCP. Select reboot at the end of the installation.</p>
<h4>Create VPN credentials</h4>
<pre>pivpn add</pre>
<p>Enter your custom username and password. Download credentials to your computer from your newly created cloud compute instance. Credentials are typically located at ~/ovpns on your Ubuntu instance.</p>
<h4>Download a VPN client for your platform</h4>
<p>For MacOS you may use <a href="https://tunnelblick.net/" target="_blank">Tunnelblick</a>. Then drag credentials (.ovpn) from the previous step to the Tunnelblick app icon. Click on the Tunnelblick icon to connect to your VPN with your custom username and password.</p>
<h4>Final notes:</h4>
<ul><li>use port 80 instead of 443 above if necessary.</li></ul>
<ul><li>to minimize cost, remember to shut down your compute instance when you are not using the VPN. The typical cost is < $20 for 100GB of traffic and 24/7 usage which is in line with respectable third-party VPN providers. However, your cost may be significantly lower if you shut down unused instances and use pre-emptible instances (GCP-specific).</li></ul>Serverless web apps with Firebase and AWS Lambda2018-09-09T00:00:00-05:002018-09-09T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-09-09:/blog/serverless-web-apps-with-firebase-and-aws-lambda.html<p>Serverless has become a popular solution for small to medium-sized projects. The downside is a technology stack lock-in which forces developers to use technologies that might not be optimal for their projects. For example, people using <a href="https://firebase.google.com/" target="_blank">Google’s Firebase</a> to host their static resources have to write custom endpoint <a href="https://firebase.google.com/docs/functions/" target="_blank">functions …</a></p><p>Serverless has become a popular solution for small to medium-sized projects. The downside is a technology stack lock-in which forces developers to use technologies that might not be optimal for their projects. For example, people using <a href="https://firebase.google.com/" target="_blank">Google’s Firebase</a> to host their static resources have to write custom endpoint <a href="https://firebase.google.com/docs/functions/" target="_blank">functions</a> in JavaScript or TypeScript (as of August 2018). Developers typically use custom backend functions to hide business logic or proprietary data operations from users because anything that runs in the browser’s front end as JavaScript is ultimately an open book from the user’s perspective.</p>
<p><a href="/theme/images/1*6QfJob8HhDsYsGjzYSb3eA.jpeg.png"><img src="/theme/images/1*6QfJob8HhDsYsGjzYSb3eA.jpeg.png" alt="Firebase + AWS Lambda" style="width: 100%" loading="lazy"></a></p>
<p>One simple solution is to combine Firebase with custom functions using a different platform. I will outline the steps to create a Firebase-hosted web app, setup DNS for subdomain, and create AWS Lambda functions to serve custom business logic as APIs. This is just an example setup and all major cloud players provide solutions that can be combined in other ways such as using <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteHosting.html" target="_blank">AWS S3 to host statics resources</a> and <a href="https://cloud.google.com/functions/?utm_source=google&utm_medium=cpc&utm_campaign=emea-emea-all-en-dr-bkws-all-all-trial-e-gcp-1003963&utm_content=text-ad-none-any-DEV_c-CRE_253480695966-ADGP_Hybrid%20%7C%20AW%20SEM%20%7C%20BKWS%20~%20EXA_M:1_EMEA_EN_General_Cloud%20Functions_ETL%20Warehouse-KWID_43700019207153401-kwd-34012173938-userloc_1003837&utm_term=KW_function%20google-ST_function%20google&ds_rl=1245734&gclid=EAIaIQobChMInM_0gaH43AIVpp3tCh1CTAHNEAAYASAAEgJ1iPD_BwE&dclid=COGAr4Oh-NwCFQ6ZdwodUTIBBQ" target="_blank">Google’s Cloud Function</a> to serve business logic API.</p>
<p>At the end we will have:</p>
<ul><li>www.example.com (<a href="https://en.wikipedia.org/wiki/Single-page_application" target="_blank">single page app</a> served by Firebase)</li></ul>
<ul><li>api.example.com (AWS lambda function serving custom business logic used by www.example.com)</li></ul>
<p>I will not go into specific details on each platform because their UIs constantly change. Instead, I will highlight the sequence of steps I typically take to setup the services quickly.</p>
<h4>Firebase hosting</h4>
<ul><li>1) Deploy a static web app to <a href="https://firebase.google.com/products/hosting/" target="_blank">Firebase</a> by following <a href="https://firebase.google.com/docs/hosting/deploying" target="_blank">this part</a> of the Firebase documentation. The end result will be a public web app. Its URL will look something like this: my-project-name.firebaseapp.com</li></ul>
<p><a href="/theme/images/1*gH7ELyB6td5o2A1bY4drGw.png.png"><img src="/theme/images/1*gH7ELyB6td5o2A1bY4drGw.png.png" alt="Firebase hosting setup" style="width: 100%" loading="lazy"></a></p>
<ul><li>2) Let’s assume we purchased the custom domain example.com. We now need to update the DNS records so that example.com and www.example.com point to our static web app.</li></ul>
<ul><li>3) Go to your Firebase project dashboard and in the hosting section initiate the steps to connect to a custom domain. You will need to verify ownership of the domain by adding a DNS TXT record to your registrar’s DNS settings. As always, the <a href="https://firebase.google.com/docs/hosting/custom-domain" target="_blank">documentation</a> is useful.</li></ul>
<p><a href="/theme/images/1*Kd7lW2alk8UA4cL8kkvxxA.png.png"><img src="/theme/images/1*Kd7lW2alk8UA4cL8kkvxxA.png.png" alt="Connect custom domain to Firebase app" style="width: 100%" loading="lazy"></a></p>
<ul><li>4) Go to your domain registrar’s DNS settings, and create a DNS A record for subdomain www pointing to the IP address of the Firebase servers obtained in the previous step. After SSL certificates are automatically provisioned by Firebase, users can go to https://www.example.com to locate your Firebase app.</li></ul>
<ul><li>5) We also need to make sure that users entering just example.com are also pointed to https://www.example.com. To accomplish this, return to your registrar’s DNS settings and set up subdomain forwarding. The exact steps vary for each registrar but the end result will be example.com -> https://www.example.com. If possible, set the redirect as permanent 301, forward path, and enable SSL.</li></ul>
<h4>AWS Lambda</h4>
<p>At this point we have a web app deployed and using our custom URL. The app however uses the subdomain api.example.com to obtain proprietary data. In Angular, the code requesting data from the subdomain may look something like this:</p>
<pre>const headers = {
  headers: new HttpHeaders({
    'Content-Type': 'application/json',
    'x-api-key': 'some-api-key'
  })
};
this.http.get('https://api.example.com/get-data', headers)
  .subscribe((data: string) => {
    const dataJson = JSON.parse(data);
    // some data operations
  });</pre>
<p>If our backend is relatively simple (doesn’t require large third party packages) and runs fast, the easiest solution is to deploy cloud functions at one of the largest providers. AWS limits Lambda deployments to 50MB and the default timeout is 3 seconds which are reasonable guidelines to determine whether your custom API backend is suitable for serverless functions.</p>
<ul><li>6) We need to create a Lambda function. I like to test my lambda functions locally and then deploy them as zip files to AWS. For Python, follow <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html" target="_blank">this</a> tutorial. Lambda supports all major languages and similar tutorials exist for at least Node.js, C#, Go, Java.</li></ul>
<p><a href="/theme/images/1*ktkhPraczTuDeRUkbkhUUA.png.png"><img src="/theme/images/1*ktkhPraczTuDeRUkbkhUUA.png.png" alt="AWS Lambda zip upload" style="width: 100%" loading="lazy"></a></p>
<ul><li>7) Next we need to make the function publicly available so we will use API Gateway to create a public endpoint. Make sure to check that an API key is required and then go to Actions and Deploy API.</li></ul>
<p><a href="/theme/images/1*HgGVxaoWfxJo9jM3WSs-3Q.png.png"><img src="/theme/images/1*HgGVxaoWfxJo9jM3WSs-3Q.png.png" alt="API Gateway endpoint creation" style="width: 100%" loading="lazy"></a></p>
<ul><li>8) Secure the endpoint with at least an API key which can be created in the API Gateway as well.</li></ul>
<p><a href="/theme/images/1*UQQ7OzNmF9iNxeWxlIXYDQ.png.png"><img src="/theme/images/1*UQQ7OzNmF9iNxeWxlIXYDQ.png.png" alt="API key generation example" style="width: 100%" loading="lazy"></a></p>
<ul><li>9) Create a Usage Plan that will limit how often your API can be used. This will prevent your Lambda function from being overused. While AWS Lambda has a very generous free tier, security is paramount for peace of mind. A Usage Plan basically connects the API key (step 8) to the endpoint deployment (step 7). At this point, you should be able to use your Lambda function by going to a URL that looks something like this https://xyz1234567.execute-api.us-east-1.amazonaws.com/stage. Remember that an API key is required as a header so tools such as <a href="https://www.getpostman.com/" target="_blank">Postman</a> are useful to customize the API requests easily; a minimal Python example follows the screenshot below.</li></ul>
<p><a href="/theme/images/1*xbehZsrAZpOs_zB8QDD6Wg.png.png"><img src="/theme/images/1*xbehZsrAZpOs_zB8QDD6Wg.png.png" alt="Usage Plan example" style="width: 100%" loading="lazy"></a></p>
<ul><li>10) Ultimately, we want to have a nice-looking URL such as api.example.com instead of the long random URL above. First, we need to create a certificate for our subdomain so that our connection supports <a href="https://en.wikipedia.org/wiki/Transport_Layer_Security" target="_blank">SSL</a> (https). Go to Certificate Manager and follow the steps to create a certificate managed by AWS:</li></ul>
<p><a href="/theme/images/1*FBijvl_qHBaCx2ue46fSzg.png.png"><img src="/theme/images/1*FBijvl_qHBaCx2ue46fSzg.png.png" alt="AWS Certificate Manager" style="width: 100%" loading="lazy"></a></p>
<ul><li>11) Now that we have a certificate available, return to API Gateway and go to Custom Domain Names and create the API name (such as api.example.com) and select the ACM certificate created in the previous step. Map it to your API deployment. This will generate a Target domain name of the form xyz1234567899.cloudfront.net.</li></ul>
<p><a href="/theme/images/1*et1Xy6T2IAhI2O475rC4Kw.png.png"><img src="/theme/images/1*et1Xy6T2IAhI2O475rC4Kw.png.png" alt="Custom Domain Name" style="width: 100%" loading="lazy"></a></p>
<ul><li>12) Return to your domain registrar’s DNS records, and create a CNAME record pointing to the target domain name above (such as xyz1234567899.cloudfront.net). Now once DNS records propagate, requesting api.example.com is going to terminate at your Lambda function and will be accessible by your Firebase frontend.</li></ul>
<p>That’s it! Now you can deploy a fully featured web app with a custom backend, URL and generous free tiers (as of August 2018). With a little bit of practice the process takes about an hour subject to DNS propagation and requires virtually no backend deployment knowledge. It scales well for most small to medium-sized apps that do not require specialized compute-intensive workloads such as Machine Learning (see my ML deployment article <a href="https://medium.com/coinmonks/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">here</a>).</p>Machine Learning Tutorial #4: Deployment2018-09-02T00:00:00-05:002018-09-02T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-09-02:/blog/machine-learning-tutorial-4-deployment.html<p><a href="/theme/images/1*T_-rIQ8yUgPba_ezxt6ogg.png.png"><img src="/theme/images/1*T_-rIQ8yUgPba_ezxt6ogg.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>In this final phase of the series, I will suggest a few options ML engineers have to deploy their code. In large organizations, this part of the project will be handled by a specialized team which is especially important when scaling is a concern. Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 …</a></p><p><a href="/theme/images/1*T_-rIQ8yUgPba_ezxt6ogg.png.png"><img src="/theme/images/1*T_-rIQ8yUgPba_ezxt6ogg.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>In this final phase of the series, I will suggest a few options ML engineers have to deploy their code. In large organizations, this part of the project will be handled by a specialized team which is especially important when scaling is a concern. Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , #4 Deployment (this article). <a href="https://github.com/adam5ny/blogs/tree/master/ml-deployment" target="_blank">Github code</a>.</p>
<h4>Stack Selection</h4>
<p>The options for deploying ML code are numerous, but I typically decide among at least three general buckets:</p>
<ul><li>Solution provided as-a-service (e.g. Microsoft Azure Machine Learning Studio)</li></ul>
<ul><li>Serverless function (e.g. <a href="https://docs.aws.amazon.com/lambda/latest/dg/python-programming-model.html" target="_blank">AWS Lambda</a>)</li></ul>
<ul><li>Custom backend code (e.g. <a href="http://flask.pocoo.org/docs/0.12/" target="_blank">Python Flask</a> served by <a href="https://devcenter.heroku.com/articles/getting-started-with-python" target="_blank">Heroku</a>)</li></ul>
<h4>As-a-service solution</h4>
<p>Platforms such as Microsoft Azure Machine Learning Studio offer the full suite of tools for the entire project including preprocessing and training. Custom API endpoints are usually easy to generate and writing code is often not necessary thanks to drag-and-drop interfaces. The solutions are often well optimized for <a href="https://en.wikipedia.org/wiki/Lazy_learning" target="_blank">lazy learners</a> where evaluation is the most expensive computational step. The downside is that it is sometimes more challenging to bring in custom code (such as the final model) without going through all the project steps on the platform.</p>
<p><a href="/theme/images/1*4F3z9NovnqtOtIRWRCJn_Q.jpeg.png"><img src="/theme/images/1*4F3z9NovnqtOtIRWRCJn_Q.jpeg.png" alt="As-a-service deployment example: Microsoft Azure" style="width: 100%" loading="lazy"></a></p>
<h4>Serverless function</h4>
<p>Serverless functions are a good solution for inexpensive computations. AWS uses a default timeout of 3 seconds for a function to complete. While timeouts can be extended, the default value is often a good general guideline when deciding about suitability. Lambda only allows 50MB of custom code to be uploaded which is generally not enough for most machine learning purposes. However, functions are well suited for fast computations such as linear regression models. Another downside is that platforms support only specific languages. In terms of Python solutions, AWS Lambda supports versions 2.7 and 3.6 only at the time of writing this article.</p>
<h4>Custom backend code</h4>
<p>Writing a custom backend code on platform such as Heroku or Amazon’s EC2 allows us to replicate fully the code we write on local machines. The code and server deployment can be fully customized for the type of ML algorithm we are deploying. The downside of such solutions is their operational complexity because we need to focus on many steps unrelated to ML such as security.</p>
<p>I will deploy the code on <a href="https://devcenter.heroku.com/articles/getting-started-with-python" target="_blank">Heroku</a> which offers a free tier for testing purposes. The lightweight <a href="http://flask.pocoo.org/" target="_blank">Flask framework</a> will drive the backend. The primary reason for this choice is that it allows us to reuse essentially all the code written in previous tutorials for the backend. We can install Flask with Python 3.6 and all machine learning libraries we used previously side by side.</p>
<p>The entire backend code to run the app is literally a few lines long with Flask:</p>
<pre>import json
import pickle
import pandas as pd
from flask import Flask, jsonify, request, make_response</pre>
<pre>app = Flask(__name__)</pre>
<pre><a href="http://twitter.com/app" target="_blank">@app</a>.route('/forecast', methods=["POST"])
def forecast_post():
"""
Args:
request.data: json pandas dataframe
example: {
"columns": ["date", "open", "high", "low", "close",
"volume"],
"index":[1, 0],
"data": [
[1532390400000, 108, 108, 107, 107, 26316],
[1532476800000, 107, 111, 107, 110, 30702]]
}
"""
if request.data:
df = pd.read_json(request.data, orient='split')
X = preprocess(df)
model = pickle.load(open("dtree_model.pkl", "rb"))
y_pred = run_model(X, model)
resp = make_response(jsonify({
"y_pred": json.dumps(y_pred.tolist())
}), 200)
return resp
else:
return make_response(jsonify({"message": "no data"}), 400)</pre>
<ul><li>pd.read_json(…): reads data from <a href="https://en.wikipedia.org/wiki/POST_(HTTP)" target="_blank">POST request</a> which is a json object corresponding to price data formatted the same way as Yahoo finance prices (our original data source)</li></ul>
<ul><li>preprocess(…): copy of our code from the <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing</a> tutorial that manipulates raw price data into features. Importantly, the scaler used must be the exact same we used in Preprocessing so it has to be saved to pickle file first during Preprocessing and loaded from pickle now</li></ul>
<ul><li>run_model(…): loads and runs our saved final model from the <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">Training</a> tutorial</li></ul>
<ul><li>make_response(…): returns forecasts</li></ul>
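<p>To sanity check the endpoint before deploying, one option is to POST the example payload from the docstring. The sketch below assumes the Flask app is running locally on port 5000; the host and port are assumptions, not part of the original tutorial:</p>
<pre>import requests

# Payload mirrors the docstring example above
payload = {
    "columns": ["date", "open", "high", "low", "close", "volume"],
    "index": [1, 0],
    "data": [
        [1532390400000, 108, 108, 107, 107, 26316],
        [1532476800000, 107, 111, 107, 110, 30702]
    ]
}
resp = requests.post("http://localhost:5000/forecast", json=payload)
print(resp.status_code)
print(resp.json())</pre>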
<h4>Heroku</h4>
<p>Deploying our prediction code to Heroku will require that we collect at least two necessary pieces of our code from previous tutorials: the final model (saved as a pickle file) and the code from the <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing</a> tutorial that transforms the original features we collected from the real world to features our model can handle.</p>
<p>I will not go into details about how to deploy a Docker app on Heroku. There are plenty of good materials including Heroku’s documentation, which is excellent. All the necessary code to run and deploy the Docker app on Heroku is also in the <a href="https://github.com/adam5ny/blogs/tree/master/ml-deployment" target="_blank">Github repo</a>. There are a few key steps to remember:</p>
<ul><li>Save Dockerfile as Dockerfile.web which is a container of all code necessary to run the app</li></ul>
<ul><li>Deploy container using command <a href="https://devcenter.heroku.com/articles/container-registry-and-runtime" target="_blank">heroku container:push</a></li></ul>
<ul><li>Release container using command <a href="https://devcenter.heroku.com/articles/container-registry-and-runtime" target="_blank">heroku container:release</a></li></ul>
<p>At this point our code is deployed which we can test using <a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjmut-U1JvdAhVKsqQKHaQUBg0QFjAAegQIBRAC&url=https%3A%2F%2Fwww.getpostman.com%2F&usg=AOvVaw1vWzpwzQOHi5ErKZnywLDR" target="_blank">Postman</a> to make a manual forecast request:</p>
<p><a href="/theme/images/1*5kvKnVEez88tZ96uTtOqjg.png.png"><img src="/theme/images/1*5kvKnVEez88tZ96uTtOqjg.png.png" alt="Postman sample request" style="width: 100%" loading="lazy"></a></p>
<p>The date is represented by Unix timestamp. The first Body window consists of inputs we provide to the endpoint in the form of prices. The second window returns forecasts from the app.</p>
<h4>Testing</h4>
<p>To test the implementation, I will reuse the code from the Evaluation step. However, instead of making predictions locally using our sklearn model, I will use the Heroku app to predict the 691 samples from Evaluation as a batch. The goal is for the predictions we made on a local machine to perfectly match those made using our deployment stack.</p>
<p>This step is critical to ensure that we can replicate our results remotely using a pre-trained model. The testing code is also available on <a href="https://github.com/adam5ny/blogs/blob/master/ml-deployment/backend/tests/test_app.py" target="_blank">Github</a>. We confirm that the performance of our Heroku app matches the performance generated locally in the Evaluation tutorial:</p>
<p><a href="/theme/images/1*Oewaabcu926MZpC-zFFpHQ.png.png"><img src="/theme/images/1*Oewaabcu926MZpC-zFFpHQ.png.png" alt="Tested deployment performance matches evaluation results" style="width: 100%" loading="lazy"></a></p>
<p>To conclude, the project is intended to provide an overview of the kind of thinking a data science project entails. The code should not be used in production and is provided solely for illustrative purposes. As always, I welcome all constructive feedback (positive or negative) on <a href="https://twitter.com/adam5ny" target="_blank">Twitter</a>.</p>
<p>Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a>, #4 Deployment (this article). <a href="https://github.com/adam5ny/blogs/tree/master/ml-deployment" target="_blank">Github code</a>.</p>Machine Learning Tutorial #3: Evaluation2018-08-19T00:00:00-05:002018-08-19T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-08-19:/blog/machine-learning-tutorial-3-evaluation.html<p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>In this third phase of the series, I will explore the Evaluation part of the ML project. I will reuse some of the code and solutions from the second Training phase. However, it is important to note that the Evaluation phase should be completely separate from training except for using …</p><p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>In this third phase of the series, I will explore the Evaluation part of the ML project. I will reuse some of the code and solutions from the second Training phase. However, it is important to note that the Evaluation phase should be completely separate from training except for using the final model produced in the Training step. Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, #3 Evaluation (this article), <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a>. <a href="https://github.com/adam5ny/blogs/tree/master/ml-evaluation" target="_blank">Github code</a>.</p>
<h4>Performance Metrics</h4>
<p>The goal of this section is to determine how our model from the Training step performs on real life data it has not learned from. First, we have to load the model we saved as the Final model:</p>
<pre>model = pickle.load(open("dtree_model.pkl", "rb"))
>>> model
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=5, min_weight_fraction_leaf=0.0, presort=False, random_state=1, splitter='best')</pre>
<p>Next, we will load the testing data we created in the Preprocessing part of this tutorial. The primary reason why I keep the Evaluation section separate from Training is precisely this step. I keep the code separate as well to ensure that no information from training leaks into evaluation. To restate, we should have not seen the data used in this section at any point until now.</p>
<pre>X = pd.read_csv("X_test.csv", header=0)
y = pd.read_csv("y_test.csv", header=0)</pre>
<p>At this stage, we may perform additional performance evaluation on top of the Training step. However, I will stick to the metrics used previously: MAE, MSE, R2.</p>
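<p>A minimal sketch of computing these metrics with scikit-learn, using the model and test set loaded above:</p>
<pre>from sklearn import metrics

y_pred = model.predict(X)
print("MAE:", metrics.mean_absolute_error(y, y_pred))
print("MSE:", metrics.mean_squared_error(y, y_pred))
print("R2:", metrics.r2_score(y, y_pred))</pre>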
<p><a href="/theme/images/1*5Cjov6KncJ3qLJ-fHi6MTg.png.png"><img src="/theme/images/1*5Cjov6KncJ3qLJ-fHi6MTg.png.png" alt="Decision tree MAE, MSE, R2" style="width: 100%" loading="lazy"></a></p>
<h4>Commentary</h4>
<p>We already knew from the previous tutorial that our model does not perform well enough in practice. However, as I mentioned before, I went ahead and used it here for illustrative purposes in order to complete the tutorial and to explain the kind of thinking involved in real life projects, where performance is not always as ideal out of the box as many toy datasets would make one think.</p>
<p>The key comparison is how well our model evaluates relative to the training phase. In the case of models ready for production, I would expect the performance in the Evaluation step to be comparable to that of the testing folds in the Training phase.</p>
<p>Comparing the last training test fold <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">here</a> (5249 datapoints used to train) and the Evaluation results above:</p>
<ul><li>MAE: final Training phase ~10^-2. Evaluation phase ~10^-2</li></ul>
<ul><li>MSE: final Training phase ~10^-4. Evaluation phase ~10^-3</li></ul>
<ul><li>R²: final Training phase ~0. Evaluation phase ~0</li></ul>
<p>The performance on a dataset the model has never seen before is reasonably similar. Nonetheless, overfitting is still something to potentially address. If we had a model ready for production from the Training phase, we would be reasonably confident at this stage that it would perform as we expect on out of sample data.</p>
<p>Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, #3 Evaluation (this article), <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>Machine Learning Tutorial #2: Training2018-08-12T00:00:00-05:002018-08-12T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-08-12:/blog/machine-learning-tutorial-2-training.html<p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>This second part of the ML Tutorial follows up on the first <a href="https://medium.com/@adam5ny/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing</a> part. All code is available in this <a href="https://github.com/adam5ny/blogs/tree/master/ml-training" target="_blank">Github repo</a>. Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, #2 Training (this article), <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>
<p>I concluded Tutorial #1 with 4 datasets: training features, testing features, training target …</p><p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>This second part of the ML Tutorial follows up on the first <a href="https://medium.com/@adam5ny/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing</a> part. All code is available in this <a href="https://github.com/adam5ny/blogs/tree/master/ml-training" target="_blank">Github repo</a>. Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, #2 Training (this article), <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>
<p>I concluded Tutorial #1 with 4 datasets: training features, testing features, training target variables, and testing target variables. Only training features and training target variables will be used in this Tutorial #2. The testing data will be used for evaluation purposes in Tutorial #3.</p>
<h4>Performance Metrics</h4>
<p>We are focused on regression algorithms so I will consider the 3 most often used performance metrics:</p>
<ul><li><a href="https://en.wikipedia.org/wiki/Mean_absolute_error" target="_blank">Mean Absolute Error</a> (MAE)</li></ul>
<ul><li><a href="https://en.wikipedia.org/wiki/Mean_squared_error" target="_blank">Mean Squared Error</a> (MSE)</li></ul>
<ul><li><a href="https://en.wikipedia.org/wiki/Coefficient_of_determination" target="_blank">R²</a></li></ul>
<p>In practice, a domain-specific decision could be made to supplement the standard metrics above. For example, investors are typically more concerned about significant downside errors rather than upside errors. As a result, a metric could be derived that overemphasizes downside errors corresponding to financial losses.</p>
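<p>As an illustration only (this metric is not used in the rest of the tutorial), such an asymmetric metric could look like the following sketch:</p>
<pre>import numpy as np

def downside_weighted_mae(y_true, y_pred, downside_weight=2.0):
    # Hypothetical asymmetric error: realized returns that come in below the
    # forecast (downside surprises) are weighted more heavily than upside misses
    errors = np.asarray(y_true) - np.asarray(y_pred)
    weights = np.where(errors < 0, downside_weight, 1.0)
    return float(np.mean(weights * np.abs(errors)))

print(downside_weighted_mae([0.01, -0.03], [0.02, 0.01]))</pre>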
<h4>Cross Validation</h4>
<p>I will return to the same topic I addressed in <a href="https://medium.com/@adam5ny/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing</a>. Due to the nature of time series data, standard randomized K-fold validation produces forward looking bias and should not be used. To illustrate the issue here, let’s assume that we split 8 years of data into 8 folds, each representing one year. The first training cycle will use folds #1–7 for training and fold #8 for testing. The next training cycle may use folds #2–8 for training and fold #1 for testing. This is of course unacceptable because we are using data from years 2–8 to forecast year 1.</p>
<p>Our cross validation must respect the temporal sequence of the data. We can use Walk Forward Validation or simply multiple Train-Test Splits. For illustration, I will use 3 Train-Test splits. For example, let’s assume we have 2000 samples sorted by timestamp from the earliest. Our 3 segments would look as follows:</p>
<p><a href="/theme/images/1*cFti5rqcbFrE5p_4My4eww.png.png"><img src="/theme/images/1*cFti5rqcbFrE5p_4My4eww.png.png" alt="Train-Test splits. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<h4>Model Selection</h4>
<p><a href="/theme/images/1*M2clZWay68ODL2jEB-J5Dw.png.png"><img src="/theme/images/1*M2clZWay68ODL2jEB-J5Dw.png.png" alt="ML Model Selection. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>In this section, I will select the models to train. The “Supervised” algorithms section (red section in the image above) is relevant because the dataset contains both features and labels (target variables). I like to follow <a href="https://en.wikipedia.org/wiki/Occam%27s_razor" target="_blank">Occam’s razor</a> when it comes to algorithm selection. In other words, start with the algorithm that exhibits the fastest times to train and the greatest interpretability. Then we can increase complexity.</p>
<p>I will explore the following algorithms in this section:</p>
<ul><li>Linear Regression: fast to learn, easy to interpret</li></ul>
<ul><li>Decision Trees: fast to learn (requires pruning), easy to interpret</li></ul>
<ul><li>Neural Networks: slow to learn, hard to interpret</li></ul>
<h4>Linear Regression</h4>
<p>Starting with linear regression is useful to see if we can “get away” with simple statistics to achieve our goal before diving into complex machine learning algorithms. House price forecasting with clearly defined features is an example where linear regression often works well and using more complex algorithms is unnecessary.</p>
<p>Training a linear regression model using sklearn is simple:</p>
<pre>from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)</pre>
<p>Initial results yielded nothing remotely promising so I took another step and transformed features further. I created polynomial and nonlinear features to account for nonlinear relationships. For example, features [a, b] become [1, a, b, a², ab, b²] in the case of degree-2 polynomial.</p>
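<p>This expansion can be generated with scikit-learn; a quick sketch of the degree-2 case:</p>
<pre>import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # features [a, b]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))  # [[1. 2. 3. 4. 6. 9.]] i.e. [1, a, b, a^2, ab, b^2]</pre>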
<p><a href="/theme/images/1*c9-D9EJoTwKsRlJuK_WQaQ.png.png"><img src="/theme/images/1*c9-D9EJoTwKsRlJuK_WQaQ.png.png" alt="Linear Regression results. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>The x-axis represents 3 cross validation segments (the 1st fold uses 1749 samples for training and 1749 for testing, the 2nd uses 3499 for training and 1749 for testing, and the last uses 5249 for training and 1749 for testing). Clearly, the results suggest that the linear model is not useful in practice. At this stage I have at least the following options:</p>
<ul><li>Ridge regression: addresses overfitting (if any)</li></ul>
<ul><li>Lasso linear: reduces model complexity</li></ul>
<p>At this point, I don’t believe that any of the options above will meaningfully impact the outcome. I will move on to other algorithms to see how they compare.</p>
<p>Before moving on, however, I need to set expectations. There is a saying in finance that successful forecasters only need to be correct 51% of the time. Financial leverage can be used to magnify results so being just a little correct produces impactful outcomes. This sets expectations because we will never find algorithms that are consistently 60% correct or better in this domain. As a result, we expect low R² values. This needs to be said because many sample projects in machine learning are designed to look good, which we can never match in real-life price forecasting.</p>
<h4>Decision Tree</h4>
<p>Training a decision tree regressor model using sklearn is equally simple:</p>
<pre>from sklearn import tree
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)</pre>
<p>The default results for the fit function above almost always <a href="https://en.wikipedia.org/wiki/Overfitting" target="_blank">overfit</a>. Decision trees have a very expressive hypothesis space so they can represent almost any function when not pruned. R² for training data can easily become perfect 1.0 while for testing data the result will be 0. We therefore need to use the max_depth argument of scikit-learn <a href="http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor" target="_blank">DecisionTreeRegressor</a> to enforce that the tree generalizes well for test data.</p>
<p>One of the biggest advantages of decision trees is their interpretability: see many useful <a href="https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176" target="_blank">visualization articles</a> using standard illustrative datasets.</p>
<h4>Neural Networks</h4>
<p>Scikit-learn makes <a href="http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor" target="_blank">simple neural network</a> training just as simple as building a decision tree:</p>
<pre>from sklearn.neural_network import MLPRegressor
model = MLPRegressor(hidden_layer_sizes=(200, 200), solver="lbfgs", activation="relu")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)</pre>
<p>Training a neural net with 2 hidden layers (of 200 units each) and polynomial features starts taking tens of seconds on an average laptop. To speed up the training process in the next section, I will step away from scikit-learn and use <a href="https://keras.io/" target="_blank">Keras</a> with TensorFlow backend.</p>
<p>Keras API is equally simple. The project even includes <a href="https://keras.io/scikit-learn-api/#wrappers-for-the-scikit-learn-api" target="_blank">wrappers for scikit-learn</a> to take advantage of scikit’s research libraries.</p>
<pre>from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
input_size = len(X_train[0])
model.add(Dense(200, activation="relu", input_dim=input_size))
model.add(Dense(200, activation="relu"))
model.add(Dense(1, activation="linear"))
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=25, verbose=1)
y_pred = model.predict(X_test)</pre>
<h4>Hyperparameter Optimization</h4>
<p>The trick to doing hyperparameter optimization is to understand that parameters should not be treated independently. Many parameters interact with each other which is why exhaustive <a href="https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search" target="_blank">grid search</a> is often performed. However, the problem with grid search is that it becomes expensive very quickly.</p>
<h4>Decision Tree</h4>
<p>Our decision tree grid search will iterate over the following inputs (a code sketch follows the list):</p>
<ul><li>splitter: strategy used to split nodes (best or random)</li></ul>
<ul><li>max depth of the tree</li></ul>
<ul><li>min samples per split: the minimum number of samples required to split an internal node</li></ul>
<ul><li>max leaf nodes: number or None (allow unlimited number of leaf nodes)</li></ul>
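<p>A sketch of what such a grid search might look like with scikit-learn; the value ranges below are illustrative assumptions, not the exact grids behind the results that follow:</p>
<pre>from sklearn import tree
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {
    "splitter": ["best", "random"],
    "max_depth": [2, 3, 5, 10],
    "min_samples_split": [2, 5, 10],
    "max_leaf_nodes": [10, 100, None]
}
grid = GridSearchCV(
    tree.DecisionTreeRegressor(random_state=1),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # keep the temporal ordering from the Cross Validation section
    scoring="r2"
)
grid.fit(X_train, y_train)
print(grid.best_params_)</pre>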
<p>Illustrative grid search results are below:</p>
<p><a href="/theme/images/1*CQHDbOr3_ZO7oWkdOYdKQw.png.png"><img src="/theme/images/1*CQHDbOr3_ZO7oWkdOYdKQw.png.png" alt="Grid Search Decision Tree — first rows" style="width: 100%" loading="lazy"></a>
<a href="/theme/images/1*DQHZ9reWIOMF6fq9eKA8IQ.png.png"><img src="/theme/images/1*DQHZ9reWIOMF6fq9eKA8IQ.png.png" alt="Grid Search Decision Tree — last rows" style="width: 100%" loading="lazy"></a></p>
<p>Performance using the best parameters:</p>
<p><a href="/theme/images/1*2qHu4Z1DiJGmx440QCyTAA.png.png"><img src="/theme/images/1*2qHu4Z1DiJGmx440QCyTAA.png.png" alt="Decision Tree results" style="width: 100%" loading="lazy"></a></p>
<p>Again, the results do not seem to be very promising. They appear to be better than linear regression (lower MAE and MSE) but R² is still too low to be useful. I would conclude, however, that the greater expressiveness of decision trees is useful and I would discard the linear regression model at this stage.</p>
<h4>Neural Networks</h4>
<p>Exploring the hyperparameters of the neural net built by Keras, we can alter at least the following parameters:</p>
<ul><li>number of hidden layers and/or units in each layer</li></ul>
<ul><li>model <a href="https://keras.io/optimizers/" target="_blank">optimizer</a> (SGD, Adam, etc)</li></ul>
<ul><li><a href="https://keras.io/activations/" target="_blank">activation function</a> in each layer (relu, tanh)</li></ul>
<ul><li>batch size: the number of samples per gradient update</li></ul>
<ul><li>epochs to train: the number of iterations over the entire training dataset</li></ul>
<p>Illustrative grid search results are below:</p>
<p><a href="/theme/images/1*c_fhFAu5NkphQXM6Do8QXQ.png.png"><img src="/theme/images/1*c_fhFAu5NkphQXM6Do8QXQ.png.png" alt="Grid Search Neural Net — first rows" style="width: 100%" loading="lazy"></a>
<a href="/theme/images/1*CZIYuZ9UrEdoVkWhOtfIWQ.png.png"><img src="/theme/images/1*CZIYuZ9UrEdoVkWhOtfIWQ.png.png" alt="Grid Search Neural Net — last rows" style="width: 100%" loading="lazy"></a></p>
<p>Using the best parameters, we obtain the following performance metrics:</p>
<p><a href="/theme/images/1*auWhs2uGbBher9adskreog.png.png"><img src="/theme/images/1*auWhs2uGbBher9adskreog.png.png" alt="Keras MAE, MSE, R2" style="width: 100%" loading="lazy"></a></p>
<p>Neural net and decision tree results are similar which is common. Both algorithms have very expressive hypothesis spaces and often produce comparable results. If I achieve comparable results, I tend to use the decision tree model for its faster training times and greater interpretability.</p>
<h4>Project Reflection</h4>
<p>At this stage it becomes clear that no model can be used in production. While the decision tree model appears to perform the best, its performance on testing data is still unreliable. At this stage, it would be time to go back and find additional features and/or data sources.</p>
<p>As I mentioned in the first <a href="https://medium.com/@adam5ny/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">Preprocessing Tutorial</a>, finance practitioners might spend months sourcing data and building features. Domain-specific knowledge is crucial and I would argue that financial markets exhibit at least the <a href="https://www.investopedia.com/exam-guide/cfa-level-1/securities-markets/weak-semistrong-strong-emh-efficient-market-hypothesis.asp" target="_blank">Weak-Form of Efficient Market Hypothesis</a>. This implies that future stock returns cannot be predicted from past price movements. I have used only past price movements to develop the models above so practitioners would notice already in the first tutorial that results would not be promising.</p>
<p>For the sake of completing this tutorial, I will go ahead and save the decision tree model and use it for illustrative purposes in the next sections of this tutorial (as if it were the Final production model):</p>
<pre>pickle.dump(model, open("dtree_model.pkl", "wb"))</pre>
<p>Important: there are <a href="https://www.cs.uic.edu/~s/musings/pickle/" target="_blank">known security vulnerabilities</a> in the Python pickle library. To stay on the safe side, the key takeaway is to never unpickle data you did not create.</p>
<h4>Tools</h4>
<p>Tooling is a common question but often not critical until the project is composed of tens of thousands of examples and at least hundreds of features. I typically start with scikit-learn and move elsewhere when performance becomes the bottleneck. <a href="https://www.tensorflow.org/" target="_blank">TensorFlow</a>, for example, is not just a deep learning framework but also contains other algorithms such as <a href="https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor" target="_blank">LinearRegressor</a>. We could train Linear Regression above with TensorFlow and GPUs if scikit-learn does not perform well enough.</p>
<p>Other tutorials in this series: <a href="https://medium.com/coinmonks/machine-learning-tutorial-1-preprocessing-d90198e37577" target="_blank">#1 Preprocessing</a>, #2 Training (this article), <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>Machine Learning Tutorial #1: Preprocessing2018-08-05T00:00:00-05:002018-08-05T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2018-08-05:/blog/machine-learning-tutorial-1-preprocessing.html<p>In this machine learning tutorial, I will explore 4 steps that define a typical machine learning project: Preprocessing, Learning, Evaluation, and Prediction (deployment). In this first part, I will complete the Preprocessing step. Other tutorials in this series: #1 Preprocessing (this article), <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>
<p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>I will …</p><p>In this machine learning tutorial, I will explore 4 steps that define a typical machine learning project: Preprocessing, Learning, Evaluation, and Prediction (deployment). In this first part, I will complete the Preprocessing step. Other tutorials in this series: #1 Preprocessing (this article), <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>
<p><a href="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png"><img src="/theme/images/1*iPgIcpnc-nzkigs6RaTZBw.png.png" alt="Machine Learning project overview. Author: Adam Novotny" style="width: 100%" loading="lazy"></a></p>
<p>I will use stock price data as the main dataset. There are a few reasons why this is a good choice for the tutorial:</p>
<ul><li>The dataset is public by definition and can be easily downloaded from multiple sources so anyone can replicate the work.</li></ul>
<ul><li>Not all features are immediately available from the source and need to be extracted using domain knowledge, resembling real life.</li></ul>
<ul><li>The outcome of the project is highly uncertain which again simulates real life. Billions of dollars are thrown at the stock price prediction problem every year and the vast majority of projects fail. This tutorial is therefore not about creating a magical money-printing machine; it is about replicating the experience a machine learning engineer might have with a project.</li></ul>
<p>All code is located at the following <a href="https://github.com/adam5ny/blogs/tree/master/ml-preprocessing" target="_blank">Github repo</a>. The file “preprocessing.py” drives the analysis. Python 3.6 is recommended and the file includes directions to setup all necessary dependencies.</p>
<p>First we need to download the dataset. I will somewhat arbitrarily choose the Microsoft stock data (source: <a href="https://finance.yahoo.com/quote/MSFT/history?p=MSFT" target="_blank">Yahoo Finance</a>). I will use the entire available history which at the time of writing includes 3/13/1986 — 7/30/2018. The share price performed as follows during this period:</p>
<p><a href="/theme/images/1*lR8eaHKYLjtKsZY_J19pog.png.png"><img src="/theme/images/1*lR8eaHKYLjtKsZY_J19pog.png.png" alt="MSFT stock price. Source https://finance.yahoo.com/chart/MSFT" style="width: 100%" loading="lazy"></a></p>
<p>The price movement is interesting because it exhibits at least two modes of behavior:</p>
<ul><li>the steep rise until the year 2000 when tech stocks crashed</li></ul>
<ul><li>the sideways movement since 2000</li></ul>
<p>This makes for a number of interesting machine learning complexities such as the sampling of training and testing data.</p>
<h4>Data Cleaning</h4>
<p>After some simple manipulations and loading the CSV data into a pandas DataFrame, we have the following dataset, where open, high, low, and close represent prices on each date and volume the total number of shares traded.</p>
<p><a href="/theme/images/1*psQ_9EoBHpiN78QgreAVOQ.png.png"><img src="/theme/images/1*psQ_9EoBHpiN78QgreAVOQ.png.png" alt="Raw dataset includes columns: date, prices (open, high, low, close), trading volume" style="width: 100%" loading="lazy"></a>
<a href="/theme/images/1*pF4V5GC6b2vfC-koQkAynw.png.png"><img src="/theme/images/1*pF4V5GC6b2vfC-koQkAynw.png.png" alt="Raw dataset includes columns: date, prices (open, high, low, close), trading volume" style="width: 100%" loading="lazy"></a></p>
<p>There are no missing values, which I confirmed by running the following command:</p>
<pre>missing_values_count = df.isnull().sum()</pre>
<p><a href="/theme/images/1*RxQtAFDbviXDbYcxU8j02Q.png.png"><img src="/theme/images/1*RxQtAFDbviXDbYcxU8j02Q.png.png" alt="No missing values in dataset" style="width: 100%" loading="lazy"></a></p>
<p>Outliers are the next topic I need to address. The key point here is that our dataset contains prices, but prices are not the metric I will attempt to forecast because they are measured in absolute terms and are therefore hard to compare across time and across assets. In the tables above, the first price available is ~$0.07 while the last is $105.37.</p>
<p>Instead, I will attempt to forecast daily returns. For example, at the end of the second trading day the return was +3.6% (0.073673/0.071132). I will therefore create a return column and use it to analyze possible outliers.</p>
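<p>A minimal sketch, assuming the column names shown above, of how the return column and the outlier tables below could be produced with pandas (the repo's exact definition of the return and the CSV filename may differ):</p>
<pre>import pandas as pd

# hypothetical filename for the Yahoo Finance download
df = pd.read_csv("MSFT.csv")

# daily return: percent change of the closing price
# (equivalent to close / close.shift(1) - 1)
df["return"] = df["close"].pct_change()

# inspect candidate outliers at both tails of the distribution
print(df["return"].nsmallest(5))
print(df["return"].nlargest(5))</pre>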
<p>The 5 smallest daily returns present in the dataset are the following:</p>
<p><a href="/theme/images/1*-FluO_dSIB7Gc8Rlvgry_w.png.png"><img src="/theme/images/1*-FluO_dSIB7Gc8Rlvgry_w.png.png" alt="5 smallest daily returns" style="width: 100%" loading="lazy"></a></p>
<p>And 5 largest daily returns:</p>
<p><a href="/theme/images/1*CvJuQYGojLfLlxpnm9Ut1w.png.png"><img src="/theme/images/1*CvJuQYGojLfLlxpnm9Ut1w.png.png" alt="5 largest daily returns" style="width: 100%" loading="lazy"></a></p>
<p>The most negative return is -30% (index 405) and the largest is 20% (index 3692). Normally, a further domain-specific analysis of the outliers would be necessary here. I will skip it because this tutorial outlines the process for illustrative purposes only. Generally, the data appears to make sense given that the market crashes of 1987 and 2000, both associated with extreme volatility, fall within this period.</p>
<p>The same analysis would be required for the open, high, low, and volume columns; a few illustrative sanity checks are sketched below. Admittedly, data cleaning was somewhat academic here because Yahoo Finance is a widely used and reliable source. It is still a useful exercise for understanding the data.</p>
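<p>For illustration, checks of this kind might look as follows (a sketch only; the column names are assumed to match the dataset above):</p>
<pre># each day's high should bound the other prices from above, the low from below
assert (df["high"] >= df[["open", "close", "low"]].max(axis=1)).all()
assert (df["low"] <= df[["open", "close", "high"]].min(axis=1)).all()

# traded volume should never be negative
assert (df["volume"] >= 0).all()

# summary statistics as a quick plausibility check
print(df[["open", "high", "low", "close", "volume"]].describe())</pre>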
<h4>Target Variable Selection</h4>
<p>We need to define what our ML algorithms will attempt to forecast. Specifically, we will forecast next day’s return. The timing of returns is important here so we are not mistakenly forecasting today’s or yesterday’s return. The formula to define tomorrow’s return as our target variable is as follows:</p>
<pre>df["y"] = df["return"].shift(-1)</pre>
<h4>Feature Extraction</h4>
<p>Now I will turn to some simple transformations of the prices, returns, and volume to <a href="https://en.wikipedia.org/wiki/Feature_extraction" target="_blank">extract features</a> that ML algorithms can consume. Finance practitioners have developed hundreds of such features, but I will only show a few. Hedge funds spend the vast majority of their time on this step because ML algorithms are generally only as useful as the data available, aka “garbage in, garbage out”.</p>
<p>One feature we might consider is how today’s closing price relates to that of 5 trading days ago (one calendar week). I call this feature “5d_momentum”:</p>
<pre>df["5d_momentum"] = df["close"] / df["close"].shift(5)</pre>
<p><a href="/theme/images/1*4dWC4F1sqjmpW5dohmF-Mg.png.png"><img src="/theme/images/1*4dWC4F1sqjmpW5dohmF-Mg.png.png" alt="New 5d_momentum feature" style="width: 100%" loading="lazy"></a></p>
<p>One typical trend following feature is <a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:moving_average_convergence_divergence_macd" target="_blank">MACD</a> (Moving Average Convergence/Divergence Oscillator). The strengths of pandas shine here because MACD can be created in only 4 lines of code. The chart of the MACD indicator is below. On the lower graph, a typical buy signal would be the blue “macd_line” crossing above the orange line representing a 9-day exponential moving average of the “macd_line”. The inverse would represent a sell signal.</p>
<p><a href="/theme/images/1*cA6MrDLu1Fuwd4pIDoS0fQ.png.png"><img src="/theme/images/1*cA6MrDLu1Fuwd4pIDoS0fQ.png.png" alt="MACD of stock price" style="width: 100%" loading="lazy"></a></p>
<p>The python code “generate_features.py” located in the Github repo mentioned above includes additional features we might consider. For example:</p>
<ul><li><a href="https://www.investopedia.com/articles/active-trading/052014/how-use-moving-average-buy-stocks.asp" target="_blank">Trend: Moving Average</a></li></ul>
<p><a href="/theme/images/1*2ip_ErJJ73742mxoNoGknA.png.png"><img src="/theme/images/1*2ip_ErJJ73742mxoNoGknA.png.png" alt="MSFT Moving Average 50 day — 200 day" style="width: 100%" loading="lazy"></a></p>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:parabolic_sar" target="_blank">Trend: Parabolic SAR</a></li></ul>
<p><a href="/theme/images/1*uRA6nhA4QpXoqhl6dfECrg.png.png"><img src="/theme/images/1*uRA6nhA4QpXoqhl6dfECrg.png.png" alt="MSFT SAR" style="width: 100%" loading="lazy"></a></p>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:stochastic_oscillator_fast_slow_and_full" target="_blank">Momentum: Stochastic Oscillator</a></li></ul>
<p><a href="/theme/images/1*Qt0JrOJuvdUBelJ_ddGO1g.png.png"><img src="/theme/images/1*Qt0JrOJuvdUBelJ_ddGO1g.png.png" alt="MSFT Stochastic Oscillator" style="width: 100%" loading="lazy"></a></p>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:commodity_channel_index_cci" target="_blank">Momentum: Commodity Channel Index (CCI)</a></li></ul>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:relative_strength_index_rsi" target="_blank">Momentum: Relative Strength Index (RSI)</a></li></ul>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:bollinger_bands" target="_blank">Volatility: Bollinger Bands</a></li></ul>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:average_true_range_atr" target="_blank">Volatility: Average True Range</a></li></ul>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:on_balance_volume_obv" target="_blank">Volume: On Balance Volume (OBV)</a></li></ul>
<ul><li><a href="https://stockcharts.com/school/doku.php?id=chart_school:technical_indicators:chaikin_oscillator" target="_blank">Volume: Chaikin Oscillator</a></li></ul>
<p>At the end of the feature extraction process, we have the following features:</p>
<pre>['return', 'close_to_open', 'close_to_high', 'close_to_low', 'macd_diff', 'ma_50_200', 'sar', 'stochastic_oscillator', 'cci', 'rsi', '5d_volatility', '21d_volatility', '60d_volatility', 'bollinger', 'atr', 'on_balance_volume', 'chaikin_oscillator']</pre>
<h4>Sampling</h4>
<p>We need to split the data into training and testing buckets. I cannot stress enough that the testing dataset should never be used in the Learning step. It will be used only in the Evaluation step so that performance metrics are completely independent of training and represent an unbiased estimate of actual performance.</p>
<p>Normally, we could randomize the sampling of the testing data, but time series data is often not well suited for randomized sampling. The reason is that it would bias the learning process. For example, randomization could produce a situation where a data point from 1/1/2005 is used in the Learning step to later forecast a return from 1/1/2003.</p>
<p>I will therefore choose a much simpler approach and use the first 7000 samples as the training dataset for Learning and the remaining 962 as the testing dataset for Evaluation.</p>
<p>Both datasets will be saved as CSV files, so we conclude this part of the ML tutorial by storing 4 files (MSFT_X_learn.csv, MSFT_y_learn.csv, MSFT_X_test.csv, MSFT_y_test.csv). These will be consumed by the next steps of this tutorial.</p>
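<p>A sketch of the chronological split and the CSV export described above (the "y" target column and the construction of the feature column list are assumptions based on the steps in this article):</p>
<pre># keep the feature columns built earlier; "y" holds tomorrow's return
feature_cols = [c for c in df.columns if c != "y"]
X_train, y_train = df[feature_cols].iloc[:7000], df["y"].iloc[:7000]
X_test, y_test = df[feature_cols].iloc[7000:], df["y"].iloc[7000:]

X_train.to_csv("MSFT_X_learn.csv", index=False)
y_train.to_csv("MSFT_y_learn.csv", index=False)
X_test.to_csv("MSFT_X_test.csv", index=False)
y_test.to_csv("MSFT_y_test.csv", index=False)</pre>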
<h4>Scaling</h4>
<p>Feature scaling is used to reduce the time to Learn. This typically applies to <a href="https://en.wikipedia.org/wiki/Feature_scaling#Application" target="_blank">stochastic gradient descent and SVM</a>.</p>
<p>The open source<a href="http://scikit-learn.org/stable/index.html" target="_blank"> sklearn</a> package will be used for most additional ML application so I will start using it here to <a href="http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling" target="_blank">scale all features</a> to have zero mean and unit variance:</p>
<pre>from sklearn import preprocessing
scaler_model = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler_model.transform(X_train)
X_test_scaled = scaler_model.transform(X_test)</pre>
<p>It is important that data sampling takes place before features are modified to avoid any training to testing data leakage.</p>
<h4>Dimensionality Reduction</h4>
<p>At this stage, our dataset has 17 features. The number of features has a significant impact on the speed of learning. We could use a number of techniques to reduce the number of features so that only the most “useful” ones remain.</p>
<p>Many hedge funds would be working with hundreds of features at this stage, so dimensionality reduction would be critical. In our case, we only have 17 illustrative features, so I will keep them all in the dataset until I explore the learning times of different algorithms.</p>
<p>Out of curiosity, however, I will perform Principal Component Analysis <a href="http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html" target="_blank">(PCA)</a> to get an idea of how many features we could create from our dataset without losing meaningful explanatory power.</p>
<pre>from sklearn.decomposition import PCA
sk_model = PCA(n_components=10)
sk_model.fit_transform(features_ndarray)
print(sk_model.explained_variance_ratio_.cumsum())
[0.30661571 0.48477408 0.61031358 0.71853895 0.78043556 0.83205298
0.8764804 0.91533986 0.94022672 0.96216244]</pre>
<p>The first 8 components explain 91.5% of the data variance. The downside of PCA is that the new features live in a lower-dimensional space, so they no longer correspond to real-life concepts. For example, the first original feature could be the “macd_line” I derived above. After PCA, the first component explains 31% of the variance, but we no longer have any logical description of what it represents in real life.</p>
<p>For now, I will keep all 17 original features, but note that if the learning time of the algorithms is too slow, PCA will be helpful.</p>
<p>Other tutorials in this series: #1 Preprocessing (this article), <a href="https://medium.com/coinmonks/machine-learning-tutorial-2-training-f6f735830838" target="_blank">#2 Training</a>, <a href="https://medium.com/@adam5ny/machine-learning-tutorial-3-evaluation-a157f90914c9" target="_blank">#3 Evaluation</a> , <a href="https://medium.com/@adam5ny/machine-learning-tutorial-4-deployment-79764123e9e1" target="_blank">#4 Prediction</a></p>Linear programming in Python: CVXOPT and game theory2017-08-16T00:00:00-05:002017-08-16T00:00:00-05:00Adam Novotnytag:adamnovotny.com,2017-08-16:/blog/linear-programming-in-python-cvxopt-and-game-theory.html<p>CVXOPT is an excellent Python package for linear programming. However, when I was getting started with it, I spent way too much time getting it to work with simple game theory example problems. This tutorial aims to shorten the startup time for everyone trying to use CVXOPT for more advanced …</p><p>CVXOPT is an excellent Python package for linear programming. However, when I was getting started with it, I spent way too much time getting it to work with simple game theory example problems. This tutorial aims to shorten the startup time for everyone trying to use CVXOPT for more advanced problems.</p>
<p>All code is available <a href="http://github.com/adam5ny/blogs/tree/master/cvxopt" target="_blank">here</a>.</p>
<p>Installation of dependencies:</p>
<ul><li>Using Docker is the fastest way to run the code: in only 5 commands you can replicate my environment.</li></ul>
<ul><li>Alternatively, the code has the following dependencies: Python (3.5.3), numpy (1.12.1), cvxopt (1.1.9), and the glpk optimizer (you can use the default optimizer instead; glpk is better for some more advanced problems).</li></ul>
<p>Please review <a href="http://cvxopt.org/examples/tutorial/lp.html" target="_blank">how CVXOPT solves simple maximization problems</a>. While this article focuses on game theory problems, it is critical to understand how CVXOPT defines optimization problems in general.</p>
<p>The first problem we will solve is a <a href="http://en.wikipedia.org/wiki/Minimax#Example" target="_blank">2-player zero-sum game</a>.</p>
<p>The constraints matrix A is defined as</p>
<pre>A = [[3, -2, 2], [-1, 0, 4], [-4, -3, 1]]</pre>
<p>Next, we define a maxmin helper function:</p>
<pre>import numpy as np
from cvxopt import matrix, solvers

def maxmin(A, solver="glpk"):
    num_vars = len(A)
    # minimize matrix c
    c = [-1] + [0 for i in range(num_vars)]
    c = np.array(c, dtype="float")
    c = matrix(c)
    # constraints G*x <= h
    G = np.matrix(A, dtype="float").T  # reformat: each variable is in a row
    G *= -1  # minimization constraint
    G = np.vstack([G, np.eye(num_vars) * -1])  # > 0 constraint for all vars
    new_col = [1 for i in range(num_vars)] + [0 for i in range(num_vars)]
    G = np.insert(G, 0, new_col, axis=1)  # insert utility column
    G = matrix(G)
    h = ([0 for i in range(num_vars)] +
         [0 for i in range(num_vars)])
    h = np.array(h, dtype="float")
    h = matrix(h)
    # constraints A*x = b
    A = [0] + [1 for i in range(num_vars)]
    A = np.matrix(A, dtype="float")
    A = matrix(A)
    b = np.matrix(1, dtype="float")
    b = matrix(b)
    sol = solvers.lp(c=c, G=G, h=h, A=A, b=b, solver=solver)
    return sol</pre>
<p>Last, we use the maxmin helper function to solve our example problem:</p>
<pre>sol = maxmin(A=A, solver="glpk")
probs = sol[“x”]
print(probs)
[ 1.67e-01]
[ 8.33e-01]
[ 0.00e+00]</pre>
<p>In other words, player A chooses action 1 with probability 1/6 and action 2 with probability 5/6.</p>
<p>Next we will solve a Correlated Equilibrium problem called Game of Chicken as defined on page 3 of <a href="http://www.cs.rutgers.edu/~mlittman/topics/nips02/nips02/greenwald.ps" target="_blank">this document</a>. The constraints matrix A is defined as</p>
<pre>A = [[6, 6], [2, 7], [7, 2], [0, 0]]</pre>
<p>Next, we define two helper functions, ce and build_ce_constraints:</p>
<pre>def ce(A, solver=None):
    num_vars = len(A)
    # maximize matrix c
    c = [sum(i) for i in A]  # sum of payoffs for both players
    c = np.array(c, dtype="float")
    c = matrix(c)
    c *= -1  # cvxopt minimizes so *-1 to maximize
    # constraints G*x <= h
    G = build_ce_constraints(A=A)
    G = np.vstack([G, np.eye(num_vars) * -1])  # > 0 constraint for all vars
    h_size = len(G)
    G = matrix(G)
    h = [0 for i in range(h_size)]
    h = np.array(h, dtype="float")
    h = matrix(h)
    # constraints A*x = b
    A = [1 for i in range(num_vars)]
    A = np.matrix(A, dtype="float")
    A = matrix(A)
    b = np.matrix(1, dtype="float")
    b = matrix(b)
    sol = solvers.lp(c=c, G=G, h=h, A=A, b=b, solver=solver)
    return sol</pre>
<pre>def build_ce_constraints(A):
    num_vars = int(len(A) ** (1/2))
    G = []
    # row player
    for i in range(num_vars):  # action row i
        for j in range(num_vars):  # action row j
            if i != j:
                constraints = [0 for i in A]
                base_idx = i * num_vars
                comp_idx = j * num_vars
                for k in range(num_vars):
                    constraints[base_idx+k] = (- A[base_idx+k][0]
                                               + A[comp_idx+k][0])
                G += [constraints]
    # column player
    for i in range(num_vars):  # action column i
        for j in range(num_vars):  # action column j
            if i != j:
                constraints = [0 for i in A]
                for k in range(num_vars):
                    constraints[i + (k * num_vars)] = (
                        - A[i + (k * num_vars)][1]
                        + A[j + (k * num_vars)][1])
                G += [constraints]
    return np.matrix(G, dtype="float")</pre>
<p>Using the helper functions, we solve the Game of Chicken:</p>
<pre>sol = ce(A=A, solver="glpk")
probs = sol["x"]
print(probs)
[ 5.00e-01]
[ 2.50e-01]
[ 2.50e-01]
[ 0.00e+00]</pre>
<p>In other words, the optimal strategy is for both players to select actions [6, 6] 50% of the time, actions [2, 7] 25% of the time, and actions [7, 2] also 25% of the time.</p>
<p>Hopefully this overview helps in getting you started with linear programming and game theory in Python.</p>
<p>Credits: <a href="http://cvxopt.org/examples/tutorial/lp.html" target="_blank">cvxopt.org/examples/tutorial/lp.html</a><a href="https://www.cs.duke.edu/courses/fall12/cps270/lpandgames.pdf" target="_blank">, cs.duke.edu/courses/fall12/cps270/lpandgames.pdf</a><a href="https://en.wikipedia.org/wiki/Minimax#Example" target="_blank">, en.wikipedia.org/wiki/Minimax#Example</a><a href="https://www3.ul.ie/ramsey/Lectures/Operations_Research_2/gametheory4.pdf" target="_blank">, https://www3.ul.ie/ramsey/Lectures/Operations_Research_2/gametheory4.pdf</a><a href="https://www.cs.rutgers.edu/~mlittman/topics/nips02/nips02/greenwald.ps" target="_blank">, cs.rutgers.edu/~mlittman/topics/nips02/nips02/greenwald.ps</a><a href="https://www.cs.duke.edu/courses/fall16/compsci570/LPandGames.pdf" target="_blank">, cs.duke.edu/courses/fall16/compsci570/LPandGames.pdf</a></p>