Machine Learning Notes
This alphabetically sorted collection of AI, ML, and data resources was last updated on 5/16/2021.
 Algorithms
 AdaBoost: Fits a sequence of weak learners on repeatedly modified data; each round reweights the examples based on the errors made by previous learners. scikit tutorial
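A minimal sketch of the procedure above in scikit-learn; the synthetic dataset, n_estimators, and random seeds are illustrative choices, not from the notes.

```python
# Fit an AdaBoost ensemble of weak learners (decision stumps by default)
# on an illustrative synthetic classification dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)  # training accuracy of the boosted ensemble
print(round(acc, 2))
```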
 Classification: scikit comparison
 Expectation-maximization (EM): the algorithm starts from random component parameters and computes, for each point, the probability that it was generated by each component of the model; it then iteratively tweaks the parameters to maximize the likelihood of the data given those assignments. Example: Gaussian Mixture
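A hedged sketch of EM via scikit-learn's GaussianMixture; the two well-separated 1-D clusters are an illustrative setup.

```python
# EM should recover component means near the true cluster centers (0 and 10).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100),
                    rng.normal(10, 1, 100)]).reshape(-1, 1)
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
means = sorted(gm.means_.ravel())
print(means)
```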
 Gradient Boosting: optimization of arbitrary differentiable loss functions.
 K-means: aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion. Use the “elbow” method to identify the right number of clusters. scikit tutorial
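A sketch of the elbow method on an illustrative blob dataset: inertia always decreases as k grows, so you look for the k where the drop flattens.

```python
# Compute inertia (within-cluster sum of squares) for a range of k values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
print([round(i) for i in inertias])  # flattens after k=3, the true center count
```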
 KNN: Simple, flexible, naturally handles multiple classes. Slow at scale, sensitive to feature scaling and irrelevant features. scikit tutorial
 Linear Discriminant Analysis (LDA): A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix. scikit tutorial
 Naive Bayes: uses naive conditional independence assumption of features. scikit
 PCA: transforms data using the k vectors that minimize the perpendicular distance to the points. PCA can also be thought of as an eigenvalue/eigenvector decomposition. scikit. Intuition paper
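A small sketch of the decomposition view: PCA's components are eigenvectors of the covariance matrix, and the explained-variance ratios come out sorted in decreasing order. The data with a deliberately redundant column is illustrative.

```python
# Fit PCA on data where one column is a multiple of another; the
# redundant direction adds no new variance of its own.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = np.column_stack([X, X[:, 0] * 2])   # redundant (correlated) 4th column
pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_  # sorted descending, sums to 1
print(ratios)
```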
 Random Forests: each tree is built using a sample of rows (with replacement) from training set. Less prone to overfitting. scikit
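Because each tree sees a bootstrap sample of the rows, the rows a tree did not see give a free validation estimate (the out-of-bag score). A hedged sketch, with an illustrative synthetic dataset:

```python
# bootstrap=True (the default) resamples rows with replacement per tree,
# so oob_score_ estimates accuracy on rows each tree left out.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)
clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             random_state=0).fit(X, y)
print(round(clf.oob_score_, 2))
```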
 Stochastic gradient descent tutorial. Calculus solution:
 SVD: Singular Value Decomposition intuition with PCA use case
 SVM: Effective in high dimensional spaces (or when number of dimensions > number of examples). SVMs do not directly provide probability estimates. scikit
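A sketch of the "no direct probability estimates" point: by default, scikit-learn's SVC exposes margins via decision_function, and predict_proba only exists if probability=True is set (which fits an extra cross-validated calibration step). The dataset is illustrative.

```python
# With the default probability=False, SVC offers margins, not probabilities.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
clf = SVC().fit(X, y)                    # probability=False by default
margins = clf.decision_function(X[:3])   # signed distances to the boundary
print(margins)
has_proba = hasattr(clf, "predict_proba")
print(has_proba)
```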
 Anomaly detection
 Future examples may look nothing like the past. This is where anomaly detection differs from supervised learning, which assumes that future examples fall within the range of the training data. Using Gaussian Mixtures for anomaly detection
 Bayes
 Explainability and bias evaluation tutorial
 Partial dependence plots (PDP): x-axis = value of a single feature, y-axis = predicted label averaged over the dataset. scikit
 Individual conditional expectation (ICE): x-axis = value of a single feature, y-axis = predicted label for a single instance (one curve per instance, rather than the PDP's average). scikit
 Permutation feature importance: Randomly shuffle features and calculate impact on model metrics such as F1. scikit
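A sketch of the shuffle-and-measure idea above using scikit-learn's permutation_importance; the iris dataset, model, and n_repeats are illustrative.

```python
# Shuffle each feature in turn and record the resulting drop in score.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # mean score drop per feature
```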
 Global surrogate: train an easily interpretable model (such as linear regression) on the predictions made by a black-box model
 Local Surrogate: LIME (Local Interpretable Model-agnostic Explanations). Trains simple local models to approximate an individual prediction, perturbing or removing features to learn their impact on that prediction
 Shapley Value (SHAP): the contribution of each feature is measured by adding it to and removing it from all other feature subsets. The Shapley value for one feature is the weighted sum of all its contributions
 Infrastructure
 Data (source a16z)
 ML
 Learning curves scikit tutorial
 Linear regression
 assumptions (LINE) source
 Linearity
 Independence of errors
 Normality of errors
 Equal variances
 Tests of assumptions: i) plot each feature on the x-axis vs. the residuals (y_error), ii) plot y_predicted on the x-axis vs. the residuals, iii) histogram of the residuals.
 An overspecified model can be used for prediction of the label, but should not be used to ascribe the effect of a feature on the label.
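The residual checks above can be sketched with numpy alone; the linear toy data here is illustrative. With an intercept in the model, OLS residuals average to (numerically) zero, and the diagnostic plots look for structure around that zero line.

```python
# Fit a least-squares line and compute the residuals used in checks i-iii.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + 1 + rng.normal(0, 1, 200)   # illustrative linear data
b1, b0 = np.polyfit(x, y, 1)            # least-squares fit y = b1*x + b0
y_pred = b1 * x + b0
resid = y - y_pred                      # plot these vs. x, vs. y_pred, and as a histogram
print(round(float(resid.mean()), 6))    # ~0 for a well-specified model
```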
 Linear algebra solution
 ML types
 Means
 Arithmetic: wolfram
 Geometric: used in finance to calculate average growth rates; the result is referred to as the compound annual growth rate (CAGR). wolfram
 Harmonic: used in finance to average multiples like the price-earnings ratio because it gives equal weight to each data point. Using a weighted arithmetic mean to average these ratios would give greater weight to high data points than to low ones, because price-earnings ratios aren't price-normalized while the earnings are equalized. wolfram
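The three means above are all in the standard library; the growth factors here are illustrative. For positive data, arithmetic ≥ geometric ≥ harmonic always holds.

```python
# Compare the three means on illustrative annual growth factors.
from statistics import mean, geometric_mean, harmonic_mean

growth = [1.10, 1.20, 0.95]
a, g, h = mean(growth), geometric_mean(growth), harmonic_mean(growth)
print(round(a, 4), round(g, 4), round(h, 4))
```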
 ML Lifecycle AWS blog
 Model evaluation metrics
 Classification:
 Regression
 R2: strength of a linear relationship. Can be 0 for nonlinear relationships. Never decreases as more features are added. scikit
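A quick sketch of the two boundary cases: a perfect fit gives R2 = 1, and always predicting the mean of y gives R2 = 0. The values are illustrative.

```python
# R2 boundary cases with scikit-learn's r2_score.
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_score(y_true, y_true)       # -> 1.0
baseline = r2_score(y_true, [6.0] * 4)   # 6.0 is the mean of y_true -> 0.0
print(perfect, baseline)
```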
 Overfitting and regularization
 Overfitting (high variance) options: more data, increase regularization, or decrease model complexity. tutorial
 Lasso regression: linear model regularization technique with a tendency to prefer solutions with fewer non-zero coefficients. scikit tutorial.
 Ridge regression: imposes a penalty on the size of the coefficients scikit
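A sketch contrasting the two penalties above: the L1 (Lasso) penalty zeroes out irrelevant coefficients, while the L2 (Ridge) penalty only shrinks them. The data and alpha values are illustrative choices.

```python
# Only feature 0 matters in this synthetic regression problem.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 100)
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
# Count coefficients that are exactly zero under each penalty.
print(int(np.sum(lasso.coef_ == 0)), int(np.sum(ridge.coef_ == 0)))
```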
 Validation curve: scikit
 Pearson’s correlation coefficient. wiki.

Preprocessing scikit

Analysis
 Remove duplicates
 SOCS of each feature: Shape (skew), Outliers, Center, Spread
 Feature correlation

Production pipeline
 Outliers: remove or apply nonlinear transformations
 Missing values
 SMOTE: generates a new point on the vector between a minority-class point and one of its k nearest neighbors, placed a random fraction in [0, 1] of the way from the original point. The algorithm is parameterized with k_neighbors. tutorial
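A hand-rolled sketch of the interpolation step described above (in practice, use imbalanced-learn's SMOTE); the toy points, k_neighbors value, and smote_point helper are all illustrative, not a library API.

```python
# Synthesize one new minority-class point between an existing point
# and a randomly chosen near neighbor.
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(0, 1, size=(20, 2))   # toy minority-class points

def smote_point(X, i, k_neighbors=5):
    d = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.argsort(d)[1:k_neighbors + 1]  # skip the point itself
    j = rng.choice(neighbors)
    gap = rng.uniform(0.0, 1.0)                   # fraction along the vector
    return X[i] + gap * (X[j] - X[i])

synthetic = smote_point(minority, 0)
print(synthetic)
```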
 Standardization
 Discretization
 Encoding categorical features
 Generating polynomial features
 Dimensionality reduction


Probability distributions. Description acronym SOCS: shape, outliers, center, spread. Comparison article.
 Beta: probability distribution on probabilities bounded [0, 1]. tutorial
 Binomial: probability of obtaining k successes in n binomial experiments with probability p. tutorial
 Normal: the empirical rule is sometimes called the 68-95-99.7 rule
 Poisson: the probability of observing k events during a given time interval. tutorial
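Both the binomial and Poisson pmfs above are small enough to sketch with the standard library; the k, n, p, and lam values are illustrative.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    # P(k successes in n trials with success probability p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(k events in an interval with average event rate lam)
    return lam**k * exp(-lam) / factorial(k)

print(binom_pmf(2, 4, 0.5))              # -> 0.375
print(round(poisson_pmf(2, 3.0), 4))
```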
 Reinforcement learning
 Sample variance: divided by n-1 to achieve an unbiased estimator, because 1 degree of freedom is used to estimate the sample mean. tutorial
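The n vs. n-1 divisor can be sketched with the stdlib statistics module; the data values are illustrative.

```python
# pvariance divides by n (population); variance divides by n-1 (unbiased sample estimator).
import statistics

data = [2.0, 4.0, 6.0]             # mean = 4, squared deviations sum to 8
print(statistics.pvariance(data))  # 8/3
print(statistics.variance(data))   # 8/2 = 4.0
```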
 Sorting: tutorial.
 SQL tricks
 Statistics: Statology tutorial
 Statistical tests
 ANOVA: Analysis of variance compares the means of three or more independent groups to determine if there is a statistically significant difference between the corresponding population means. Statology tutorial
 F-statistic: determines whether to reject a reduced (R) model in favor of a full (F) model. Reject the reduced model if F is large, or equivalently if its associated p-value is small. tutorial
 Linear regression coefficient CI: tutorial
 t-test: tutorial
 Underfitting (high bias) options: decrease regularization, increase model complexity