Machine Learning Notes

This alphabetically sorted collection of AI, ML, and data resources was last updated on 5/16/2021.

  • Algorithms
    • AdaBoost: Fits a sequence of weak learners on repeatedly modified (reweighted) versions of the data. The modifications are based on errors made by previous learners. scikit tutorial
    • Classification: scikit classifier comparison (scikit-learn.org)
    • Expectation-maximization (EM): starts from randomly initialized components and computes, for each point, the probability of being generated by each component of the model (E-step). It then updates the parameters to maximize the likelihood of the data given those assignments (M-step), and repeats until convergence. Example: Gaussian Mixture
    • Gradient Boosting: builds an additive model of weak learners stage by stage, which allows optimization of arbitrary differentiable loss functions.
    • K-means: aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion. Use the “elbow” method to identify the right number of means. scikit tutorial
    • KNN: Simple, flexible, naturally handles multiple classes. Slow at scale, sensitive to feature scaling and irrelevant features. scikit tutorial
    • Linear Discriminant Analysis (LDA): A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix. scikit tutorial
    • Naive Bayes: uses naive conditional independence assumption of features. scikit
    • PCA: project data onto the k orthogonal vectors that minimize the perpendicular distance to the points (equivalently, that capture the most variance). PCA can also be thought of as an eigenvalue/eigenvector decomposition of the covariance matrix. scikit. Intuition paper
    • Random Forests: each tree is built using a bootstrap sample of rows (drawn with replacement) from the training set. Less prone to overfitting than a single decision tree. scikit
    • Stochastic gradient descent: tutorial. Calculus solution: stochastic gradient descent cost function
    • SVD: Singular Value Decomposition intuition with PCA use case
    • SVM: Effective in high dimensional spaces (or when number of dimensions > number of examples). SVMs do not directly provide probability estimates. scikit
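The K-means note above (minimize inertia, then look for an "elbow") can be illustrated with a minimal sketch. It assumes 1-D data and a simple deterministic initialization chosen only for this example; the name kmeans_1d is hypothetical, and this is not how scikit-learn initializes centroids.

```python
def kmeans_1d(points, k, iters=50):
    """Lloyd's algorithm on 1-D data; returns (centroids, inertia)."""
    pts = sorted(points)
    # Assumption of this sketch: spread the initial centroids evenly
    # across the sorted data instead of using random or k-means++ init.
    centroids = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    # Inertia: within-cluster sum of squared distances to the nearest centroid.
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in pts)
    return centroids, inertia

# Three well-separated groups, so the elbow should appear at k = 3.
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
inertias = {k: kmeans_1d(data, k)[1] for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 3))
```

Inertia keeps falling as k grows, so the useful signal is where the drop flattens out; on this toy data the large drops stop after k = 3.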
  • Anomaly detection
    • Future examples may look nothing like the past. This is where anomaly detection differs from supervised learning, which assumes that future examples fall within the range of the training data. Using Gaussian Mixtures for anomaly detection
  • Bayes
  • Explainability and bias evaluation tutorial
    • Partial dependence plots (PDP): x-axis = value of a single feature, y-axis = predicted label averaged over all instances. scikit
    • Individual conditional expectation (ICE): x-axis = value of a single feature, y-axis = predicted label for a single instance, drawn as one line per instance. scikit
    • Permutation feature importance: Randomly shuffle one feature at a time and calculate the impact on model metrics such as F1. scikit
    • Global surrogate: train an easily interpretable model (such as linear regression) on the predictions made by a black box model
    • Local Surrogate: LIME (for Local Interpretable Model-agnostic Explanations). Train an interpretable model around an individual prediction by perturbing or removing features to learn their impact on that prediction
    • Shapley Value (SHAP): The contribution of each feature is measured by adding and removing it from all other feature subsets. The Shapley Value for one feature is the weighted sum of all its contributions
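Permutation feature importance, as described above, can be sketched in a few lines. This is a toy illustration, not the scikit-learn implementation: the model function is a hypothetical stand-in for a black box, accuracy is used instead of F1 to keep it short, and the data are made up.

```python
import random

def model(row):
    # Hypothetical "black box": uses only feature 0 and ignores feature 1.
    return 1 if row[0] > 0.5 else 0

def accuracy(rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(rows, labels, feature, repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    drops = []
    for _ in range(repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with only the chosen feature replaced.
        shuffled = [r[:feature] + (v,) + r[feature + 1:]
                    for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(shuffled, labels))
    return sum(drops) / repeats

# Made-up data: the label depends only on feature 0.
rows = [(i / 20, (i * 7 % 20) / 20) for i in range(20)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]
importances = [permutation_importance(rows, labels, f) for f in (0, 1)]
print(importances)
```

Shuffling the ignored feature leaves every prediction unchanged, so its importance is exactly zero, while shuffling the feature the model relies on lowers accuracy.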
  • Infrastructure
  • Learning curves: scikit tutorial. Learning curve example
  • Linear regression
    • assumptions (LINE) source
      • Linearity
      • Independence of errors
      • Normality of errors
      • Equal variances
      • Tests of assumptions: i) plot each feature on x-axis vs y_error, ii) plot y_predicted on x-axis vs y_error, iii) histogram of errors.
    • Overspecified model can be used for prediction of the label, but should not be used to ascribe the effect of a feature on the label.
    • Linear algebra solution: normal equation
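As a minimal sketch of the normal equation, the one-feature case (intercept b0 plus slope b1) reduces to a 2x2 linear system that can be solved by hand. The function name fit_line and the data are made up for illustration; the residuals computed at the end are the y_error values used in the assumption checks above.

```python
def fit_line(xs, ys):
    """Solve the normal equation (X^T X) b = X^T y for X = [1, x]."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy];
    # invert the 2x2 matrix explicitly.
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
b0, b1 = fit_line(xs, ys)

# Residuals (y_error): plot these against x or against the predictions
# to check linearity and equal variances.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(round(b0, 4), round(b1, 4))
```

With an intercept in the model, the least-squares residuals sum to zero, which is a quick sanity check on the fit.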
  • ML types. ML breakdown: Supervised + Unsupervised + RL
  • Means
    • Arithmetic: wolfram
    • Geometric: used in finance to calculate average growth rates, where it is referred to as the compound annual growth rate (CAGR). wolfram
    • Harmonic: used in finance to average multiples like the price-earnings ratio because it gives equal weight to each data point. Using a weighted arithmetic mean to average these ratios would give greater weight to high data points than low data points because price-earnings ratios aren't price-normalized while the earnings are equalized. wolfram
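The three means can be compared with Python's standard statistics module (geometric_mean requires Python 3.8 or later). The growth factors and price-earnings ratios below are made-up numbers for illustration.

```python
import statistics

# Yearly growth factors: +10%, +20%, -10%.
growth_factors = [1.10, 1.20, 0.90]
# The geometric mean is the constant yearly factor with the same
# compounded result as the three actual years.
avg_growth = statistics.geometric_mean(growth_factors)

# Price-earnings ratios of two hypothetical stocks.
pe_ratios = [10.0, 40.0]
arithmetic_pe = statistics.mean(pe_ratios)
harmonic_pe = statistics.harmonic_mean(pe_ratios)

print(round(avg_growth, 4), arithmetic_pe, harmonic_pe)
```

The harmonic mean of 16 matches the price-earnings ratio of a portfolio holding one dollar of each stock: price 2, earnings 1/10 + 1/40 = 0.125. The arithmetic mean of 25 is pulled toward the high ratio.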
  • ML lifecycle: AWS blog. ML lifecycle
  • Model evaluation metrics
    • Classification:
      • Recall: wiki
      • Receiver operating characteristic (ROC): relates true positive rate (y-axis) and false positive rate (x-axis). TPR = TP / (TP + FN) and FPR = FP / (FP + TN). scikit
    • Regression
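      • R2: proportion of the label's variance explained by the model; for a linear fit, the strength of the linear relationship. Could be 0 for nonlinear relationships. On the training data, never worsens with more features. scikit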
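The TPR and FPR formulas above give one point on the ROC curve per decision threshold. The sketch below computes two such points on made-up labels and scores; the function name rates is hypothetical, and recall is the same quantity as TPR.

```python
def rates(y_true, scores, threshold):
    """Return (TPR, FPR) when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)  # true positive rate, also called recall
    fpr = fp / (fp + tn)  # false positive rate
    return tpr, fpr

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1, 0.05]

# Each threshold yields one (TPR, FPR) point on the ROC curve.
roc_points = [rates(y_true, scores, t) for t in (0.5, 0.25)]
print(roc_points)
```

Lowering the threshold from 0.5 to 0.25 raises the TPR from 0.5 to 1.0 here without adding false positives; sweeping all thresholds traces the full curve.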
  • Overfitting and regularization
    • Overfitting (high variance) options: more data, increase regularization, or decrease model complexity. tutorial
    • Lasso regression: linear model regularization technique with tendency to prefer solutions with fewer non-zero coefficients. scikit tutorial. Lasso equation
    • Ridge regression: imposes a penalty on the size of the coefficients. scikit: Ridge Regression
    • Validation curve: scikit. Validation curve example
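The difference between the two penalties is easiest to see with a single feature and no intercept, where both have closed-form solutions. This sketch assumes the objectives sum((y - x*w)^2) + alpha*w^2 for ridge and 0.5*sum((y - x*w)^2) + alpha*|w| for lasso; scikit-learn scales its lasso objective differently, so the alpha values are not comparable with that library.

```python
import math

def ridge_coef(sxx, sxy, alpha):
    # Minimizer of sum((y - x*w)^2) + alpha * w^2.
    return sxy / (sxx + alpha)

def lasso_coef(sxx, sxy, alpha):
    # Minimizer of 0.5 * sum((y - x*w)^2) + alpha * |w| (soft thresholding).
    shrunk = max(abs(sxy) - alpha, 0.0)
    return math.copysign(shrunk, sxy) / sxx

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x, so the unpenalized coefficient is 2
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

ridge = [ridge_coef(sxx, sxy, a) for a in (0.0, 14.0, 28.0)]
lasso = [lasso_coef(sxx, sxy, a) for a in (0.0, 14.0, 28.0)]
print(ridge)
print(lasso)
```

Ridge shrinks the coefficient toward zero but never reaches it, while lasso sets it to exactly zero once alpha is large enough. That is why lasso tends to produce solutions with fewer non-zero coefficients.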
  • Pearson’s correlation coefficient. wiki. Correlation formula
  • Preprocessing scikit

    1. Analysis

      1. Remove duplicates
      2. SOCS of each feature: Shape (skew), Outliers, Center, Spread
      3. Feature correlation
    2. Production pipeline

      1. Outliers: remove or apply non-linear transformations
      2. Missing values
        • SMOTE (oversampling for class imbalance): generate a new point on the line segment between a minority class point and one of its nearest neighbors, placed a random fraction in [0, 1] of the way from the original point. The algorithm is parameterized with k_neighbors. tutorial
      3. Standardization
      4. Discretization
      5. Encoding categorical features
      6. Generating polynomial features
      7. Dimensionality reduction
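The SMOTE interpolation step can be sketched as follows. This is a simplified illustration on made-up 2-D points, not the imbalanced-learn implementation; the function name smote_point is hypothetical, and only the k_neighbors parameter comes from the notes above.

```python
import random

def smote_point(minority, k_neighbors, rng):
    """Create one synthetic point between a minority point and a near neighbor."""
    i = rng.randrange(len(minority))
    base = minority[i]
    # Rank the other minority points by squared distance to the base point.
    others = [p for j, p in enumerate(minority) if j != i]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbor = rng.choice(others[:k_neighbors])
    # Random fraction of the way from the base point to the neighbor.
    u = rng.random()
    return tuple(a + u * (b - a) for a, b in zip(base, neighbor))

rng = random.Random(42)
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (5.0, 5.0)]
synthetic = [smote_point(minority, k_neighbors=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Every synthetic point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies.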
  • Probability distributions. Describe each with the acronym SOCS: shape, outliers, center, spread. Comparison article. Correlation formula

    • Beta: probability distribution on probabilities bounded [0, 1]. tutorial
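    • Binomial: probability of obtaining k successes in n independent Bernoulli trials, each with success probability p. tutorial
    • Normal: the empirical rule (sometimes called the 68-95-99.7 rule) says about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean
    • Poisson: the probability of observing k events during a given time interval, when events occur independently at a constant average rate. tutorial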
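The binomial and Poisson probabilities and the empirical rule can be checked numerically with the standard library alone (math.comb requires Python 3.8 or later). The function names and the example parameters are chosen for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(k events in an interval with average rate lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def normal_within(z):
    """P(a normal value falls within z standard deviations of the mean)."""
    return math.erf(z / math.sqrt(2))

two_heads_in_four = binomial_pmf(2, 4, 0.5)
no_events = poisson_pmf(0, 2.0)
empirical_rule = [round(normal_within(z), 4) for z in (1, 2, 3)]
print(two_heads_in_four, round(no_events, 4), empirical_rule)
```

The last line recovers the 68-95-99.7 rule from the normal distribution's error function.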
  • Reinforcement learning
  • Sample variance: divide by n-1 (rather than n) to obtain an unbiased estimator, because 1 degree of freedom is used to estimate the mean (b0 in an intercept-only regression). tutorial
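The n versus n-1 divisor is exposed directly by Python's statistics module: pvariance divides by n and variance divides by n-1. The data below are made up.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = statistics.mean(data)

# Sum of squared deviations from the sample mean.
ss = sum((x - mean) ** 2 for x in data)

population_var = statistics.pvariance(data)  # divides by n
sample_var = statistics.variance(data)       # divides by n - 1
print(ss, population_var, round(sample_var, 4))
```

The squared deviations are measured from the sample mean rather than the unknown true mean, which makes them slightly too small on average. Dividing by n-1 compensates for that.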
  • Sorting: tutorial. Ridge Regression
  • SQL tricks
    • Window functions, ROW_NUMBER() with PARTITION BY: tutorial
    • COALESCE(): evaluates the arguments in order and returns the value of the first expression that doesn’t evaluate to NULL. tutorial
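Both tricks can be tried with Python's built-in sqlite3 module, assuming the bundled SQLite is version 3.25 or later (needed for window functions). The sales table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER, discount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 100, None), ("east", 300, 0.1), ("west", 200, None)],
)

# ROW_NUMBER() restarts at 1 within each region (PARTITION BY) and ranks
# rows by amount; COALESCE replaces a NULL discount with 0.
query = """
    SELECT region,
           amount,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn,
           COALESCE(discount, 0) AS discount
    FROM sales
    ORDER BY region, rn
"""
result = conn.execute(query).fetchall()
conn.close()
print(result)
```

Filtering on rn = 1 in an outer query is a common way to keep only the top row per group.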
  • Statistics: Statology tutorial
  • Statistical tests Selecting statistical test. Source: Statistical Rethinking 2. Free Chapter 1
    • ANOVA: Analysis of variance compares the means of three or more independent groups to determine if there is a statistically significant difference between the corresponding population means. Statology tutorial
    • F-statistic: determines whether to reject a reduced model (R) in favor of a full model (F). Reject the reduced model if F is large, or equivalently if its associated p-value is small. tutorial. F-statistic
    • Linear regression coefficient CI: tutorial. t-interval for slope parameter beta_1
    • T-test: tutorial. T-test formula
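As a worked example of the general linear F-test, the sketch below compares a reduced model (intercept only) with a full model (intercept plus slope) on made-up data. The statistic is F = ((SSE_R - SSE_F) / (df_R - df_F)) / (SSE_F / df_F), and a large value favors the full model. Computing the p-value would need the F distribution, which the standard library does not provide, so it is omitted.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Full model: least-squares line y = b0 + b1 * x.
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Reduced model: predict the mean of y for every point.
sse_reduced = sum((y - y_bar) ** 2 for y in ys)
df_reduced = n - 1
sse_full = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
df_full = n - 2

f_stat = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
print(round(f_stat, 4))
```

The numerator measures how much error the extra slope parameter removes, and the denominator is the error variance estimate from the full model.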
  • Transformers
  • Underfitting (high bias)
    • Options: decrease regularization, increase model complexity