Machine Learning Notes
This alphabetically sorted collection of AI, ML, and data resources was last updated on 3/26/2021.
 AdaBoost: Fits a sequence of weak learners on repeatedly modified data. The modifications are based on the errors made by previous learners.
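A minimal sketch of the reweighting idea using scikit-learn's AdaBoostClassifier on a synthetic dataset (the data and parameters are illustrative, not from the note):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Toy dataset; each boosting round up-weights the points that the previous
# weak learner (a decision stump by default) misclassified
X, y = make_classification(n_samples=200, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
train_acc = clf.score(X, y)
```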
 Analysis of variance (ANOVA)
 Bayesian modelling
 Beta Distribution: a probability distribution over probabilities, supported on [0, 1]
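A pure-stdlib sketch of the Beta density, checking numerically that it integrates to roughly 1 over [0, 1] (the parameter values are arbitrary):

```python
from math import gamma

def beta_pdf(x, a, b):
    # Beta(a, b) density on [0, 1]; B(a, b) is the normalizing constant
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

# Midpoint-rule check that the density integrates to ~1
a, b = 2.0, 5.0
n = 10_000
total = sum(beta_pdf((i + 0.5) / n, a, b) for i in range(n)) / n
```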
 Classification algorithms comparison
 Confidence interval: linear regression coefficient
 Data and ML Infrastructure (a16z)
 Expectation-maximization (EM): assumes random components and computes for each point a probability of being generated by each component of the model. Then iteratively tweaks the parameters to maximize the likelihood of the data given those assignments. Example: Gaussian Mixture
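A sketch of EM via scikit-learn's GaussianMixture on two synthetic, well-separated 1-D clusters (the data is made up for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic 1-D clusters centered at -5 and +5
X = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(X)  # runs EM internally
means = sorted(gm.means_.ravel())  # recovered component means
```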

Explainability tutorial:
 Partial dependence plots (PDP): x-axis = value of a single feature, y-axis = predicted label, averaged over the dataset
 Individual conditional expectation (ICE): like PDP, but one curve per instance instead of a dataset average
 Permutation feature importance: randomly shuffle each feature and measure the impact on model metrics such as F1
 Global surrogate: train an easily interpretable model (such as linear regression) on the predictions made by a black-box model
 Local surrogate: LIME (Local Interpretable Model-agnostic Explanations). Train individual models to approximate an individual prediction by removing features to learn their impact on that prediction.
 Shapley Value (SHAP): the contribution of each feature is measured by adding it to and removing it from every subset of the remaining features. The Shapley Value for one feature is the weighted sum of all its contributions
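The permutation-importance idea above can be sketched by hand (the toy dataset, model choice, and use of accuracy instead of F1 are assumptions for the demo):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data: 5 features, only 2 of them actually informative
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
model = LogisticRegression().fit(X, y)
baseline = model.score(X, y)

rng = np.random.default_rng(0)
importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = X[rng.permutation(X.shape[0]), j]  # break feature-label link for column j
    importances.append(baseline - model.score(Xp, y))  # metric drop = importance
```

Shuffling an informative column should cost the model accuracy, while shuffling a noise column should change little.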
 F-statistic: determines whether to reject a reduced (R) model in favor of a full (F) model. Reject the reduced model if F is large, or equivalently if its associated p-value is small
 Gradient Boosting: optimization of arbitrary differentiable loss functions. — Risk of overfitting
 KNN: + Simple, flexible, naturally handles multiple classes. — Slow at scale, sensitive to feature scaling and irrelevant features
 K-means: aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion. Use the “elbow” method to identify the right number of clusters
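A sketch of the elbow method: compute the inertia for a range of k on synthetic blobs and look for the kink (the data and k range are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Inertia = within-cluster sum of squares; plot vs k and look for the "elbow"
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
```

With three true clusters, inertia drops steeply up to k = 3 and only slowly after, producing the elbow.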
 Lasso: linear model regularization technique with tendency to prefer solutions with fewer non-zero coefficients
 Learning Curve
 Linear Discriminant Analysis (LDA): A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix.
 Linear regression assumptions (LINE): 1) Linearity, 2) Independence of errors, 3) Normality of errors, 4) Equal variances. Tests of assumptions: i) plot each feature on the x-axis vs y_error, ii) plot y_predicted on the x-axis vs y_error, iii) histogram of errors

Means
 Arithmetic
 Geometric: used in finance to calculate average growth rates and is referred to as the compounded annual growth rate
 Harmonic: used in finance to average multiples like the price-earnings ratio because it gives equal weight to each data point. Averaging these ratios with an arithmetic mean would overweight the high ratios, since price-earnings ratios are not price-normalized while the earnings are equalized
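The three means above, sketched with Python's statistics module (the growth factors and P/E ratios are made-up example numbers):

```python
from statistics import mean, geometric_mean, harmonic_mean

growth_factors = [1.10, 1.05, 0.95]           # hypothetical annual growth factors
cagr_factor = geometric_mean(growth_factors)  # compounding-consistent average growth

pe_ratios = [10, 20, 40]             # hypothetical price-earnings multiples
avg_pe = harmonic_mean(pe_ratios)    # equal weight per data point
arith_pe = mean(pe_ratios)           # would overweight the high ratios
```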
 Overfitting, bias-variance and learning curves. Overfitting (high variance) options: more data, increase regularization, or decrease model complexity
 Overspecified model: can be used for prediction of the label, but should not be used to ascribe the effect of a feature on the label
 PCA: transform data using k vectors that minimize the perpendicular distance to points. PCA can also be thought of as an eigenvalue/eigenvector decomposition. Intuition paper
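A PCA sketch with scikit-learn on synthetic correlated 2-D data; the first component should capture nearly all the variance (the data is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic 2-D data lying almost entirely along one direction
t = rng.normal(size=300)
X = np.column_stack([t, 2 * t + rng.normal(scale=0.1, size=300)])

pca = PCA(n_components=2).fit(X)
explained = pca.explained_variance_ratio_  # fraction of variance per component
```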
 Pearson’s correlation coefficient
 Receiver operating characteristic (ROC): relates true positive rate (y-axis) and false positive rate (x-axis). A confusion matrix defines TPR = TP / (TP + FN) and FPR = FP / (FP + TN)
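The TPR/FPR definitions above, computed from hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts on a validation set
TP, FN, FP, TN = 80, 20, 10, 90

TPR = TP / (TP + FN)  # true positive rate, y-axis of the ROC curve
FPR = FP / (FP + TN)  # false positive rate, x-axis of the ROC curve
```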
 Probability distributions
 Naive Bayes
 Normal Equation
 Random Forests: each tree is built using a sample of rows (with replacement) from training set. + Less prone to overfitting
 Ridge Regression regularization: imposes a penalty on the size of the coefficients
 R²: proportion of variance in the label explained by a linear model. Can be 0 for non-linear relationships. Never worsens with more features
 Sample variance: divided by n − 1 to achieve an unbiased estimator, because one degree of freedom is used to estimate the sample mean
 SMOTE algorithm is parameterized with k_neighbors. Generate a new point on the vector between a minority-class point and one of its k nearest neighbors, placed a random fraction in [0, 1] of the way from the original point
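A sketch of SMOTE's interpolation step (the helper smote_point and the sample points are hypothetical; real SMOTE first finds the k nearest minority-class neighbors):

```python
import numpy as np

def smote_point(x, neighbor, rng):
    # Synthetic sample a random fraction of the way along the segment
    # from x to one of its nearest minority-class neighbors
    lam = rng.uniform(0.0, 1.0)
    return x + lam * (neighbor - x)

rng = np.random.default_rng(0)
x = np.array([0.0, 0.0])
neighbor = np.array([1.0, 2.0])
synthetic = smote_point(x, neighbor, rng)  # lies on the segment between the two
```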
 Sorting algorithms
 SQL tricks
 window functions, e.g. ROW_NUMBER() OVER (PARTITION BY ...)
 COALESCE(): evaluates its arguments in order and returns the first expression that does not evaluate to NULL
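A quick COALESCE check using Python's built-in sqlite3 (the literal arguments are arbitrary):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# COALESCE scans left to right and returns the first non-NULL argument
value = conn.execute(
    "SELECT COALESCE(NULL, NULL, 'fallback', 'ignored')"
).fetchone()[0]
```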
 SVD (Singular Value Decomposition) + PCA intuition
 SVM: Effective in high dimensional spaces (or when number of dimensions > number of examples). SVMs do not directly provide probability estimates
 Stochastic gradient descent
 T-test
<li><a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html#plotting-validation-curves" target="_blank">Validation curve</a>
<a href="/theme/images/1*HVM4sFhGDTNE40xr5aVCiQ.png.png"><img src="/theme/images/1*HVM4sFhGDTNE40xr5aVCiQ.png.png" alt="validation curve example" style="width: 100%" loading="lazy"></a>