Statistical Machine Learning

Machine learning is what you do when the rules are too complex to write by hand. Instead of programming the answer, you show a model many examples and let it infer the pattern — then you hope it works on examples it has never seen. That last clause is the entire discipline: not fitting the data you have, but generalising to the data you don't.

This is the advanced page that pulls the whole foundation together. It runs on linear algebra (the data and the models are vectors and matrices), probability and statistics (every prediction is uncertain, every model is estimated), and calculus (training is minimising a loss). Here we assemble them into the thing that learns.

What 'learning' means

Learning here has a precise meaning: improving at a task as you see more data, measured by some performance metric. The field splits by what the data looks like:

Supervised learning — you have labelled examples (input → correct answer) and learn to predict the label. Classification predicts a category (spam / not-spam); regression predicts a number (house price). The bulk of applied ML.
Unsupervised learning — no labels, just structure to find: clustering groups similar points, dimensionality reduction (like PCA) compresses them.
Reinforcement learning — an agent learns by acting and receiving rewards. Different enough to leave for its own page.

The learning problem

Stripped to its skeleton, supervised learning is three choices:

A hypothesis space — the family of functions you'll consider (all straight lines, all trees of depth 5, all neural nets of a given shape). This is your model choice.
A loss function — how wrong a single prediction is (squared error for regression, cross-entropy for classification).
An optimiser — the search for the function in that space with the lowest total loss, usually by gradient descent.
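The three choices can be made concrete in a few lines. The following is a minimal sketch, assuming NumPy and a hypothetical toy dataset (a noisy line with slope 3 and intercept 1); the function names, learning rate, and step count are illustrative choices, not part of the original text.

```python
import numpy as np

# Hypothetical toy data: y = 3x + 1 plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

# 1. Hypothesis space: all straight lines f(x) = w * x + b.
def predict(w, b, x):
    return w * x + b

# 2. Loss function: squared error, averaged over the sample.
def mean_squared_error(w, b, x, y):
    return np.mean((predict(w, b, x) - y) ** 2)

# 3. Optimiser: plain gradient descent on the two parameters (w, b).
def fit(x, y, learning_rate=0.1, steps=500):
    w, b = 0.0, 0.0
    for _ in range(steps):
        residual = predict(w, b, x) - y
        grad_w = 2.0 * np.mean(residual * x)  # d(MSE)/dw
        grad_b = 2.0 * np.mean(residual)      # d(MSE)/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = fit(x, y)
final_loss = mean_squared_error(w, b, x, y)
print(f"w={w:.3f}, b={b:.3f}, training MSE={final_loss:.4f}")
```

Swapping any one of the three pieces (a richer function family, a different loss, a different optimiser) gives a different learning algorithm while the overall shape stays the same.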

What you actually want to minimise is the risk — the expected loss on new data drawn from the real world:

R(f) = \mathbb{E}_{(x, y)}\!\left[\, L(f(x), y) \,\right]

But you can't see the whole world — only your sample. So you minimise the empirical risk, the average loss on your training set, and pray it tracks the true risk. The entire art is in making that prayer come true.
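Written out for a training set of n pairs (x_i, y_i), the empirical risk is the sample average that stands in for the expectation above. The hat notation and the index i are introduced here for illustration:

\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i)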

Generalisation, not memorisation

A model that aces the training data has proven nothing — it might have just memorised it. The only test that matters is performance on data it has never seen. So the first rule of ML is to hold out a test set and never let the model learn from it. Two failure modes bracket the goal:

Underfitting — the model is too simple to capture the pattern. High error on both training and test data. (A straight line through a curve.)
Overfitting — the model is so flexible it has fit the noise as well as the signal. Low training error, high test error. It memorised instead of learning.

Figure (three panels). Underfit (left): too rigid to follow the trend. Good fit (centre): captures the signal, ignores the wiggles. Overfit (right): contorts through every point, including the noise, and fails on new data.
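The same three regimes can be reproduced numerically. This is a rough sketch, assuming NumPy, a hypothetical sine-shaped signal with Gaussian noise, and polynomial degrees 1, 3, and 12 standing in for "too rigid", "about right", and "too flexible"; none of these specifics come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_signal(x):
    # Hypothetical ground truth: one period of a sine wave on [-1, 1].
    return np.sin(np.pi * x)

def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = true_signal(x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = sample(15)    # small training set
x_test, y_test = sample(1000)    # large held-out set, never used for fitting

def train_and_test_mse(degree):
    # Least-squares polynomial fit on the training data only.
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

results = {d: train_and_test_mse(d) for d in (1, 3, 12)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error can only fall as the degree rises, because each polynomial family contains the previous one. The held-out error is the number to watch.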

The bias–variance tradeoff

Those two failures are the two ends of the most important idea in ML. For squared-error loss, a model's expected error on a new point decomposes into three parts:

\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}

Bias: error from wrong assumptions; the model is too simple to represent the truth. High bias = underfitting.
Variance: error from sensitivity to the particular training sample; the fitted model changes wildly if you draw a different sample from the same source. High variance = overfitting.
Irreducible noise: the randomness in the world itself. No model can beat it; pretending otherwise is overfitting.

The tension is fundamental: making a model more flexible lowers bias but raises variance, and vice versa. You can't drive both to zero — you tune for the sweet spot where their sum is smallest.
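Bias and variance can be estimated directly by refitting the same model on many fresh training samples. The following is a sketch under stated assumptions: a hypothetical sine signal, NumPy polynomial fits of degree 1 (rigid) and degree 8 (flexible), 30 points per sample, and 200 repetitions. All of these numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_STD = 0.5
# Fixed points at which predictions are compared; kept slightly inside
# the sampled range so edge extrapolation does not dominate the estimate.
x_grid = np.linspace(-0.9, 0.9, 50)

def true_signal(x):
    return np.sin(np.pi * x)

def fit_on_fresh_sample(degree, n=30):
    # Draw a brand-new training set, fit, and predict on the fixed grid.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = true_signal(x) + rng.normal(scale=NOISE_STD, size=n)
    return np.polyval(np.polyfit(x, y, deg=degree), x_grid)

def bias_variance(degree, repeats=200):
    preds = np.array([fit_on_fresh_sample(degree) for _ in range(repeats)])
    mean_pred = preds.mean(axis=0)
    # Bias^2: how far the average prediction sits from the truth.
    bias_sq = np.mean((mean_pred - true_signal(x_grid)) ** 2)
    # Variance: how much predictions scatter across training samples.
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

rigid = bias_variance(degree=1)
flexible = bias_variance(degree=8)
print(f"degree 1: bias^2 {rigid[0]:.3f}, variance {rigid[1]:.3f}")
print(f"degree 8: bias^2 {flexible[0]:.3f}, variance {flexible[1]:.3f}")
```

In this setup the straight line should show the larger bias and the degree-8 polynomial the larger variance, which is the tradeoff in miniature.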

As model complexity grows, training error keeps falling, but test error falls and then rises again. The minimum of the test curve, the balance point of bias and variance, is the model you want.

Regularisation

Regularisation is the main lever for controlling that tradeoff: deliberately constrain the model so it can't contort itself to fit noise. You add a penalty on complexity to the loss, so training has to balance fitting the data against staying simple:

\min_{\theta}\ \ L(\theta;\, \text{data}) + \lambda \cdot \text{penalty}(\theta)

The strength λ is a dial from "fit hard" to "stay simple". Two classic penalties on the weights:

L2 (Ridge) — penalises the squared size of the weights, shrinking them all smoothly toward zero. Tames variance without dropping features.
L1 (Lasso) — penalises the absolute size, which drives some weights exactly to zero — doing automatic feature selection. Handy when you suspect most features are useless.
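The difference between the two penalties shows up clearly on a small example. This is a minimal sketch, assuming NumPy and hypothetical data in which only 3 of 10 features matter. Ridge is solved in closed form; the lasso is solved with proximal gradient descent (ISTA), which is one of several possible solvers and is chosen here only for brevity. The scaling of the objective and the values of λ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 features, only the first 3 carry signal.
n, d = 200, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.5, 1.0]
y = X @ true_w + rng.normal(scale=0.5, size=n)

def ridge(X, y, lam):
    """Minimise (1/2n)||Xw - y||^2 + (lam/2)||w||^2 in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def soft_threshold(v, t):
    # Shrinks every entry toward zero by t and clips small ones to exactly 0.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso(X, y, lam, steps=2000):
    """Minimise (1/2n)||Xw - y||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    # Step size 1/L, where L is the largest eigenvalue of the Hessian X^T X / n.
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

w_ols = ridge(X, y, lam=0.0)     # no penalty: ordinary least squares
w_ridge = ridge(X, y, lam=1.0)   # every weight shrunk, none exactly zero
w_lasso = lasso(X, y, lam=0.2)   # irrelevant weights driven to exactly zero

print("ridge:", np.round(w_ridge, 3))
print("lasso:", np.round(w_lasso, 3))
```

The soft-threshold step is what produces exact zeros: any weight whose pull from the data is weaker than the penalty gets clipped rather than merely shrunk.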

It's the formal version of Occam's razor: among models that fit the data, prefer the simplest, because simpler models tend to generalise better.

Cross-validation

You need an honest estimate of test performance to tune choices like λ — but every peek at the test set burns it. The fix is cross-validation: split the training data into k folds, train on k−1 and validate on the one held out, then rotate so each fold is the validation set once. Average the k scores.

This squeezes a reliable performance estimate out of limited data, and it's how you choose hyperparameters without contaminating the final test set — which stays in a vault, touched once, at the very end. The discipline here is the same one from the statistics page: never let information leak from test into training.
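The rotation described above is short enough to write by hand. The following is a sketch, assuming NumPy, ridge regression as the model being tuned, and a hypothetical dataset and candidate grid for λ; in practice a library routine would usually do this job.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution for (1/2n)||Xw - y||^2 + (lam/2)||w||^2.
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def k_fold_mse(X, y, lam, k=5, seed=0):
    """Average validation MSE of ridge regression over k folds."""
    rng = np.random.default_rng(seed)
    # Shuffle once, then cut the indices into k roughly equal folds.
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        scores.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(scores))

# Hypothetical training data: 60 rows, 20 features, 2 of them informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=60)

candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = {lam: k_fold_mse(X, y, lam) for lam in candidates}
best_lam = min(cv_scores, key=cv_scores.get)
print("chosen lambda:", best_lam)
```

Using the same seed for every candidate means all values of λ are scored on identical folds, so differences in score reflect λ rather than a lucky split. The final test set plays no part in any of this.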

The model families

A practical toolkit, from interpretable to powerful:

Linear / logistic regression: weighted sums of features. Fast, interpretable, a convex loss, and a genuinely strong baseline. Start here.
Decision trees: nested yes/no splits. Readable, but a single deep tree tends to overfit.
Ensembles: combine many models into one stronger one. Random forests average many de-correlated trees (reducing variance); gradient boosting (XGBoost, LightGBM) builds trees in sequence, each one correcting the errors of those before it, and wins a large share of tabular problems.
Support Vector Machines: find the widest-margin boundary, and via the kernel trick draw non-linear boundaries without computing the high-dimensional features explicitly.
k-Nearest Neighbours: predict from the closest training points. No training step, but prediction is slow on large datasets and accuracy degrades in high dimensions, where distances become less informative.
Neural networks: stacked non-linear layers; the dominant approach for images, text, and audio, at the cost of data, compute, and interpretability.
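Of the families above, k-nearest neighbours is the easiest to write out in full, and doing so shows why it needs no training yet is slow at prediction time: every query scans the whole training set. This is a minimal sketch, assuming NumPy, Euclidean distance, majority voting, and a hypothetical six-point dataset.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each query point by majority vote among its k nearest neighbours."""
    predictions = []
    for q in X_query:
        # Euclidean distance from this query to every training point.
        distances = np.linalg.norm(X_train - q, axis=1)
        nearest = np.argsort(distances)[:k]
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Hypothetical data: two small clusters, labelled 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.1], [1.0, 0.95]])

predictions = knn_predict(X_train, y_train, X_query, k=3)
print(predictions)  # [0 1]
```

The "model" is just the stored training data, so all the cost is paid when a prediction is requested.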

Evaluating honestly

A single accuracy number lies, especially with imbalanced classes — the lesson from the statistics and NLP pages carries straight over. Use precision, recall and F1 for classification; inspect the confusion matrix to see which errors you make; use a ROC curve / AUC to judge across thresholds; and for regression report RMSE or R².
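A small worked example makes the point about imbalanced classes. The sketch below uses only the Python standard library and a hypothetical dataset of 5 positives and 95 negatives; the two classifiers are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def report(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Define each ratio as 0.0 when its denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical imbalanced data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Classifier A never predicts the positive class.
lazy_pred = [0] * 100
# Classifier B finds 4 of the 5 positives at the cost of 6 false alarms.
useful_pred = [1] * 4 + [0] * 1 + [1] * 6 + [0] * 89

lazy = report(y_true, lazy_pred)      # accuracy 0.95, recall 0.0
useful = report(y_true, useful_pred)  # accuracy 0.93, precision 0.4, recall 0.8
print(lazy)
print(useful)
```

Classifier A has the higher accuracy and is useless; classifier B has the lower accuracy and actually finds the cases that matter.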

Above all, evaluate on data the model has never touched, match the metric to the real-world cost of each error, and remember the bias–variance lesson: the model with the best training score is rarely the one you want.

Where it shows up in my work

Refresh in 60 seconds

ML learns patterns from examples to generalise to unseen data — that, not fitting the training set, is the whole goal.
The learning problem = hypothesis space + loss + optimiser; you minimise empirical risk hoping it tracks true risk.
Underfit (too simple, high bias) vs overfit (too flexible, high variance). Error = Bias² + Variance + noise — tune for the minimum of their sum.
Regularisation (L2 shrinks, L1 selects) penalises complexity; cross-validation estimates performance and tunes hyperparameters without touching the test set.
Know the families: linear → trees → ensembles (boosting often wins on tabular data) → SVM → kNN → neural nets. No free lunch; baseline first.
Evaluate honestly on held-out data with the right metric (precision/recall/F1, AUC, RMSE) — never training accuracy alone.