Linear Statistical Models

If you could keep only one statistical model, it should be the linear one. Regression is among the most-used tools in applied data work, not because it's the most powerful, but because it's interpretable, fast, well-understood, and a genuinely strong baseline. It is also where the foundations meet: the linear algebra of projection, the probability of the error term, and the statistics of inference.

The danger with regression is that it's so easy to run that people skip understanding it. This page is the antidote: not how to fit a line, but what the line means, when it's trustworthy, and how to tell when it isn't.

The model everyone reaches for

Linear regression answers a deceptively rich question: how does an outcome y change as some inputs x change, on average — and how sure are we? Predicting house prices from size and location, sales from ad spend, risk from a handful of indicators: all the same shape. Its appeal is that, unlike a black-box model, every coefficient is a sentence you can say out loud — "an extra bedroom adds about $40k, holding location fixed."

The linear model

The model assumes the outcome is a weighted sum of the inputs, plus random error. In matrix form — stacking all observations — it's compact:

\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}

Here y is the vector of outcomes, X is the design matrix (one row per observation, one column per feature plus a column of ones for the intercept), β is the vector of coefficients we want to learn, and ε is the error term — everything the features don't explain. "Linear" refers to being linear in the coefficients; you can still fit curves by adding x² or interaction columns to X, which is what makes it far more flexible than it first looks.
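
As a minimal sketch of what "linear in the coefficients" means in practice, the snippet below builds a design matrix with an intercept column and an added x² column, then fits a curve with an ordinary linear solver. The helper name `design_matrix` and the toy data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def design_matrix(x, degree=1):
    """Return columns [1, x, x^2, ..., x^degree] for a single feature x."""
    x = np.asarray(x, dtype=float)
    # Column 0 is the intercept (all ones); column k holds x**k.
    return np.column_stack([x**k for k in range(degree + 1)])

# Toy data (assumed for illustration): a curve generated without noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 0.5 * x**2

X = design_matrix(x, degree=2)       # shape (4, 3): ones, x, x^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X)
print(beta_hat)                      # close to [1.0, 2.0, 0.5]
```

The model is still linear: the curve comes entirely from the extra column in X, while the coefficients enter as a plain weighted sum.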

Ordinary least squares

To fit the model you need the β that makes the line sit closest to the data. Ordinary least squares (OLS) defines "closest" as minimising the sum of squared residuals — the vertical gaps between each point and the line. Squaring punishes big misses hard and makes the maths clean; setting the derivative to zero gives a closed-form answer:

\hat{\boldsymbol{\beta}} = (X^{\top}X)^{-1} X^{\top}\mathbf{y}

This is one of the few models in statistics with an exact, one-shot solution: no gradient descent required (though you can use it, and iterative solvers are often the practical choice when the data is too large to form or factor XᵀX). Notice the (XᵀX)⁻¹: if one feature is an exact linear combination of the others, for example two perfectly correlated features, XᵀX is not invertible. This is the same rank problem from the linear algebra page, surfacing here as perfect multicollinearity.
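
The closed form can be sketched in a few lines of NumPy. The simulated data, the true coefficients, and the function name `ols_normal_equations` are assumptions made for illustration. The sketch solves the normal equations rather than forming the inverse explicitly, which is usually the more numerically stable choice, and then shows how a redundant column makes the design matrix rank-deficient.

```python
import numpy as np

def ols_normal_equations(X, y):
    """Solve (X^T X) beta = X^T y without forming the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Simulated data (assumed): intercept plus two independent features.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = ols_normal_equations(X, y)
# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)

# Add a column that is an exact multiple of x1: perfect multicollinearity.
X_bad = np.column_stack([X, 2.0 * x1])
rank_bad = np.linalg.matrix_rank(X_bad)
print(X_bad.shape[1], rank_bad)      # 4 columns, but rank 3
```

With the redundant column, XᵀX is singular, so the closed form has no unique solution even though the fitted values are still well defined.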

Figure: OLS fits the line that minimises the total squared length of the residuals, the vertical gaps from each point to the line.

The geometry of OLS

The formula hides a beautiful geometric truth that ties straight back to linear algebra. Think of the outcome y as a single point in a high-dimensional space. All the outcomes the model can produce — every Xβ — form a flat subspace (the column space of X). Usually y doesn't lie in that subspace; there's no perfect fit.

OLS finds the point in the subspace closest to y — and the closest point is the orthogonal projection of y onto it. The prediction ŷ is that projection, and the residual y − ŷ is perpendicular to the subspace. That's why least squares works: minimising squared distance is dropping a perpendicular. The whole method is the projection from the linear algebra page, wearing a statistics hat.
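
The projection picture can be checked numerically. The sketch below uses random data (an assumption for illustration) and verifies three things: the residual is perpendicular to every column of X, the "hat" matrix H = X(XᵀX)⁻¹Xᵀ behaves like a projection, and the squared lengths obey Pythagoras.

```python
import numpy as np

# Random design and outcome (assumed); y is not in the column space of X.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat                 # the projection of y onto col(X)
residual = y - y_hat

# Orthogonality: X^T (y - y_hat) is zero up to rounding error.
orth = X.T @ residual
print(np.max(np.abs(orth)))

# Hat matrix H = X (X^T X)^{-1} X^T maps y to y_hat.
H = X @ np.linalg.solve(X.T @ X, X.T)
print(np.allclose(H @ H, H))         # projecting twice changes nothing
print(np.allclose(H @ y, y_hat))

# Pythagoras: |y|^2 = |y_hat|^2 + |residual|^2.
print(np.isclose(y @ y, y_hat @ y_hat + residual @ residual))
```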

The assumptions

OLS always returns a line, but its guarantees — and the validity of every p-value it produces — rest on assumptions, the Gauss-Markov conditions:

Linearity — the true relationship really is linear in the coefficients.
Independence — the errors don't depend on each other (violated by time series and clustered data).
Homoskedasticity — the errors have constant variance, not fanning out as x grows.
No perfect multicollinearity — no feature is an exact combination of others (so XᵀX inverts).

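When these hold, and the errors also average zero given the features (exogeneity, the remaining Gauss-Markov condition), OLS is BLUE: the Best Linear Unbiased Estimator, the lowest-variance unbiased linear estimator there is. Add the assumption that errors are normally distributed and the t-tests and confidence intervals below become exactly valid in finite samples; without normality they still hold approximately in large samples. Knowing these is what separates "I ran a regression" from "I trust this regression".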
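
A small simulation can make "unbiased" concrete. The sketch below assumes a fixed design, a known true β, and independent, zero-mean, constant-variance errors, then refits the model on many simulated samples. All numbers are illustrative assumptions. Under those conditions the estimates should average out to the true coefficients, and their spread should match σ²(XᵀX)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 50, 2000
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, 2.0])
sigma = 1.0

# Each column of E is one sample of errors: independent, mean zero,
# constant variance (the conditions the text lists).
E = rng.normal(scale=sigma, size=(n, reps))
Y = (X @ beta_true)[:, None] + E

# Fit all simulated regressions at once; B has shape (2, reps).
B = np.linalg.solve(X.T @ X, X.T @ Y)

mean_beta = B.mean(axis=1)           # should sit near beta_true
emp_var = B.var(axis=1)              # spread of the estimates
theo_var = sigma**2 * np.diag(np.linalg.inv(X.T @ X))
print(mean_beta, emp_var, theo_var)
```

Any single fit can miss by a fair margin; unbiasedness is a statement about the average over repeated samples, not about one regression.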

Reading the coefficients

Each coefficient βⱼ has a precise meaning: the expected change in y for a one-unit increase in xⱼ, holding all other features fixed. That "holding others fixed" clause is the quiet superpower of multiple regression: it estimates each effect controlling for the rest, which is how you separate genuine drivers from confounders, provided the confounders are actually in the model.
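
To see what "holding others fixed" buys you, here is a hedged simulation in which a confounder z drives both x and y. The data-generating numbers and the helper `fit` are assumptions chosen for illustration. Regressing y on x alone gives a slope that absorbs part of z's effect; adding z as a column recovers something close to the true effect of x.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
z = rng.normal(size=n)                       # confounder
x = 0.8 * z + rng.normal(size=n)             # x is partly driven by z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true effect of x is 1.0

def fit(X, y):
    """Least-squares coefficients for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

ones = np.ones(n)
slope_short = fit(np.column_stack([ones, x]), y)[1]     # z omitted
slope_long = fit(np.column_stack([ones, x, z]), y)[1]   # z controlled for
print(slope_short, slope_long)
```

The first slope lands well above 1.0 because x is standing in for the omitted z; the second is close to 1.0. The control only works for confounders that are measured and included.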

Inference and fit

Because the coefficients are estimated from a sample, they're uncertain, and the tools from the statistics page apply directly. Each β̂ⱼ comes with a standard error; a t-test asks whether it's distinguishable from zero (its p-value), and a confidence interval gives its plausible range. A coefficient that looks big but has a standard error of similar size is not reliable signal: its confidence interval includes zero.
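
The standard errors and t-statistics follow from the usual formula Var(β̂) = σ²(XᵀX)⁻¹, with σ² estimated from the residuals. The sketch below is one way to compute them by hand, on simulated data in which one feature matters and one is pure noise (both assumptions for illustration). The confidence interval uses the normal quantile 1.96 as a large-sample approximation instead of the exact t quantile.

```python
import numpy as np

def ols_inference(X, y):
    """Return coefficients, standard errors, and t-statistics."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)         # unbiased error-variance estimate
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, se, beta / se

# Simulated data (assumed): x_real affects y, x_noise does not.
rng = np.random.default_rng(3)
n = 500
x_real = rng.normal(size=n)
x_noise = rng.normal(size=n)
X = np.column_stack([np.ones(n), x_real, x_noise])
y = 2.0 + 1.5 * x_real + rng.normal(size=n)

beta, se, t = ols_inference(X, y)
# Approximate 95% intervals (assumption: n is large enough for 1.96).
ci_low, ci_high = beta - 1.96 * se, beta + 1.96 * se
for name, b, s, tv in zip(["intercept", "x_real", "x_noise"], beta, se, t):
    print(f"{name:10s} beta={b: .3f} se={s:.3f} t={tv: .2f}")
```

The real feature should show a large t-statistic, while the noise feature's estimate sits within a couple of standard errors of zero.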

For overall fit, R² reports the share of the variance in y the model explains:

R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}

R² of 0 means the model does no better than predicting the mean; 1 means a perfect fit. But beware: R² never falls when you add features, even useless ones, so for model comparison use adjusted R² (which penalises extra terms). It is the same overfitting caution from the machine learning page.
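
Both quantities are short to compute. The sketch below assumes a design matrix that already includes the intercept column, and uses the common adjustment 1 − (1 − R²)(n − 1)/(n − p), where p counts all columns including the intercept. It then appends ten pure-noise columns (an illustrative assumption) to show that plain R² does not go down.

```python
import numpy as np

def r2_scores(X, y):
    """Return (R^2, adjusted R^2); X must include the intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    return r2, adj

rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
# Ten extra columns of pure noise, unrelated to y.
X_big = np.column_stack([X_small, rng.normal(size=(n, 10))])

r2_small, adj_small = r2_scores(X_small, y)
r2_big, adj_big = r2_scores(X_big, y)
print(r2_small, adj_small)
print(r2_big, adj_big)
```

Adjusted R² is not guaranteed to drop in every sample, but it charges for each extra column, so useless features tend to lower it rather than raise it.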

Diagnostics and extensions

A fitted model isn't finished until you've checked it. The single best tool is a residual plot — plot the leftover errors and look for what should not be there. A curve in the residuals means you missed non-linearity; a fanning shape means heteroskedasticity; clusters mean dependence. The residuals should look like featureless noise; any pattern is the model telling you what it got wrong.
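
A plot is the natural tool here, but the same patterns can be detected numerically. The sketch below fits a straight line to three simulated datasets (all assumed for illustration) and computes two rough signals: the correlation of the residuals with a centred x² term, which flags missed curvature, and the correlation of the absolute residuals with x, which flags fanning. These are informal checks, not formal tests.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
x = rng.uniform(0.0, 4.0, size=n)
X = np.column_stack([np.ones(n), x])         # straight-line model

def residuals(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def signals(r, x):
    """(curvature signal, fanning signal) for residuals r."""
    curve = np.corrcoef(r, (x - x.mean()) ** 2)[0, 1]
    fan = np.corrcoef(np.abs(r), x)[0, 1]
    return curve, fan

# Case 1: the truth is curved, so a line leaves a bend in the residuals.
y_curved = 1.0 + 0.5 * x**2 + rng.normal(scale=0.3, size=n)
curve_signal, _ = signals(residuals(X, y_curved), x)

# Case 2: noise grows with x (heteroskedasticity).
y_fan = 1.0 + 2.0 * x + rng.normal(size=n) * (0.2 + 0.5 * x)
_, fan_signal = signals(residuals(X, y_fan), x)

# Case 3: the model is correctly specified; both signals should be small.
y_ok = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)
ok_curve, ok_fan = signals(residuals(X, y_ok), x)

print(curve_signal, fan_signal, ok_curve, ok_fan)
```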

When the assumptions break, the model family extends to match:

Logistic regression — for a yes/no outcome, model the log-odds linearly. The gateway to classification.
Generalised linear models (GLMs) — the same linear core with a link function, covering counts (Poisson) and other non-normal outcomes.
Regularised regression — Ridge and Lasso add the penalty from the ML page to tame variance and handle correlated features.
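
As one concrete instance of these extensions, Ridge has its own closed form, (XᵀX + λI)⁻¹Xᵀy. The sketch below assumes centred features and a centred outcome, so no intercept is needed, plus a made-up pair of nearly identical features. It shows how the penalty stabilises coefficients that plain OLS (λ = 0) can barely pin down.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate; lam=0 reduces to OLS.

    Assumes X and y are already centred, so no intercept column is used.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(6)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)     # nearly a copy of x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
y = 3.0 * x1 + rng.normal(size=n)
y = y - y.mean()

beta_ols = ridge(X, y, lam=0.0)      # often large, offsetting coefficients
beta_ridge = ridge(X, y, lam=1.0)    # effect shared between the two copies
print(beta_ols, beta_ridge)
```

The two ridge coefficients should come out close to each other and sum to roughly 3, the combined effect of the two near-duplicate features, whereas the OLS pair can swing far apart from sample to sample.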