Statistical Modelling

The linear regression page built one powerful model — but it assumes the outcome is a continuous number with normally distributed error. Real outcomes break that constantly: a yes/no decision, a count of events, a rate. Statistical modelling is the framework that keeps the interpretable, linear core of regression while extending it to all of those — a single, unifying idea called the generalised linear model.

This is the statistician's answer to "model anything", and it's the deliberate counterpoint to the machine learning view: where ML optimises for prediction, statistical modelling prizes understanding — coefficients you can interpret and inferences you can defend. Here's how one elegant structure covers an enormous range of data.

Beyond the straight line

Ordinary linear regression makes two assumptions that often don't hold: that the outcome can be any real number, and that its error is normal with constant variance. Try to use it where they fail and it misbehaves — predict a probability and it cheerfully returns 1.4 or −0.3; model a count and it can predict negative events.
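A toy illustration of that failure, as a minimal sketch: the eight data points are made up, and the fit is the textbook one-predictor least-squares formula rather than a call into any library.

```python
# Least-squares fit of a 0/1 outcome on one predictor (made-up toy data).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form OLS slope and intercept for a single predictor.
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean


def predict(x):
    """Straight-line prediction, with nothing keeping it inside [0, 1]."""
    return intercept + slope * x


for x in (0, 4.5, 10):
    print(x, round(predict(x), 3))
```

On this toy data the line predicts roughly -0.36 at x = 0 and roughly 1.55 at x = 10, neither of which is a valid probability.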

The fix isn't a different model for every case — it's one framework that bends to fit. The insight of the GLM is to keep the familiar linear combination of predictors at the core, but connect it to the outcome through two flexible pieces: a choice of distribution for the outcome, and a link that translates between the linear predictor and that distribution's scale.

The generalised linear model

A GLM is built from three components, and once you see them you can construct a model for almost any outcome:

Random component: the probability distribution of the outcome (Normal for continuous, Binomial for yes/no, Poisson for counts). This is your choice about what kind of data you have.
Systematic component: the familiar linear predictor $\eta = X\boldsymbol{\beta}$, a weighted sum of the features. Unchanged from linear regression.
Link function: a function $g$ connecting the mean of the outcome to the linear predictor.

$$g\big(\mathbb{E}[y]\big) = \eta = X\boldsymbol{\beta}$$

The link g connects the outcome's mean to the linear predictor η = Xβ. Pick the distribution and the link, and you have a model.

The link is the clever part. Instead of modelling the mean directly (which might be bounded, like a probability in $[0,1]$), you model a transformed mean that can range freely over all real numbers, so the linear predictor is never forced to produce an impossible value. Choose the distribution and the link to match your outcome, and the same machinery fits it. Ordinary linear regression is just the special case: Normal distribution, identity link $g(\mu) = \mu$.
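The pairing of distribution and link can be sketched in a few lines. This is only an illustration: the `FAMILIES` table and the `predicted_mean` helper are names invented here, not part of any library.

```python
import math


def identity(mu):
    return mu


def logit(p):
    # Maps a probability in (0, 1) onto the whole real line.
    return math.log(p / (1 - p))


def inv_logit(eta):
    # Logistic function: maps any real number back into (0, 1).
    return 1 / (1 + math.exp(-eta))


# (link g, inverse link) for each outcome distribution.
FAMILIES = {
    "normal": (identity, identity),
    "binomial": (logit, inv_logit),
    "poisson": (math.log, math.exp),
}


def predicted_mean(family, eta):
    """Turn a linear predictor eta into a mean on the outcome's own scale."""
    _, inverse_link = FAMILIES[family]
    return inverse_link(eta)


for eta in (-3.0, 0.0, 3.0):
    print(eta, {name: round(predicted_mean(name, eta), 4) for name in FAMILIES})
```

Whatever value the linear predictor takes, the binomial mean stays strictly between 0 and 1 and the Poisson mean stays positive. Only the identity link passes the raw value through.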

The GLM's three parts. Features feed a linear predictor (η = Xβ); the link function maps it onto the mean of a chosen outcome distribution. Swapping the distribution + link gives logistic, Poisson, and ordinary regression from one structure.

Logistic regression

The most-used GLM models a binary outcome — yes/no, click/no-click, default/repay. The outcome is Binomial, and the natural link is the logit (the log-odds), which stretches a probability in $[0,1]$ out onto the whole real line:

$$\log\!\left(\frac{p}{1-p}\right) = X\boldsymbol{\beta}$$

Run it backwards (the inverse link is the S-shaped logistic function) and any linear predictor maps to a valid probability between 0 and 1, so there are no more impossible predictions. The coefficients have a clean reading too: each $\beta_j$ is the change in log-odds per unit of $x_j$ with the other predictors held fixed, and $e^{\beta_j}$ is an odds ratio: "this factor multiplies the odds by 1.5". It's the workhorse classifier of statistics, and the bridge to the classification models on the ML page.
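To make the odds-ratio reading concrete, here is a small sketch with hypothetical coefficients. The values of `beta_0` and `beta_1` are invented, with `beta_1` chosen so that its exponential is exactly 1.5.

```python
import math


def inv_logit(eta):
    # Inverse of the logit link: linear predictor -> probability.
    return 1 / (1 + math.exp(-eta))


def odds(p):
    return p / (1 - p)


# Hypothetical coefficients (not estimated from any data).
beta_0 = -2.0
beta_1 = math.log(1.5)

p_at_0 = inv_logit(beta_0 + beta_1 * 0)
p_at_1 = inv_logit(beta_0 + beta_1 * 1)
p_at_2 = inv_logit(beta_0 + beta_1 * 2)

# One extra unit of x multiplies the odds by exp(beta_1), wherever you start.
odds_ratio = odds(p_at_1) / odds(p_at_0)

print(round(odds_ratio, 6))
print(round(p_at_1 - p_at_0, 4), round(p_at_2 - p_at_1, 4))
```

The odds ratio is the same at every starting point, but the change in probability is not: the two printed differences disagree. That is why the effect is reported on the odds scale.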

Poisson regression

For count outcomes — number of support tickets, accidents per intersection, visits per patient — the outcome is Poisson and the link is the log:

$$\log(\lambda) = X\boldsymbol{\beta}$$

Modelling the log of the expected count keeps predictions positive (a count can never be negative) and makes the coefficients multiplicative: $e^{\beta_j}$ is the factor by which the rate multiplies per unit of the predictor. Same three-part recipe, different distribution and link — and that's the whole point of the framework. (When counts are more variable than Poisson allows — overdispersion — you reach for the negative-binomial cousin, but the structure is identical.)
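A short sketch of both points, under stated assumptions: the coefficients and the counts are made up, and the variance-to-mean ratio is only a crude first look at overdispersion because it ignores any predictors.

```python
import math

# Hypothetical coefficients, e.g. tickets per day against some predictor x.
beta_0 = 0.5
beta_1 = 0.2


def expected_count(x):
    # Inverse of the log link: lambda = exp(eta), which is always positive.
    return math.exp(beta_0 + beta_1 * x)


# One extra unit of x multiplies the expected count by exp(beta_1).
rate_ratio = expected_count(4) / expected_count(3)

# Crude overdispersion check on made-up counts: a Poisson variable has
# variance equal to its mean, so a ratio far above 1 is a warning sign.
counts = [0, 1, 1, 2, 9, 0, 3, 12]
mean = sum(counts) / len(counts)
variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
dispersion = variance / mean

print(round(rate_ratio, 4), round(math.exp(beta_1), 4))
print(round(dispersion, 2))
```

Here the ratio of variance to mean is well above 1, the kind of pattern that would point towards a negative-binomial model.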

Fitting and likelihood

Outside the Normal, identity-link case, you can't fit a GLM with the tidy closed-form formula that ordinary least squares enjoys. Instead you use maximum likelihood, the same principle from the statistics page: choose the coefficients that make the observed data most probable under the model. In general there's no algebraic solution, so it's found numerically by an iterative routine (usually iteratively reweighted least squares, which solves a sequence of weighted least-squares problems), but conceptually it's simple: turn the dial on $\boldsymbol{\beta}$ until the data looks as likely as possible.
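The iterative routine can be sketched for the smallest interesting case: a logistic regression with an intercept and one predictor, on made-up data. This is an illustrative implementation under those assumptions, not how any particular package does it. The function name `fit_logistic_irls` is invented here, and real software adds safeguards such as step control and separation checks.

```python
import math


def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))


def fit_logistic_irls(xs, ys, max_iter=50, tol=1e-10):
    """Fit intercept b0 and slope b1 of a one-predictor logistic regression."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x
            mu = inv_logit(eta)
            w = mu * (1 - mu)        # weight: variance of y at the current fit
            z = eta + (y - mu) / w   # working response
            s_w += w
            s_wx += w * x
            s_wxx += w * x * x
            s_wz += w * z
            s_wxz += w * x * z
        # Solve the 2x2 weighted least-squares normal equations for z on x.
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        change = abs(new_b0 - b0) + abs(new_b1 - b1)
        b0, b1 = new_b0, new_b1
        if change < tol:
            break
    return b0, b1


# Made-up data with overlapping classes, so a finite maximum exists.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_irls(xs, ys)
print(round(b0, 3), round(b1, 3))
```

Each pass refits a weighted straight line to a "working response", and the coefficients stop moving once the likelihood is at its maximum. At that point the residuals `y - mu` sum to zero, both on their own and when weighted by `x`.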

The payoff of the likelihood approach is that it comes with a full inferential toolkit for free: standard errors, confidence intervals, and tests for each coefficient, exactly as on the regression page — so a fitted GLM tells you not just the effect sizes but how sure you can be of them.
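As a sketch of what that toolkit produces, suppose a fitted logistic model reported a coefficient and its standard error. Both numbers below are hypothetical. The usual Wald test and 95% interval follow from a normal approximation.

```python
import math

# Hypothetical output from a fitted logistic regression.
beta_hat = 0.405   # estimated log-odds coefficient
std_err = 0.150    # its standard error

# Wald statistic: how many standard errors the estimate is from zero.
z = beta_hat / std_err

# Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
p_value = math.erfc(abs(z) / math.sqrt(2))

# Approximate 95% confidence interval on the log-odds scale ...
z_crit = 1.96
ci_low = beta_hat - z_crit * std_err
ci_high = beta_hat + z_crit * std_err

# ... and the same interval expressed as an odds ratio.
odds_ratio_ci = (math.exp(ci_low), math.exp(ci_high))

print(round(z, 2), round(p_value, 4))
print(tuple(round(v, 3) for v in odds_ratio_ci))
```

With these made-up numbers the interval for the odds ratio sits entirely above 1, which is the "how sure can you be" statement in a form a decision-maker can read.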

Model selection

With a framework this flexible, the danger is building a model that's too complex — fitting the noise, the overfitting problem again. You need a principled way to compare models that rewards fit but penalises complexity. The standard tool is the Akaike Information Criterion:

$$\text{AIC} = 2k - 2\ln(\hat{L})$$

Here $\ln(\hat{L})$ measures how well the model fits (the maximised log-likelihood) and $k$ is the number of estimated parameters, so AIC trades goodness-of-fit against complexity, and lower is better. Adding a useless predictor improves fit a little but costs $2$ in the penalty, so AIC only keeps it if it raises the log-likelihood by more than $1$. The close relative BIC replaces the $2k$ penalty with $k\ln(n)$ for sample size $n$, which is harsher than AIC once $n \ge 8$ and so favours simpler models. Both are formal expressions of Occam's razor, the same parsimony instinct as regularisation in a different guise.
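A worked comparison, using invented log-likelihoods for two models fitted to the same hypothetical 200 observations. The BIC formula used is the standard $k\ln(n) - 2\ln(\hat{L})$.

```python
import math


def aic(log_lik, k):
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    return k * math.log(n) - 2 * log_lik


n = 200  # hypothetical sample size, shared by both models

# Hypothetical fits: the larger model adds one predictor for a tiny gain.
small = {"log_lik": -120.0, "k": 3}
large = {"log_lik": -119.5, "k": 4}

aic_small = aic(small["log_lik"], small["k"])
aic_large = aic(large["log_lik"], large["k"])
bic_small = bic(small["log_lik"], small["k"], n)
bic_large = bic(large["log_lik"], large["k"], n)

print(aic_small, aic_large)
print(round(bic_small, 2), round(bic_large, 2))
```

The extra predictor lifts the log-likelihood by only 0.5, less than the 1 it needs to pay for itself, so AIC prefers the smaller model (246 against 247). BIC prefers it by a wider margin.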

Diagnostics and fit

A fitted GLM still needs checking. The analogue of the residual sum of squares is the deviance: a measure, built from the likelihood, of how far the model's fit falls short of a perfect (saturated) one. Lower deviance is better fit, and the drop in deviance between two nested models gives a formal likelihood-ratio test of whether an added term helps. As on the regression page, you also inspect residuals (specially defined for GLMs, such as deviance or Pearson residuals) for leftover patterns the model missed, and watch for influential points distorting the fit. The discipline is the same: the model isn't done until you've looked at what it got wrong.
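A deviance comparison can be sketched without a fitting routine by choosing a case where the maximum-likelihood fits are known in closed form: a Poisson model with one overall mean, against one with a separate mean per group. The counts are made up, and the chi-square reference distribution with one degree of freedom is the usual large-sample approximation.

```python
import math


def poisson_deviance(ys, mus):
    """2 * sum[y * log(y / mu) - (y - mu)], with y * log(y / mu) = 0 when y = 0."""
    total = 0.0
    for y, mu in zip(ys, mus):
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += term - (y - mu)
    return 2 * total


# Made-up counts for two groups.
group_a = [2, 3, 1, 4, 2]
group_b = [5, 7, 6, 4, 8]
ys = group_a + group_b

# Null model: one shared mean. Alternative: one mean per group.
overall_mean = sum(ys) / len(ys)
mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)

dev_null = poisson_deviance(ys, [overall_mean] * len(ys))
dev_alt = poisson_deviance(ys, [mean_a] * len(group_a) + [mean_b] * len(group_b))

# The drop in deviance is the likelihood-ratio statistic (1 extra parameter).
drop = dev_null - dev_alt

# Upper tail of a chi-square with 1 degree of freedom: P(X > drop).
p_value = math.erfc(math.sqrt(drop / 2))

print(round(dev_null, 3), round(dev_alt, 3), round(drop, 3), round(p_value, 4))
```

A large drop in deviance relative to the extra parameter, as here, is evidence that the group term earns its place.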

When data has structure

GLMs assume observations are independent — but often they're not. Repeated measurements on the same patient, students within the same school, readings from the same sensor: these are grouped, and ignoring that structure understates your uncertainty. Mixed-effects (or hierarchical) models extend the framework with random effects — group-level terms that let each cluster have its own adjustment while still sharing overall structure. It's how you honestly model nested, correlated data, and it connects directly to the Bayesian hierarchical view. The unifying message: pick the distribution, link, and grouping that match how the data was actually generated.
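How much uncertainty is understated can be gauged with a standard approximation for equal-sized clusters, the design effect $1 + (m-1)\rho$, where $m$ is the cluster size and $\rho$ is the within-cluster (intraclass) correlation. The sketch below uses invented values for both, and it applies to a simple mean rather than a full mixed-effects fit.

```python
import math


def design_effect(cluster_size, icc):
    # Approximate variance inflation for equal-sized clusters: 1 + (m - 1) * rho.
    return 1 + (cluster_size - 1) * icc


# Hypothetical study: 1000 students, 20 per school, modest clustering.
n_total = 1000
cluster_size = 20
icc = 0.1

deff = design_effect(cluster_size, icc)

# Clustered observations carry the information of fewer independent ones.
effective_n = n_total / deff

# Factor by which a naive standard error (assuming independence) is too small.
se_inflation = math.sqrt(deff)

print(round(deff, 2), round(effective_n, 1), round(se_inflation, 3))
```

Even a modest correlation of 0.1 makes these 1000 clustered observations worth roughly 345 independent ones. That gap is what random effects are there to account for.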

Where it shows up in my work

The interpretable workhorse for real outcomes

Real outcomes are rarely tidy continuous numbers, and GLMs are how I model the ones that aren't. Logistic regression for a yes/no outcome — will this case escalate, did this intervention work — is a constant, precisely because its odds ratios are something I can put in front of a decision-maker and explain. Poisson models for counts and rates show up wherever the question is "how often". The framing that matters: statistical modelling optimises for interpretation and inference, not raw prediction — so when the goal is to understand and defend a relationship rather than just forecast it, this is the right tool, and a black-box model is the wrong one.

It also ties the statistics pages together: it generalises linear regression, runs on maximum likelihood, and shares its parsimony logic with both regularisation and the Bayesian view.

Refresh in 60 seconds

Linear regression assumes a continuous, normal outcome. GLMs generalise it to counts, yes/no, and rates with one framework.
Three parts: a distribution (random), the linear predictor $\eta = X\boldsymbol{\beta}$ (systematic), and a link $g(\mathbb{E}[y]) = \eta$.
Logistic: Binomial + logit link $\log\frac{p}{1-p}=X\boldsymbol{\beta}$ → probabilities & odds ratios. Poisson: log link $\log\lambda=X\boldsymbol{\beta}$ → counts.
Fit by maximum likelihood (iterative); get standard errors & tests for free.
Compare models with AIC $=2k-2\ln\hat{L}$ / BIC (fit vs complexity, lower is better). Check deviance & residuals.
Grouped/correlated data → mixed-effects (random effects). Statistical modelling prizes interpretation over prediction.