Statistics: Estimation & Inference

You never get to see the whole population. You get a sample — 1,000 customers out of millions, last month's tickets, the people who answered the survey — and you have to say something trustworthy about the whole from that sliver. Statistical inference is the discipline of doing that honestly: drawing conclusions about a population from a sample, and being precise about how uncertain those conclusions are.

This page builds on probability — which gave us distributions and the Central Limit Theorem — and turns it around. Probability asks "given this coin is fair, what will I see?" Statistics asks the harder, more useful question: "given what I saw, is this coin fair?"

The inverse problem

The cleanest way to hold the two fields apart: probability reasons from model to data, statistics reasons from data to model.

Probability (forward). Known model → predict the data. "A fair die: P(two sixes in a row) = 1/36."
Statistics (inverse). Observed data → infer the model. "I rolled twenty sixes in a row — is this die fair?"

The inverse direction is harder because many models could have produced the same data, and randomness means even a fair process throws up strange samples. So inference is never about certainty — it's about quantifying how much the data should move your conclusion, and how much doubt remains.

Samples and the standard error

A statistic is any number computed from a sample — the sample mean x̄, a proportion, a correlation. The key realisation that unlocks all of inference: a statistic is itself random. Draw a different sample and you'd get a slightly different mean. The distribution of a statistic across all possible samples is its sampling distribution.

Its spread, how much your estimate jumps around from sample to sample, is the standard error. For the mean of n independent observations from a population with standard deviation σ, it shrinks with the square root of the sample size:

\operatorname{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}

That √n is one of the most important facts in applied stats: to halve your uncertainty you need four times the data, not twice. It's why early samples improve an estimate fast and later ones barely move it — and why "just collect more data" has sharply diminishing returns.

Figure: The sampling distribution of the mean narrows as n grows. Each curve is the spread of x̄ over many samples; quadrupling n halves the standard error.
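The √n rule can be checked by simulation. The sketch below is illustrative only: it assumes a normal population with mean 50 and σ = 10 (made-up numbers), draws many samples at two sizes, and measures how much the sample mean varies. The function name `empirical_se` is hypothetical.

```python
import random
import statistics

def empirical_se(n, sigma=10.0, trials=4000, seed=0):
    """Standard deviation of the sample mean across many simulated samples of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(50.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

se_25 = empirical_se(25)     # theory: 10 / sqrt(25)  = 2.0
se_100 = empirical_se(100)   # theory: 10 / sqrt(100) = 1.0

# Quadrupling n should roughly halve the standard error.
print(round(se_25, 2), round(se_100, 2), round(se_25 / se_100, 2))
```

The simulated values should land close to σ/√n, with a ratio near 2 between the two sample sizes.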

Point estimation

A point estimate is a single best guess at an unknown population value (a parameter) — the sample mean estimating the population mean. We judge estimators by two properties:

Unbiased — right on average. Across many samples the estimates centre on the true value rather than systematically over- or under-shooting.
Consistent — it converges to the truth as the sample grows (the Law of Large Numbers at work).
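A standard illustration of bias, sketched here under assumed numbers: estimating a population variance by dividing by n undershoots on average, while dividing by n − 1 does not. The population (normal, σ = 2) and the sample size of 5 are made up for the demonstration.

```python
import random
import statistics

rng = random.Random(3)
true_var = 4.0           # variance of the assumed gauss(0, 2) population
n, trials = 5, 10000

biased, unbiased = [], []
for _ in range(trials):
    sample = [rng.gauss(0.0, 2.0) for _ in range(n)]
    biased.append(statistics.pvariance(sample))    # divides by n
    unbiased.append(statistics.variance(sample))   # divides by n - 1

mean_biased = statistics.fmean(biased)       # expected near (n - 1) / n * 4.0 = 3.2
mean_unbiased = statistics.fmean(unbiased)   # expected near 4.0
print(round(mean_biased, 2), round(mean_unbiased, 2))
```

Both estimators are consistent, since the gap between dividing by n and by n − 1 vanishes as n grows, but only the second is unbiased at every sample size.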

The workhorse method for building good estimators is Maximum Likelihood Estimation (MLE): pick the parameter values that make the observed data most probable. "Of all the candidate parameter values, which would have made this data most likely?" MLE is the engine inside logistic regression, most of classical modelling, and, not coincidentally, a lot of machine learning, where the loss function is often just a negative log-likelihood in disguise.
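As a small sketch of the idea, consider a coin with unknown heads probability p. The data below (14 heads in 20 flips) are made up, and the grid search is just one simple way to find the maximiser. For this model the MLE also has a closed form, the sample proportion, so the two can be compared.

```python
import math

def log_likelihood(p, heads, flips):
    """Log-likelihood of a Bernoulli(p) coin given the observed counts."""
    return heads * math.log(p) + (flips - heads) * math.log(1.0 - p)

heads, flips = 14, 20   # made-up data for illustration

# Grid search over candidate values of p strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, heads, flips))

print(p_hat)           # grid maximiser of the log-likelihood
print(heads / flips)   # closed-form MLE: the sample proportion
```

Minimising the negative of `log_likelihood` gives the same answer, which is the sense in which many machine-learning losses are likelihoods in disguise.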

Confidence intervals

A point estimate alone is overconfident — it hides how much the answer could have wobbled. A confidence interval attaches a range, built from the standard error:

\bar{x} \pm 1.96 \cdot \operatorname{SE}(\bar{x}) \quad (\text{95\% CI})

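The 1.96 comes straight from the normal curve: 95% of a bell's mass lies within 1.96 standard deviations of centre. But the interpretation is the most misunderstood idea in statistics. The 95% describes the procedure, not any one interval: if you repeated the sampling many times and built an interval each time, about 95% of those intervals would contain the true value. It does not mean there is a 95% probability that the parameter lies inside the particular interval you computed.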
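The 95% is easiest to see as a property of the procedure under repeated sampling. The simulation below is a sketch under assumptions: a normal population with a known σ (so that 1.96 applies exactly), made-up values for the mean and sample size, and hypothetical function names.

```python
import math
import random
import statistics

def ci_95(sample, sigma):
    """95% interval for the mean, using a known population sigma."""
    mean = statistics.fmean(sample)
    half_width = 1.96 * sigma / math.sqrt(len(sample))
    return mean - half_width, mean + half_width

def coverage(true_mean=50.0, sigma=10.0, n=40, trials=5000, seed=1):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = ci_95(sample, sigma)
        hits += low <= true_mean <= high
    return hits / trials

cover = coverage()
print(cover)   # should sit close to 0.95
```

Any single interval either contains the true mean or it doesn't; the 95% is the long-run hit rate across intervals built this way.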

Hypothesis testing

Hypothesis testing is a formal way to ask "is this effect real, or could it just be noise?" The structure is deliberately conservative, like a courtroom that presumes innocence:

State a null hypothesis H₀ — the boring default, "no effect", "the coin is fair", "the new design changed nothing".
State an alternative H₁ — "there is an effect".
Compute a test statistic measuring how far the data sit from what H₀ predicts.
Compute the p-value and compare it to a threshold α (usually 0.05).

The p-value is the single most abused number in science, so be exact about it: it is the probability of seeing data at least this extreme if the null hypothesis were true. It is not the probability that the null hypothesis is true. A small p-value means the data would be surprising under "no effect"; when it falls below α, you reject H₀.
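The four steps can be sketched for the coin example. This assumes made-up data (16 heads in 20 flips), uses the number of heads as the test statistic, and computes an exact two-sided binomial p-value by summing every outcome that is no more likely than the observed one under H₀. The function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips with heads probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(heads, flips, p_null=0.5):
    """Probability, under H0, of an outcome at least as unlikely as the observed one."""
    observed = binom_pmf(heads, flips, p_null)
    return sum(
        binom_pmf(k, flips, p_null)
        for k in range(flips + 1)
        # small tolerance so exact ties (e.g. 4 vs 16 heads) are both counted
        if binom_pmf(k, flips, p_null) <= observed * (1 + 1e-9)
    )

alpha = 0.05
p_value = two_sided_p_value(heads=16, flips=20)   # H0: the coin is fair
print(round(p_value, 4), p_value < alpha)
```

With these made-up counts the p-value falls below 0.05, so at that threshold the fair-coin null would be rejected.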

Type I, Type II, and power

Because inference works from limited data, you will sometimes be wrong in two distinct ways:

Type I error (false positive) — rejecting a true null. You declared an effect that isn't there. Its rate is α, the threshold you chose.
Type II error (false negative) — failing to reject a false null. There was a real effect and you missed it. Its rate is β.

A test's power is 1 − β: the chance of catching an effect that's genuinely there. The tension is permanent: at a fixed sample size, tightening α to avoid false alarms raises β, so you miss more real effects. The main lever that improves both is sample size, which is exactly what a power analysis computes before you run a study.

Figure: The two error types. Under H₀ (left) the shaded tail past the threshold is the Type I rate α, the false positives. Under H₁ (right) the overlap below the threshold is the Type II rate β, the missed real effects. Power is the rest of the H₁ curve.
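The trade-off between α, β and sample size can be made concrete with a small power calculation. This is a sketch for one simple case only: a one-sided z-test with known σ, with a made-up effect size of 2 and σ of 10. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()   # standard normal

def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test when the true mean shift is `effect`."""
    threshold = Z.inv_cdf(1 - alpha)        # rejection cutoff under H0, in SE units
    shift = effect / (sigma / sqrt(n))      # distance of H1 from H0, in SE units
    return 1 - Z.cdf(threshold - shift)     # mass of the H1 curve past the cutoff

for alpha in (0.05, 0.01):
    for n in (25, 100):
        power = power_one_sided_z(effect=2.0, sigma=10.0, n=n, alpha=alpha)
        print(f"alpha={alpha}  n={n}  power={power:.3f}  beta={1 - power:.3f}")
```

Reading the output across rows shows both effects: a stricter α lowers power at a given n, and a larger n raises power at a given α.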

The multiple-comparisons trap

If you test one hypothesis at α = 0.05 and the null is true, there's a 5% chance of a false positive. Test twenty independent hypotheses, all with true nulls, and the chance that at least one lights up by pure luck is 1 − 0.95²⁰ ≈ 64%. Run enough tests and you're almost guaranteed a "significant" result that means nothing.

This is p-hacking (or data dredging): slicing the data many ways, trying many variables, and reporting only the comparison that crossed 0.05. It's usually not fraud — it's the natural result of looking hard and stopping at the first win. The defences are real: decide your hypotheses before looking, correct the threshold when you run many tests (e.g. Bonferroni: divide α by the number of tests), and hold out data to confirm a finding you discovered.
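The arithmetic behind the trap, and behind the Bonferroni fix, fits in a few lines. The sketch assumes the tests are independent and every null is true; the function name is illustrative.

```python
def familywise_error_rate(alpha, tests):
    """Chance of at least one false positive across independent tests with true nulls."""
    return 1 - (1 - alpha) ** tests

alpha, tests = 0.05, 20

uncorrected = familywise_error_rate(alpha, tests)
bonferroni = familywise_error_rate(alpha / tests, tests)   # per-test threshold 0.0025

print(round(uncorrected, 3))   # 0.642
print(round(bonferroni, 3))    # 0.049
```

Dividing α by the number of tests brings the chance of any false positive back under the original 5%, at the cost of less power for each individual test.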

Frequentist vs Bayesian

Everything above is the frequentist tradition: parameters are fixed-but-unknown, probability is long-run frequency, and you reason about the procedure (p-values, confidence intervals). It's the default in most fields and most A/B testing.

The Bayesian alternative treats the unknown parameter as itself having a probability distribution. You start with a prior, apply Bayes' rule with the data's likelihood, and get a posterior — a full distribution of belief. Its credible interval means the intuitive thing people wrongly want a confidence interval to mean: "95% probability the parameter is in here." Bayesian methods shine with small data, prior knowledge worth encoding, or when you need to act on a probability directly. Neither school is "right" — they answer slightly different questions, and a good analyst uses both.
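A minimal Bayesian sketch for a coin's heads probability, under stated assumptions: made-up data (14 heads in 20 flips) and a uniform Beta(1, 1) prior. With a Beta prior and binomial data the posterior is again a Beta distribution, so the update is just addition. The credible interval here is approximated from random posterior draws rather than computed exactly.

```python
import random

heads, flips = 14, 20           # made-up data
prior_a, prior_b = 1.0, 1.0     # uniform Beta(1, 1) prior, an assumption

# Conjugate update: add heads to a, tails to b.
post_a = prior_a + heads
post_b = prior_b + (flips - heads)
posterior_mean = post_a / (post_a + post_b)

# Approximate the central 95% credible interval from posterior draws.
rng = random.Random(2)
draws = sorted(rng.betavariate(post_a, post_b) for _ in range(20000))
low = draws[int(0.025 * len(draws))]
high = draws[int(0.975 * len(draws))]

print(round(posterior_mean, 3), round(low, 2), round(high, 2))
```

Unlike a confidence interval, this range can be read directly: given the prior and the data, the posterior puts about 95% of its probability on p lying between `low` and `high`.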

Where it shows up in my work

The discipline of honest conclusions

Inference is the difference between "the numbers went up" and "the numbers went up by more than noise would explain". Reading an A/B test is hypothesis testing end to end — null of "no difference", a test statistic, a p-value, and the discipline to report the effect size and confidence interval, not just whether it cleared 0.05. In intelligence and government reporting, the multiple-comparisons trap is a constant risk — slice any rich dataset enough ways and something looks alarming — so pre-committing to questions and quoting uncertainty is what keeps a brief trustworthy.

The habit this builds is the one that matters most downstream: state the estimate with its uncertainty, distinguish significant from important, and be honest about how many things you tried before you found the one worth reporting.