Statistics: Estimation & Inference
Probability runs forwards — from a known model to the data it produces. Statistics runs backwards — from the data you actually have to the model that produced it. That reverse direction is the whole job.
- Studied: Statistics: estimation & inference, Bachelor of Science · Data Science core
- When: UniMelb, 2019–2022
- Applied in: A/B testing · intelligence reporting
- Read / Refreshed: ~16 min read · 2026-06-25
You never get to see the whole population. You get a sample — 1,000 customers out of millions, last month's tickets, the people who answered the survey — and you have to say something trustworthy about the whole from that sliver. Statistical inference is the discipline of doing that honestly: drawing conclusions about a population from a sample, and being precise about how uncertain those conclusions are.
This page builds on probability — which gave us distributions and the Central Limit Theorem — and turns it around. Probability asks "given this coin is fair, what will I see?" Statistics asks the harder, more useful question: "given what I saw, is this coin fair?"
01
The inverse problem
The cleanest way to hold the two fields apart: probability reasons from model to data, statistics reasons from data to model.
- Probability (forward). Known model → predict the data. "A fair die: P(two sixes in a row) = 1/36."
- Statistics (inverse). Observed data → infer the model. "I rolled twenty sixes in a row — is this die fair?"
The inverse direction is harder because many models could have produced the same data, and randomness means even a fair process throws up strange samples. So inference is never about certainty — it's about quantifying how much the data should move your conclusion, and how much doubt remains.
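The two directions can be put side by side in a few lines. This is a minimal sketch, not a formal test: the loaded die with P(six) = 0.9 is a hypothetical competing model chosen purely for illustration.

```python
from fractions import Fraction

# Forward (probability): the model is known, so the answer is exact.
p_six = Fraction(1, 6)
p_two_sixes = p_six ** 2  # the 1/36 quoted above

# Inverse (statistics): the data are known and the model is in question.
# How probable are twenty sixes in a row under each candidate model?
n_rolls = 20
lik_fair = float(p_six ** n_rolls)  # fair die
lik_loaded = 0.9 ** n_rolls         # hypothetical loaded die, P(six) = 0.9

# How much better the loaded model explains the same data.
likelihood_ratio = lik_loaded / lik_fair
print(p_two_sixes, lik_fair, lik_loaded, likelihood_ratio)
```

Neither likelihood is zero, which is the point: the fair die could have produced this data, it is just an extremely poor explanation compared with the alternative.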
02
Samples and the standard error
A statistic is any number computed from a sample — the sample mean x̄, a proportion, a correlation. The key realisation that unlocks all of inference: a statistic is itself random. Draw a different sample and you'd get a slightly different mean. The distribution of a statistic across all possible samples is its sampling distribution.
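Its spread, meaning how much your estimate jumps around from sample to sample, is the standard error. For a sample mean it shrinks with the square root of the sample size:
SE(x̄) = σ / √n
where σ is the population standard deviation (in practice estimated by the sample standard deviation) and n is the sample size.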
That √n is one of the most important facts in applied stats: to halve your uncertainty you need four times the data, not twice. It's why early samples improve an estimate fast and later ones barely move it — and why "just collect more data" has sharply diminishing returns.
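A small simulation makes the square-root rule concrete. This is a sketch under assumed values: a normal population with mean 50 and standard deviation 10, and a hypothetical helper name `simulated_standard_error`. It draws many samples of size n, takes each sample's mean, and measures how much those means spread.

```python
import math
import random
import statistics

def simulated_standard_error(n, sigma=10.0, n_samples=1000, seed=0):
    """Spread of the sample mean across many simulated samples of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(50.0, sigma) for _ in range(n))
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)

for n in (25, 100, 400):
    simulated = simulated_standard_error(n)
    theoretical = 10.0 / math.sqrt(n)  # sigma / sqrt(n)
    print(f"n={n:4d}  simulated SE={simulated:.3f}  sigma/sqrt(n)={theoretical:.3f}")
```

Each fourfold increase in n should roughly halve the simulated standard error, in line with σ / √n.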
03
Point estimation
A point estimate is a single best guess at an unknown population value (a parameter) — the sample mean estimating the population mean. We judge estimators by two properties:
- Unbiased — right on average. Across many samples the estimates centre on the true value rather than systematically over- or under-shooting.
- Consistent — it converges to the truth as the sample grows (the Law of Large Numbers at work).
The workhorse method for building good estimators is Maximum Likelihood Estimation (MLE): pick the parameter values that make the observed data most probable. The question it answers is "under which parameter values would the data I saw have been most likely?" MLE is the engine inside logistic regression, most of classical modelling, and, not coincidentally, a lot of machine learning, where the loss function is often just a negative log-likelihood in disguise.
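As an illustration, here is a minimal MLE sketch for the simplest possible model, a single success rate p. The data (37 successes in 100 trials) are hypothetical, and the grid search stands in for the numerical optimiser a real library would use.

```python
import math

def bernoulli_log_likelihood(p, successes, trials):
    """Log-probability of the observed counts if the true success rate were p."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)

successes, trials = 37, 100  # hypothetical data: 37 conversions in 100 visits

# Crude grid search over candidate values of p strictly between 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, successes, trials))

print(p_hat)  # lands on the sample proportion, successes / trials
```

Maximising the log-likelihood is the same as minimising its negative, which is why a negative log-likelihood can serve directly as a loss function.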
04
Confidence intervals
A point estimate alone is overconfident: it hides how much the answer could have wobbled. A confidence interval attaches a range, built from the standard error:
95% CI = x̄ ± 1.96 × SE
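The 1.96 comes straight from the normal curve: 95% of a bell's mass lies within 1.96 standard deviations of centre. But the interpretation is the most misunderstood idea in statistics: the 95% describes the procedure, not any single interval. If you repeated the sampling many times and built an interval each time, about 95% of those intervals would contain the true value. It does not say there is a 95% probability that the parameter lies inside the one interval you happened to compute.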
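The "95% of the time" reading can be checked by simulation. The sketch below assumes a normal population whose true mean (50) is known to the simulation but not to the analyst, builds the interval x̄ ± 1.96 × SE from each of many samples, and counts how often the interval contains the truth. The function name and all parameter values are illustrative.

```python
import math
import random
import statistics

def interval_covers_truth(rng, true_mean=50.0, sigma=10.0, n=100):
    """Draw one sample, build mean +/- 1.96 * SE, report whether it contains true_mean."""
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean - 1.96 * se <= true_mean <= mean + 1.96 * se

rng = random.Random(0)
n_experiments = 2000
coverage = sum(interval_covers_truth(rng) for _ in range(n_experiments)) / n_experiments
print(f"{coverage:.3f}")  # expected to land near 0.95
```

Any individual interval either contains the true mean or it does not. The 95% is a property of the recipe across repetitions.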
05
Hypothesis testing
Hypothesis testing is a formal way to ask "is this effect real, or could it just be noise?" The structure is deliberately conservative, like a courtroom that presumes innocence:
- State a null hypothesis H₀: the boring default, "no effect", "the coin is fair", "the new design changed nothing".
- State an alternative H₁: "there is an effect".
- Compute a test statistic measuring how far the data sit from what H₀ predicts.
- Compute the p-value and compare it to a threshold α (usually 0.05).
The p-value is the single most abused number in science, so be exact about it: it is the probability of seeing data at least this extreme if the null hypothesis were true. A small p-value means the data would be surprising under "no effect"; when it falls below α, you reject H₀. It is not the probability that H₀ is true.
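The definition translates almost word for word into code. This is a sketch of an exact two-sided binomial test for the fair-coin null, written from scratch for clarity rather than taken from a statistics library. The experiment (60 heads in 100 flips) is hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """P(exactly k heads in n flips) for a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(heads, flips):
    """P(a result at least as far from flips/2 as the observed one | coin is fair)."""
    observed_distance = abs(heads - flips / 2)
    return sum(
        binomial_pmf(k, flips)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_distance
    )

p_value = two_sided_p_value(heads=60, flips=100)  # hypothetical experiment
alpha = 0.05
print(f"p = {p_value:.4f}, reject H0: {p_value < alpha}")
```

Sixty heads feels lopsided, yet the p-value comes out slightly above 0.05, so at this threshold the fair-coin null is not rejected.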
06
Type I, Type II, and power
Because inference works from limited data, you will sometimes be wrong in two distinct ways:
- Type I error (false positive): rejecting a true null. You declared an effect that isn't there. Its rate is α, the threshold you chose.
- Type II error (false negative): failing to reject a false null. There was a real effect and you missed it. Its rate is β.
A test's power is 1 − β: the chance of catching an effect that's genuinely there. At a fixed sample size the tension is permanent: tighten α to avoid false alarms and you raise β, missing more real effects. The main lever that improves both is sample size, which is exactly what a power analysis computes before you run a study.
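A rough power calculation shows both effects. The sketch below uses the normal approximation for a two-sided, one-sample z-test. The function name `z_test_power` and the effect size and standard deviation are assumptions chosen for illustration, not values from any real study.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test to detect a true mean shift of `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5 / sigma  # true effect measured in standard errors
    # Probability the test statistic lands in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

# More data means more power for the same effect and the same alpha.
for n in (25, 100, 400):
    print(n, round(z_test_power(effect=2.0, sigma=10.0, n=n), 3))

# Tightening alpha at a fixed n lowers power, i.e. raises beta.
print(z_test_power(2.0, 10.0, 100, alpha=0.05) > z_test_power(2.0, 10.0, 100, alpha=0.01))
```

With a true effect of zero the same function returns α, which is the Type I error rate.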
07
The multiple-comparisons trap
If you test one hypothesis at α = 0.05, there's a 5% chance of a false positive. Test twenty independent hypotheses and the chance that at least one lights up by pure luck is about 64%. Run enough tests and you're almost guaranteed a "significant" result that means nothing.
This is p-hacking (or data dredging): slicing the data many ways, trying many variables, and reporting only the comparison that crossed 0.05. It's usually not fraud — it's the natural result of looking hard and stopping at the first win. The defences are real: decide your hypotheses before looking, correct the threshold when you run many tests (e.g. Bonferroni: divide α by the number of tests), and hold out data to confirm a finding you discovered.
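The 64% figure and the effect of a Bonferroni correction both follow from one line of arithmetic. This sketch assumes the tests are independent and that every null hypothesis is true. The function name is illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) across independent tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests

alpha, n_tests = 0.05, 20

uncorrected = familywise_error_rate(alpha, n_tests)
bonferroni = familywise_error_rate(alpha / n_tests, n_tests)  # per-test threshold alpha / n_tests

print(round(uncorrected, 3), round(bonferroni, 3))
```

With the corrected per-test threshold, the chance of any false positive across all twenty tests stays just under the original 5%.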
08
Frequentist vs Bayesian
Everything above is the frequentist tradition: parameters are fixed-but-unknown, probability is long-run frequency, and you reason about the procedure (p-values, confidence intervals). It's the default in most fields and most A/B testing.
The Bayesian alternative treats the unknown parameter as itself having a probability distribution. You start with a prior, apply Bayes' rule with the data's likelihood, and get a posterior — a full distribution of belief. Its credible interval means the intuitive thing people wrongly want a confidence interval to mean: "95% probability the parameter is in here." Bayesian methods shine with small data, prior knowledge worth encoding, or when you need to act on a probability directly. Neither school is "right" — they answer slightly different questions, and a good analyst uses both.
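The prior-to-posterior update is easiest to see with a success rate, where a Beta prior and binomial data combine into a Beta posterior. This is a minimal sketch: the flat Beta(1, 1) prior and the data (12 successes in 40 trials) are assumptions, and the credible interval is approximated by sampling rather than computed exactly.

```python
import random

# Hypothetical data: 12 conversions in 40 trials, with a flat Beta(1, 1) prior.
prior_a, prior_b = 1.0, 1.0
successes, trials = 12, 40

# Beta prior + binomial likelihood gives a Beta posterior (conjugacy).
post_a = prior_a + successes
post_b = prior_b + (trials - successes)
posterior_mean = post_a / (post_a + post_b)

# Approximate the central 95% credible interval by sampling the posterior.
rng = random.Random(0)
draws = sorted(rng.betavariate(post_a, post_b) for _ in range(20000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]

print(f"posterior mean={posterior_mean:.3f}, 95% credible interval=({lower:.3f}, {upper:.3f})")
```

Given the model and the prior, this interval does carry the direct reading: a 95% posterior probability that the rate lies between `lower` and `upper`.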
09
Where it shows up in my work
10