Statistics: Estimation & Inference

Probability runs forwards — from a known model to the data it produces. Statistics runs backwards — from the data you actually have to the model that produced it. That reverse direction is the whole job.

Studied: Statistics, estimation & inference · Bachelor of Science · Data Science core
When: UniMelb, 2019–2022
Applied in: A/B testing · intelligence reporting
Read / Refreshed: ~16 min read · 2026-06-25

You never get to see the whole population. You get a sample — 1,000 customers out of millions, last month's tickets, the people who answered the survey — and you have to say something trustworthy about the whole from that sliver. Statistical inference is the discipline of doing that honestly: drawing conclusions about a population from a sample, and being precise about how uncertain those conclusions are.

This page builds on probability — which gave us distributions and the Central Limit Theorem — and turns it around. Probability asks "given this coin is fair, what will I see?" Statistics asks the harder, more useful question: "given what I saw, is this coin fair?"

01

The inverse problem

The cleanest way to hold the two fields apart: probability reasons from model to data, statistics reasons from data to model.

  • Probability (forward). Known model → predict the data. "A fair die: P(two sixes in a row) = 1/36."
  • Statistics (inverse). Observed data → infer the model. "I rolled twenty sixes in a row — is this die fair?"

The inverse direction is harder because many models could have produced the same data, and randomness means even a fair process throws up strange samples. So inference is never about certainty — it's about quantifying how much the data should move your conclusion, and how much doubt remains.

02

Samples and the standard error

A statistic is any number computed from a sample: the sample mean x̄, a proportion, a correlation. The key realisation that unlocks all of inference: a statistic is itself random. Draw a different sample and you'd get a slightly different mean. The distribution of a statistic across all possible samples is its sampling distribution.

Its spread — how much your estimate jumps around from sample to sample — is the standard error. For a sample mean it shrinks with the square root of the sample size:

SE(x̄) = σ / √n

That √n is one of the most important facts in applied stats: to halve your uncertainty you need four times the data, not twice. It's why early samples improve an estimate fast and later ones barely move it — and why "just collect more data" has sharply diminishing returns.

[Figure: sampling distributions of x̄ for small n and large n, both centred on the true mean.]
The sampling distribution of the mean narrows as n grows. Each curve is the spread of x̄ over many samples; quadrupling n halves the standard error.
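
The √n rule is easy to check by simulation. The sketch below is illustrative only: the function names, σ = 10, the sample sizes and the seed are assumptions, not anything from the original course material. It compares σ / √n with the observed spread of x̄ across many simulated samples.

```python
import math
import random
import statistics


def theoretical_se(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def simulated_se(sigma: float, n: int, trials: int = 2000, seed: int = 0) -> float:
    """Standard deviation of the sample mean across many simulated samples."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


# Quadrupling n (25 -> 100 -> 400) should halve the standard error each time.
for n in (25, 100, 400):
    print(n, theoretical_se(10.0, n), round(simulated_se(10.0, n), 2))
```

The theoretical column is 2.0, 1.0, 0.5; the simulated column should land close to it, with some noise of its own because it is also an estimate.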

03

Point estimation

A point estimate is a single best guess at an unknown population value (a parameter) — the sample mean estimating the population mean. We judge estimators by two properties:

  • Unbiased — right on average. Across many samples the estimates centre on the true value rather than systematically over- or under-shooting.
  • Consistent — it converges to the truth as the sample grows (the Law of Large Numbers at work).
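
A minimal simulation sketch of those two properties, using an example that is not from this page: the variance estimator that divides by n (biased) against the one that divides by n − 1 (unbiased). The function name, sample sizes and seed are illustrative assumptions.

```python
import random


def average_variance_estimates(n: int, trials: int = 20000, seed: int = 1):
    """Average both variance estimators over many samples drawn from N(0, 1)."""
    rng = random.Random(seed)
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        sum_sq = sum((x - mean) ** 2 for x in sample)
        biased_total += sum_sq / n          # divides by n
        unbiased_total += sum_sq / (n - 1)  # divides by n - 1
    return biased_total / trials, unbiased_total / trials


# The true variance is 1 in both cases.
for n in (5, 50):
    biased, unbiased = average_variance_estimates(n)
    print(f"n={n}: divide by n -> {biased:.3f}, divide by n - 1 -> {unbiased:.3f}")
```

With n = 5 the divide-by-n estimator averages about 0.8, not 1, because its expected value is (n − 1)/n times the true variance. At n = 50 the gap has nearly closed: the estimator is biased but still consistent, which is why the two properties are judged separately.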

The workhorse method for building good estimators is Maximum Likelihood Estimation (MLE): pick the parameter values that make the observed data most probable. "Under which parameter values would the data I saw have been most likely?" MLE is the engine inside logistic regression, most of classical modelling, and, not coincidentally, a lot of machine learning, where the loss function is often just a negative log-likelihood in disguise.
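
To make "most probable" concrete, here is a small sketch of MLE for a conversion rate. The data (130 conversions in 1,000 visitors) and the function names are hypothetical, and the brute-force grid search is only for illustration; real fitting code would use calculus or an optimiser.

```python
import math


def bernoulli_log_likelihood(p: float, successes: int, trials: int) -> float:
    """Log-probability of the observed outcomes if the true success rate were p."""
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)


def mle_by_grid(successes: int, trials: int, steps: int = 999) -> float:
    """Return the p on an even grid over (0, 1) with the highest log-likelihood."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda p: bernoulli_log_likelihood(p, successes, trials))


# Hypothetical data: 130 conversions out of 1,000 visitors.
p_hat = mle_by_grid(successes=130, trials=1000)
print(p_hat)       # grid-search estimate
print(130 / 1000)  # closed-form MLE for a rate: the sample proportion
```

The grid search lands on the sample proportion, which is the known closed-form answer for this model. Flip the sign of `bernoulli_log_likelihood` and you have the log loss that a classifier minimises.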

04

Confidence intervals

A point estimate alone is overconfident — it hides how much the answer could have wobbled. A confidence interval attaches a range, built from the standard error:

x̄ ± 1.96 · SE(x̄) (95% CI)

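The 1.96 comes straight from the normal curve: 95% of a bell's mass lies within 1.96 standard deviations of centre. But the interpretation is the most misunderstood idea in statistics. The 95% describes the procedure, not any one interval: repeat the sampling many times and about 95% of the intervals built this way will cover the true value. It does not say there is a 95% probability that the parameter lies inside the particular interval you computed.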
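
One way to see what the 95% actually refers to is a coverage simulation: build the interval on many fresh samples from a population whose mean you control, and count how often the interval contains it. This is a sketch under assumed values (true mean 50, σ = 10, n = 100, fixed seed), with the sample standard deviation standing in for the unknown σ.

```python
import math
import random
import statistics


def mean_ci_95(sample):
    """Normal-approximation 95% interval for the mean: x_bar +/- 1.96 * SE."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # s estimates the unknown sigma
    return x_bar - 1.96 * se, x_bar + 1.96 * se


def coverage(true_mean=50.0, sigma=10.0, n=100, trials=2000, seed=2):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = mean_ci_95(sample)
        hits += low <= true_mean <= high
    return hits / trials


print(coverage())  # expected to be close to 0.95
```

Each individual interval either contains the true mean or it doesn't; the 95% is the long-run hit rate of the recipe.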

05

Hypothesis testing

Hypothesis testing is a formal way to ask "is this effect real, or could it just be noise?" The structure is deliberately conservative, like a courtroom that presumes innocence:

  1. State a null hypothesis H₀ — the boring default, "no effect", "the coin is fair", "the new design changed nothing".
  2. State an alternative H₁ — "there is an effect".
  3. Compute a test statistic measuring how far the data sit from what H₀ predicts.
  4. Compute the p-value and compare it to a threshold α (usually 0.05).

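The p-value is the single most abused number in science, so be exact about it: it is the probability of seeing data at least this extreme if the null hypothesis were true. It is not the probability that H₀ is true. When p falls below α, the data would be surprising under "no effect", so you reject H₀; when it doesn't, you fail to reject H₀, which is not the same as showing there is no effect.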
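
The four steps can be sketched end to end for the fair-coin case with an exact binomial test. The data (60 heads in 100 flips) and the function names are hypothetical. "At least this extreme" is taken here to mean "every outcome no more probable under H₀ than the one observed", which is one common two-sided convention, not the only one.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success rate p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def two_sided_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Exact binomial p-value: total probability under H0 of all outcomes
    that are no more likely than the observed one."""
    observed = binomial_pmf(heads, flips, p_null)
    total = 0.0
    for k in range(flips + 1):
        prob = binomial_pmf(k, flips, p_null)
        if prob <= observed * (1 + 1e-9):  # small tolerance for float ties
            total += prob
    return total


alpha = 0.05                               # threshold chosen in advance
p_value = two_sided_p_value(heads=60, flips=100)
print(round(p_value, 4))
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Sixty heads in a hundred looks lopsided, yet the p-value comes out at roughly 0.057, just above 0.05, so this test does not reject a fair coin.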

06

Type I, Type II, and power

Because inference works from limited data, you will sometimes be wrong in two distinct ways:

  • Type I error (false positive) — rejecting a true null. You declared an effect that isn't there. Its rate is α, the threshold you chose.
  • Type II error (false negative) — failing to reject a false null. There was a real effect and you missed it. Its rate is β.

A test's power is 1 − β: the chance of catching an effect that's genuinely there. The tension is permanent — tighten α to avoid false alarms and you raise β, missing more real effects. The main lever that improves both is sample size, which is exactly what a power analysis computes before you run a study.

[Figure: two overlapping curves, H₀ (left) and H₁ (right), split by a decision threshold, with regions α and β shaded.]
The two error types. Under H₀ (left) the shaded tail past the threshold is the Type I rate α: false positives. Under H₁ (right) the overlap below the threshold is the Type II rate β: missed real effects. Power is the rest of the H₁ curve.
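
A power analysis can be sketched with the normal curve alone. The version below assumes the simplest setting, a one-sided z-test on a mean with known σ, and the effect size of 0.2 standard deviations is a made-up input; real A/B tests usually need a two-sample or proportion-based variant of the same idea.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def power_one_sided_z(effect: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Power (1 - beta) of a one-sided z-test to detect a mean shift of `effect`."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)  # rejection threshold under H0
    shift = effect * math.sqrt(n) / sigma     # where H1 centres the test statistic
    return 1.0 - STD_NORMAL.cdf(z_crit - shift)


def sample_size_for_power(effect: float, sigma: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Smallest n that reaches the target power in the same one-sided z-test."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / effect) ** 2)


n_needed = sample_size_for_power(effect=0.2, sigma=1.0)
print(n_needed, round(power_one_sided_z(0.2, 1.0, n_needed), 3))
# Tightening alpha at the same n lowers power, i.e. raises beta.
print(round(power_one_sided_z(0.2, 1.0, n_needed, alpha=0.01), 3))
```

With no real effect (`effect=0`) the "power" collapses to α, which is just the false-positive rate again.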

07

The multiple-comparisons trap

If you test one true null hypothesis at α = 0.05, there's a 5% chance of a false positive. Test twenty independent hypotheses, all of them null, and the chance that at least one lights up by pure luck is 1 − 0.95²⁰ ≈ 64%. Run enough tests and you're almost guaranteed a "significant" result that means nothing.

This is p-hacking (or data dredging): slicing the data many ways, trying many variables, and reporting only the comparison that crossed 0.05. It's usually not fraud — it's the natural result of looking hard and stopping at the first win. The defences are real: decide your hypotheses before looking, correct the threshold when you run many tests (e.g. Bonferroni: divide α by the number of tests), and hold out data to confirm a finding you discovered.
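
The arithmetic behind the trap and the Bonferroni fix is short enough to write out. This sketch assumes the tests are independent and every null is true; the function names are illustrative.

```python
def family_wise_error_rate(alpha: float, tests: int) -> float:
    """Chance of at least one false positive across independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** tests


def bonferroni_threshold(alpha: float, tests: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / tests


print(round(family_wise_error_rate(0.05, 20), 3))       # 0.642
corrected = bonferroni_threshold(0.05, 20)
print(f"{corrected:.4f}")                               # 0.0025
print(round(family_wise_error_rate(corrected, 20), 3))  # 0.049
```

The correction pulls the family-wise rate back under 5%, at the cost of a much stricter bar for each individual test, and therefore lower power.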

08

Frequentist vs Bayesian

Everything above is the frequentist tradition: parameters are fixed-but-unknown, probability is long-run frequency, and you reason about the procedure (p-values, confidence intervals). It's the default in most fields and most A/B testing.

The Bayesian alternative treats the unknown parameter as itself having a probability distribution. You start with a prior, apply Bayes' rule with the data's likelihood, and get a posterior — a full distribution of belief. Its credible interval means the intuitive thing people wrongly want a confidence interval to mean: "95% probability the parameter is in here." Bayesian methods shine with small data, prior knowledge worth encoding, or when you need to act on a probability directly. Neither school is "right" — they answer slightly different questions, and a good analyst uses both.
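
For a conversion rate the prior-to-posterior update has a well-known closed form: a Beta(a, b) prior combined with k successes in n trials gives a Beta(a + k, b + n − k) posterior. The sketch below assumes a flat Beta(1, 1) prior and hypothetical data (13 conversions in 100 visitors), and reads the credible interval off Monte Carlo draws so that it needs only the standard library.

```python
import random


def beta_posterior(prior_a: float, prior_b: float, successes: int, trials: int):
    """Beta prior + binomial data -> Beta posterior parameters."""
    return prior_a + successes, prior_b + (trials - successes)


def credible_interval_95(a: float, b: float, draws: int = 20000, seed: int = 3):
    """Approximate central 95% credible interval from posterior samples."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    return samples[round(0.025 * draws)], samples[round(0.975 * draws)]


# Hypothetical data: 13 conversions in 100 visitors, flat Beta(1, 1) prior.
a, b = beta_posterior(1.0, 1.0, successes=13, trials=100)
low, high = credible_interval_95(a, b)
print(f"posterior: Beta({a:g}, {b:g}), mean {a / (a + b):.3f}")
print(f"95% credible interval: ({low:.3f}, {high:.3f})")
```

Given the model and the prior, this interval does carry the direct reading: 95% posterior probability that the rate lies inside it. A different prior would shift it, which is the price of that interpretation.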

09

Where it shows up in my work

10

Refresh in 60 seconds