Bayesian Statistics
Statistics as belief-updating. Start with what you think, see some data, and revise — by a rule that is mathematically the only consistent way to learn from evidence.
- Studied
- Bayesian Statistics, Master of Data Science
- When
- UniMelb, 2023–2024
- Applied in
- Reasoning under uncertainty
- Read / Refreshed
- ~17 min read, 2026-06-25
There are two ways to think about probability, and they lead to two whole traditions of statistics. The frequentist view says a probability is a long-run frequency — and the parameter you're trying to estimate is a fixed, unknown number. The Bayesian view says a probability is a degree of belief — and so the unknown parameter itself has a probability distribution, describing how strongly you believe each possible value. That one shift changes everything downstream.
The appeal of the Bayesian approach is that it matches how we actually reason: you hold a belief, evidence arrives, and you update. This page builds directly on Bayes' rule from the probability page and turns it from a formula into a complete philosophy of learning from data. I've kept it slow and foundational — every step is spelled out, including the algebra.
01
Probability as belief
Suppose you want to know the true conversion rate of a new web page. A frequentist treats that rate as a single fixed number and asks "what data would this number produce?" A Bayesian treats it as uncertain and describes their belief about it with a distribution — maybe "probably around 10%, but could plausibly be anywhere from 5% to 20%."
That distribution is the whole point. Instead of collapsing to a single guess, the Bayesian carries the full shape of their uncertainty through every calculation. When new data arrives, the distribution gets sharper. You never stop having a distribution — you just get more confident about where the truth sits. Three pieces of vocabulary name the stages:
- Prior — what you believe before seeing the data.
- Likelihood — how probable the observed data is, for each possible value of the parameter.
- Posterior — your updated belief after combining the two.
02
The updating engine
Bayes' rule is the machine that turns a prior into a posterior. Writing $\theta$ for the unknown parameter and $D$ for the observed data:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$
Posterior ∝ Likelihood × Prior. The denominator is just a normalising constant that makes the posterior integrate to one.
Read it as a sentence: your updated belief is your prior belief, re-weighted by how well each parameter value predicted the data you actually saw. Parameter values that made the data likely get their belief boosted; values that made it unlikely get suppressed. Because the denominator $P(D)$ doesn't depend on $\theta$, it only rescales everything to sum to one, which is why the rule is most usefully remembered in its proportional form:
$$P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$$
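The re-weighting can be watched numerically on a grid of candidate parameter values. The sketch below is only an illustration: the coin data (8 heads in 10 flips), the flat prior, the 101-point grid and the function name `grid_posterior` are all assumptions chosen for the example, with `theta` standing for the unknown parameter.

```python
# Grid approximation of Bayes' rule: posterior is proportional to likelihood x prior.
# Illustrative data (an assumption): 8 heads in 10 flips of a coin.

def grid_posterior(heads, flips, grid_size=101):
    """Return (grid, posterior) for a coin's heads-probability theta."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]   # 0.00, 0.01, ..., 1.00
    prior = [1.0] * grid_size                                # flat prior over theta
    likelihood = [t**heads * (1 - t)**(flips - heads) for t in grid]
    unnormalised = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalised)                             # the normalising constant
    posterior = [u / evidence for u in unnormalised]         # now sums to one
    return grid, posterior

grid, posterior = grid_posterior(heads=8, flips=10)
best = grid[posterior.index(max(posterior))]
print(f"most probable theta on the grid: {best:.2f}")        # 0.80
```

Dividing by `evidence` changes the height of every grid point by the same factor, so the shape of the posterior is fixed before normalising. That is the proportional form in action.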
03
Where Bayes' rule comes from
Bayes' rule isn't an extra assumption: it falls straight out of the definition of conditional probability. Start from the fact that the joint probability of two events can be factored two equivalent ways:
$$P(A \cap B) = P(A \mid B)\,P(B) \qquad \text{and} \qquad P(A \cap B) = P(B \mid A)\,P(A)$$
Both expressions equal the same joint probability, so set the right-hand sides equal and divide by $P(B)$:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Substitute $\theta$ for $A$ and the data $D$ for $B$ and you have the Bayesian engine above. The maths is elementary; the interpretation, that $P(\theta)$ is a belief you're allowed to hold and update, is the bold part.
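As a quick sanity check, the derivation can be verified on a small table of joint probabilities. The four numbers below are made up for illustration; any table that sums to one would do.

```python
import math

# Hypothetical joint probabilities for two events A and B (they sum to 1).
joint = {
    (True, True): 0.2,    # P(A and B)
    (True, False): 0.1,   # P(A and not B)
    (False, True): 0.3,   # P(not A and B)
    (False, False): 0.4,  # P(not A and not B)
}

p_a = sum(p for (a, _), p in joint.items() if a)   # marginal P(A), about 0.3
p_b = sum(p for (_, b), p in joint.items() if b)   # marginal P(B), 0.5
p_ab = joint[(True, True)]

p_a_given_b = p_ab / p_b          # definition of conditional probability
p_b_given_a = p_ab / p_a

bayes = p_b_given_a * p_a / p_b   # Bayes' rule, computed the long way round
print(math.isclose(p_a_given_b, bayes))   # True
```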
04
A worked example: the coin
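Nothing makes this concrete like watching one update happen. Suppose you have a coin and want to learn its bias $\theta$, the probability it lands heads. You flip it $n$ times and see $k$ heads.
The prior. Belief about a probability lives on the interval $[0, 1]$, and the natural distribution there is the Beta distribution, $\mathrm{Beta}(\alpha, \beta)$, whose density is proportional to $\theta^{\alpha-1}(1-\theta)^{\beta-1}$. Its two parameters act like counts of imagined prior heads and tails, so $\mathrm{Beta}(1, 1)$ is flat: "I have no idea, any bias is equally plausible."
The likelihood. The probability of seeing $k$ heads in $n$ flips, for a given bias $\theta$, is the Binomial likelihood $P(D \mid \theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}$.
The update. Multiply prior by likelihood (the proportional form, so the binomial coefficient can be dropped because it doesn't depend on $\theta$) and watch what happens to the exponents:
$$P(\theta \mid D) \propto \theta^{k}(1-\theta)^{n-k} \times \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}$$
The posterior is another Beta distribution, $\mathrm{Beta}(\alpha + k,\ \beta + n - k)$: you just add your observed heads to $\alpha$ and your observed tails to $\beta$. When the posterior has the same form as the prior like this, the prior is called conjugate, and the update is pure arithmetic. Start at $\mathrm{Beta}(1, 1)$, flip 8 heads in 10, and your belief becomes $\mathrm{Beta}(9, 3)$, with mean $9/12 = 0.75$ and its peak at $0.8$, but still honestly uncertain.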
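Because the conjugate update is just addition, it fits in a few lines of code. This is a minimal sketch of the coin example; the helper names `beta_update`, `beta_mean` and `beta_mode` are invented here for illustration.

```python
def beta_update(alpha, beta, heads, tails):
    """Conjugate update: add observed heads and tails to the Beta parameters."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_mode(alpha, beta):
    # Location of the peak; this formula is valid only when alpha > 1 and beta > 1.
    return (alpha - 1) / (alpha + beta - 2)

# Flat Beta(1, 1) prior, then 8 heads and 2 tails in 10 flips.
alpha, beta = beta_update(1, 1, heads=8, tails=2)
print(alpha, beta)               # 9 3
print(beta_mean(alpha, beta))    # 0.75
print(beta_mode(alpha, beta))    # 0.8
```

The mean sits slightly below the peak because the flat prior still contributes one imagined head and one imagined tail, pulling the estimate a little toward 0.5.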
05
Choosing a prior
The prior is the Bayesian's most powerful tool and most common criticism. It lets you fold in genuine knowledge — but it also means two analysts can reach different conclusions from the same data. How you pick it matters:
- Informative priors encode real prior knowledge ("past trials put this drug's success near 30%"). They help most when data is scarce, steadying an estimate that little data would otherwise leave wild.
- Weak / uninformative priors stay deliberately vague (a flat $\mathrm{Beta}(1, 1)$), letting the data dominate. A common honest default.
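The effect of the prior is easiest to see by running the same data through two of them. In this sketch the informative prior Beta(3, 7) (centred on 30%) and both data sets are hypothetical numbers picked for illustration, and `posterior_mean` is a helper invented for the example.

```python
def posterior_mean(prior_alpha, prior_beta, successes, failures):
    """Posterior mean of a Beta prior after a conjugate Binomial update."""
    alpha = prior_alpha + successes
    beta = prior_beta + failures
    return alpha / (alpha + beta)

informative = (3, 7)   # hypothetical prior centred on 30%
flat = (1, 1)          # uninformative default

# Same success pattern, first with scarce data and then with plenty.
for successes, failures in [(2, 1), (200, 100)]:
    m_inf = posterior_mean(*informative, successes, failures)
    m_flat = posterior_mean(*flat, successes, failures)
    trials = successes + failures
    print(f"{trials:>3} trials: informative {m_inf:.3f}, flat {m_flat:.3f}")
```

With three trials the two analysts disagree sharply (roughly 0.38 against 0.60); with three hundred they land within about a percentage point of each other. The prior matters most exactly when data is scarce.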
06
Credible vs confidence intervals
Once you have a posterior distribution, summarising it is easy and — finally — intuitive. A 95% credible interval is any range containing 95% of the posterior probability, and it means exactly what people wish a confidence interval meant:
"Given the data, there's a 95% probability the parameter is in this range" — a direct statement about the parameter. Contrast the frequentist confidence interval, whose 95% is a property of the long-run procedure, not of any single interval. The Bayesian version is what most people incorrectly assume a confidence interval already says — and getting to say it honestly is a real selling point of the approach.
07
Why it gets hard
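If Bayes' rule is so clean, why isn't everything Bayesian? The trouble is that denominator. The evidence $P(D)$ (with $D$ the data and $\theta$ the parameter) requires summing the likelihood × prior over every possible parameter value, an integral:
$$P(D) = \int P(D \mid \theta)\,P(\theta)\,d\theta$$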
For the conjugate coin it has a tidy closed form. But for a realistic model with dozens or thousands of parameters, this is a high-dimensional integral with no analytic solution and far too many points to grid out. For decades that intractable integral was the wall that kept Bayesian methods mostly theoretical. The breakthrough was to stop trying to compute it.
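For a single parameter the integral is still easy to brute-force, which makes the scaling problem visible. The sketch below approximates the evidence for the coin (8 heads in 10 flips, flat prior) with a midpoint rule and compares it to the closed form, which for a flat prior works out to $1/(n+1)$. The function name and the cell counts are assumptions made for the example.

```python
import math

def evidence_midpoint(heads, flips, n_cells=10_000):
    """Approximate P(D): the integral of likelihood x flat prior over [0, 1]."""
    width = 1.0 / n_cells
    total = 0.0
    for i in range(n_cells):
        theta = (i + 0.5) * width                      # midpoint of each cell
        likelihood = (math.comb(flips, heads)
                      * theta**heads * (1 - theta)**(flips - heads))
        prior = 1.0                                    # Beta(1, 1) density
        total += likelihood * prior * width
    return total

numeric = evidence_midpoint(heads=8, flips=10)
exact = 1 / 11                 # closed form for a flat prior: 1 / (flips + 1)
print(abs(numeric - exact) < 1e-6)   # True

# The same grid in d dimensions needs cells_per_axis ** d evaluations.
for dims in (1, 5, 20):
    print(dims, 100 ** dims)
```

Even at a coarse 100 cells per axis, twenty parameters already demand $10^{40}$ evaluations, which is why gridding stops being an option long before a model gets realistic.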
08
MCMC: sampling the posterior
The insight that made Bayesian statistics practical: you rarely need the posterior's formula — you just need to be able to draw samples from it. With enough samples you can estimate any summary you want (the mean, a credible interval) by simply measuring the sample. And you can sample a distribution even when you only know it up to that pesky constant.
Markov Chain Monte Carlo (MCMC) does exactly this. It builds a random walk through parameter space whose rule is rigged so that it lingers in high-posterior regions in proportion to their probability. The classic Metropolis-Hastings recipe is intuitive:
- Stand at the current parameter value, and propose a nearby random step.
- If the proposal has higher posterior density, move there. If lower, move there only sometimes — with probability equal to the ratio of the two densities.
- Record where you are, and repeat — for thousands of steps.
Crucially, that acceptance ratio cancels the intractable evidence $P(D)$: it appears top and bottom and divides out, so you never have to compute the integral. The collected trail of positions is a sample from the posterior. Modern samplers (Gibbs sampling, Hamiltonian Monte Carlo) and the tools built on them (Stan, PyMC) are refinements of this same idea, and they're what make Bayesian modelling usable on real problems today.
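The recipe above can be sketched in a few dozen lines for the coin example, where the exact answer (Beta(9, 3), mean 0.75) is known and the sampler can be checked against it. This is a minimal illustration, not a production sampler: it uses a symmetric Gaussian proposal (the plain Metropolis special case of Metropolis-Hastings), and the step size, chain length, burn-in and seed are arbitrary choices.

```python
import math
import random

HEADS, TAILS = 8, 2   # the coin data from the worked example

def log_unnormalised_posterior(theta):
    """log(likelihood x flat prior), with the constant P(D) left out."""
    if not 0.0 < theta < 1.0:
        return -math.inf                        # zero density outside (0, 1)
    return HEADS * math.log(theta) + TAILS * math.log(1.0 - theta)

def metropolis(n_steps=50_000, step_size=0.2, seed=0):
    rng = random.Random(seed)
    theta = 0.5                                 # arbitrary starting point
    current = log_unnormalised_posterior(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_size)   # nearby random step
        candidate = log_unnormalised_posterior(proposal)
        log_ratio = candidate - current         # log of the density ratio
        # Always accept uphill moves; accept downhill ones with probability = ratio.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            theta, current = proposal, candidate
        samples.append(theta)                   # record where we are
    return samples

samples = sorted(metropolis()[5_000:])          # drop early steps as burn-in
mean = sum(samples) / len(samples)
lower = samples[int(0.025 * len(samples))]      # 2.5th percentile
upper = samples[int(0.975 * len(samples))]      # 97.5th percentile
print(f"posterior mean ~ {mean:.2f}, 95% interval ~ ({lower:.2f}, {upper:.2f})")
```

Nothing in the code ever computes the evidence: only the difference of two unnormalised log densities enters the accept step. The mean and the 95% credible interval are then read straight off the sorted samples.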
09
Where it shows up in my work
10