Bayesian Statistics

Statistics as belief-updating. Start with what you think, see some data, and revise, using a rule that standard coherence arguments single out as the only consistent way to learn from evidence.

Studied: Bayesian Statistics, Master of Data Science
When: UniMelb, 2023–2024
Applied in: Reasoning under uncertainty
Read / Refreshed: ~17 min read, 2026-06-25

There are two ways to think about probability, and they lead to two whole traditions of statistics. The frequentist view says a probability is a long-run frequency — and the parameter you're trying to estimate is a fixed, unknown number. The Bayesian view says a probability is a degree of belief — and so the unknown parameter itself has a probability distribution, describing how strongly you believe each possible value. That one shift changes everything downstream.

The appeal of the Bayesian approach is that it matches how we actually reason: you hold a belief, evidence arrives, and you update. This page builds directly on Bayes' rule from the probability page and turns it from a formula into a complete philosophy of learning from data. I've kept it slow and foundational — every step is spelled out, including the algebra.

01

Probability as belief

Suppose you want to know the true conversion rate of a new web page. A frequentist treats that rate as a single fixed number and asks "what data would this number produce?" A Bayesian treats it as uncertain and describes their belief about it with a distribution — maybe "probably around 10%, but could plausibly be anywhere from 5% to 20%."

That distribution is the whole point. Instead of collapsing to a single guess, the Bayesian carries the full shape of their uncertainty through every calculation. When new data arrives, the distribution gets sharper. You never stop having a distribution — you just get more confident about where the truth sits. Three pieces of vocabulary name the stages:

  • Prior — what you believe before seeing the data.
  • Likelihood — how probable the observed data is, for each possible value of the parameter.
  • Posterior — your updated belief after combining the two.

02

The updating engine

Bayes' rule is the machine that turns a prior into a posterior. Writing $\theta$ for the unknown parameter and $D$ for the observed data:

$$\underbrace{p(\theta \mid D)}_{\text{posterior}} = \frac{\overbrace{p(D \mid \theta)}^{\text{likelihood}}\;\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(D)}_{\text{evidence}}}$$

Posterior ∝ Likelihood × Prior. The denominator is just a normalising constant that makes the posterior integrate to one.

Read it as a sentence: your updated belief is your prior belief, re-weighted by how well each parameter value predicted the data you actually saw. Parameter values that made the data likely get their belief boosted; values that made it unlikely get suppressed. Because the denominator $p(D)$ doesn't depend on $\theta$, it's just a constant that rescales everything to sum to one, which is why the rule is most usefully remembered in its proportional form:

$$p(\theta \mid D) \;\propto\; p(D \mid \theta)\, p(\theta)$$
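The proportional form can be tried directly on a grid of candidate values: multiply prior by likelihood at each point, then divide by the total. This is a minimal sketch, assuming a flat prior and a made-up conversion dataset (12 conversions in 100 visits, invented for illustration); the function and variable names are my own.

```python
def grid_posterior(prior, likelihood, grid):
    """Posterior weights on a grid: prior times likelihood, then normalised."""
    unnormalised = [prior(t) * likelihood(t) for t in grid]
    total = sum(unnormalised)  # plays the role of the evidence p(D)
    return [u / total for u in unnormalised]


# Hypothetical data, invented for illustration: 12 conversions in 100 visits.
conversions, visits = 12, 100


def flat_prior(theta):
    return 1.0  # every rate equally plausible before seeing data


def binomial_kernel(theta):
    # Likelihood up to a constant; the binomial coefficient cancels on normalising.
    return theta**conversions * (1 - theta) ** (visits - conversions)


grid = [(i + 0.5) / 1000 for i in range(1000)]  # candidate rates inside (0, 1)
posterior = grid_posterior(flat_prior, binomial_kernel, grid)

posterior_mean = sum(t * w for t, w in zip(grid, posterior))
best = max(zip(posterior, grid))[1]  # grid value with the largest posterior weight
print(f"posterior mean ~ {posterior_mean:.3f}, most probable grid value ~ {best:.3f}")
```

Note that the normalising step never needs a formula for the evidence; on a one-dimensional grid it is just a sum.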

03

Where Bayes' rule comes from

Bayes' rule isn't an extra assumption: it falls straight out of the definition of conditional probability. Start from the fact that the joint probability of two events can be factored two equivalent ways:

$$p(A, B) = p(A \mid B)\,p(B) = p(B \mid A)\,p(A)$$

Both expressions equal the same joint probability, so set the right-hand sides equal and divide by $p(B)$:

$$p(A \mid B)\,p(B) = p(B \mid A)\,p(A) \quad\Longrightarrow\quad p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}$$

Substitute $\theta$ for $A$ and the data $D$ for $B$ and you have the Bayesian engine above. The maths is elementary; the interpretation, that $p(\theta)$ is a belief you're allowed to hold and update, is the bold part.
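The derivation can be checked by hand on a tiny joint table. The sketch below assumes a made-up joint distribution over two binary events (the four numbers are invented for illustration) and confirms that conditioning directly and going through Bayes' rule give the same answer.

```python
# Hypothetical joint probabilities p(A, B), invented for illustration; they sum to 1.
joint = {
    (True, True): 0.08,
    (True, False): 0.02,
    (False, True): 0.10,
    (False, False): 0.80,
}


def marginal_a(a):
    """p(A = a), summing the joint over B."""
    return sum(p for (x, _), p in joint.items() if x == a)


def marginal_b(b):
    """p(B = b), summing the joint over A."""
    return sum(p for (_, y), p in joint.items() if y == b)


def a_given_b(a, b):
    """p(A = a | B = b) from the definition of conditional probability."""
    return joint[(a, b)] / marginal_b(b)


def b_given_a(b, a):
    """p(B = b | A = a) from the definition of conditional probability."""
    return joint[(a, b)] / marginal_a(a)


# Bayes' rule: p(A | B) = p(B | A) p(A) / p(B)
direct = a_given_b(True, True)
via_bayes = b_given_a(True, True) * marginal_a(True) / marginal_b(True)
print(direct, via_bayes)
```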

04

A worked example: the coin

Nothing makes this concrete like watching one update happen. Suppose you have a coin and want to learn its bias $\theta$, the probability it lands heads. You flip it $n$ times and see $k$ heads.

The prior. Belief about a probability lives on the interval $[0, 1]$, and the natural distribution there is the Beta distribution, $\text{Beta}(\alpha, \beta)$. Its two parameters act like counts of imagined prior heads and tails, so $\text{Beta}(1, 1)$ is flat: "I have no idea, any bias is equally plausible."

The likelihood. The probability of seeing $k$ heads in $n$ flips, for a given bias, is the Binomial likelihood $\binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}$. The binomial coefficient doesn't depend on $\theta$, so in the proportional form only $\theta^{k}(1-\theta)^{n-k}$ matters.

The update. Multiply prior by likelihood (the proportional form) and watch what happens to the exponents:

$$\begin{aligned} p(\theta \mid D) &\propto \underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{likelihood}} \cdot \underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}} \\[4pt] &= \theta^{\,\alpha + k - 1}\,(1-\theta)^{\,\beta + n - k - 1} \\[4pt] &\Longrightarrow\; \theta \mid D \sim \text{Beta}(\alpha + k,\; \beta + n - k) \end{aligned}$$

The second line is the unnormalised form of a Beta density, so the posterior is another Beta distribution: you just add your observed heads to $\alpha$ and your observed tails to $\beta$. When the posterior has the same form as the prior like this, the prior is called conjugate, and the update is pure arithmetic. Start at $\text{Beta}(1,1)$, flip 8 heads in 10, and your belief becomes $\text{Beta}(9, 3)$, with mean 0.75 and peak at 0.8, but still honestly uncertain.
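Because the update is pure arithmetic, it fits in a few lines. A minimal sketch of the coin example; the helper names are my own, and the mean and mode formulas are the standard ones for a Beta distribution.

```python
def update_beta(alpha, beta, heads, tails):
    """Conjugate update: add observed heads to alpha and observed tails to beta."""
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


def beta_mode(alpha, beta):
    """Peak of the density; this formula holds only when alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)


# Flat prior Beta(1, 1), then 8 heads and 2 tails in 10 flips.
alpha_post, beta_post = update_beta(1, 1, heads=8, tails=2)
print(alpha_post, beta_post)
print(beta_mean(alpha_post, beta_post), beta_mode(alpha_post, beta_post))
```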

[Figure: the prior and the data (likelihood) feed into Bayes' rule, which outputs the posterior; the posterior becomes the next prior.]
The Bayesian loop. Prior belief meets the likelihood of the observed data; Bayes' rule fuses them into a posterior — which becomes the prior for the next batch of evidence. Each cycle sharpens the distribution.
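The loop can be sketched for the coin by feeding flips in batches and reusing each posterior as the next prior. The flips below are invented for illustration, and the function name is my own. For this conjugate model, updating batch by batch lands on the same belief as updating on all the data at once.

```python
def update_beta(alpha, beta, flips):
    """Conjugate Beta update from a sequence of flips (True means heads)."""
    heads = sum(flips)
    return alpha + heads, beta + len(flips) - heads


# Hypothetical flips, invented for illustration, arriving in two batches.
batch_one = [True, True, False, True, True]
batch_two = [True, False, True, True, True]

belief = (1, 1)  # flat prior
for batch in (batch_one, batch_two):
    belief = update_beta(*belief, batch)  # the last posterior is the new prior

all_at_once = update_beta(1, 1, batch_one + batch_two)
print(belief, all_at_once)
```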

05

Choosing a prior

The prior is the Bayesian's most powerful tool and the most common target of criticism. It lets you fold in genuine knowledge, but it also means two analysts can reach different conclusions from the same data. How you pick it matters:

  • Informative priors encode real prior knowledge ("past trials put this drug's success near 30%"). They help most when data is scarce, steadying an estimate that little data would otherwise leave wild.
  • Weak / uninformative priors stay deliberately vague (a flat $\text{Beta}(1,1)$), letting the data dominate. A common honest default.
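How much the choice matters depends on how much data you have. The sketch below compares a flat prior with a hypothetical informative prior, Beta(3, 7), which I picked to represent "success near 30%" with the weight of ten imagined trials. Both datasets are invented for illustration.

```python
def posterior_mean(alpha, beta, successes, trials):
    """Mean of the Beta posterior after a conjugate update."""
    return (alpha + successes) / (alpha + beta + trials)


flat = (1, 1)
informative = (3, 7)  # hypothetical: prior mean 0.3, weight of 10 imagined trials

# Hypothetical datasets, invented for illustration: (successes, trials).
small = (2, 3)
large = (200, 300)

gap_small = abs(posterior_mean(*flat, *small) - posterior_mean(*informative, *small))
gap_large = abs(posterior_mean(*flat, *large) - posterior_mean(*informative, *large))
print(f"disagreement with 3 trials:   {gap_small:.3f}")
print(f"disagreement with 300 trials: {gap_large:.3f}")
```

With only three trials the two analysts disagree noticeably; with three hundred the data has mostly washed the prior out.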

06

Credible vs confidence intervals

Once you have a posterior distribution, summarising it is easy and, finally, intuitive. A 95% credible interval is any range containing 95% of the posterior probability, and it means exactly what people wish a confidence interval meant:

$$P(a \le \theta \le b \mid D) = 0.95$$
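For the coin's Beta(9, 3) posterior, one such pair of endpoints can be approximated numerically. This is a rough sketch that accumulates the posterior on a fine grid and cuts 2.5% of the mass from each tail (the equal-tailed choice is my assumption; other 95% intervals exist). The function names and the grid size are my own.

```python
def beta_kernel(theta, alpha, beta):
    """Unnormalised Beta density."""
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1)


def equal_tailed_interval(alpha, beta, mass=0.95, points=20_000):
    """Approximate equal-tailed credible interval by accumulating a grid CDF."""
    step = 1.0 / points
    grid = [(i + 0.5) * step for i in range(points)]
    weights = [beta_kernel(t, alpha, beta) for t in grid]
    total = sum(weights)
    tail = (1 - mass) / 2
    lower = upper = None
    running = 0.0
    for t, w in zip(grid, weights):
        running += w / total  # posterior probability accumulated so far
        if lower is None and running >= tail:
            lower = t
        if upper is None and running >= 1 - tail:
            upper = t
            break
    return lower, upper


low, high = equal_tailed_interval(9, 3)
print(f"95% credible interval for the coin's bias: ({low:.3f}, {high:.3f})")
```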

"Given the data, there's a 95% probability the parameter is in this range" — a direct statement about the parameter. Contrast the frequentist confidence interval, whose 95% is a property of the long-run procedure, not of any single interval. The Bayesian version is what most people incorrectly assume a confidence interval already says — and getting to say it honestly is a real selling point of the approach.

07

Why it gets hard

If Bayes' rule is so clean, why isn't everything Bayesian? The trouble is that denominator. The evidence $p(D)$ requires summing the likelihood × prior over every possible parameter value, which is an integral:

$$p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$$

For the conjugate coin it has a tidy closed form. But for a realistic model with dozens or thousands of parameters, this is a high-dimensional integral with no analytic solution and far too many points to grid out. For decades that intractable integral was the wall that kept Bayesian methods mostly theoretical. The breakthrough was to stop trying to compute it.
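For the coin, both routes are cheap enough to compare. The sketch below estimates the evidence with a midpoint-rule sum and checks it against the closed form for a Beta prior with a Binomial likelihood, which is the standard Beta-Binomial result; the function names are my own. The brute-force route only works here because there is a single parameter: a grid with m points per axis needs m^d evaluations for d parameters.

```python
import math


def evidence_numeric(k, n, alpha, beta, points=100_000):
    """Midpoint-rule estimate of p(D), the integral of likelihood x prior over [0, 1]."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    step = 1.0 / points
    total = 0.0
    for i in range(points):
        theta = (i + 0.5) * step
        likelihood = math.comb(n, k) * theta**k * (1 - theta) ** (n - k)
        prior = norm * theta ** (alpha - 1) * (1 - theta) ** (beta - 1)
        total += likelihood * prior * step
    return total


def log_beta_function(a, b):
    """log of the Beta function B(a, b), via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def evidence_closed_form(k, n, alpha, beta):
    """Beta-Binomial marginal: C(n, k) * B(alpha + k, beta + n - k) / B(alpha, beta)."""
    log_ratio = log_beta_function(alpha + k, beta + n - k) - log_beta_function(alpha, beta)
    return math.comb(n, k) * math.exp(log_ratio)


numeric = evidence_numeric(8, 10, 1, 1)
exact = evidence_closed_form(8, 10, 1, 1)
print(numeric, exact)
```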

08

MCMC: sampling the posterior

The insight that made Bayesian statistics practical: you rarely need the posterior's formula — you just need to be able to draw samples from it. With enough samples you can estimate any summary you want (the mean, a credible interval) by simply measuring the sample. And you can sample a distribution even when you only know it up to that pesky constant.

Markov Chain Monte Carlo (MCMC) does exactly this. It builds a random walk through parameter space whose rule is rigged so that it lingers in high-posterior regions in proportion to their probability. The classic Metropolis-Hastings recipe is intuitive:

  • Stand at the current parameter value, and propose a nearby random step.
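  • If the proposal has higher posterior density, move there. If lower, move there only sometimes, with probability equal to the ratio of the two densities. (This simple form assumes a symmetric proposal; the general Metropolis-Hastings ratio adds a correction for asymmetric ones.)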
  • Record where you are, and repeat — for thousands of steps.

Crucially, that acceptance ratio cancels the intractable $p(D)$: it appears top and bottom and divides out, so you never have to compute the integral. Once the chain has settled, the collected trail of positions behaves like a (correlated) sample from the posterior. Later algorithms such as Gibbs sampling and Hamiltonian Monte Carlo, and tools built on them such as Stan and PyMC, are smarter versions of this same idea, and they're what make Bayesian modelling usable on real problems today.
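The recipe above can be sketched for the coin example, where the exact answer (a Beta(9, 3) posterior with mean 0.75) is known and can serve as a check. This is a bare-bones random-walk Metropolis sampler, not how Stan or PyMC work internally. The step size, chain length, burn-in length and seed are arbitrary choices of mine, and the sampler only ever sees likelihood × prior, never the evidence.

```python
import math
import random


def log_unnormalised_posterior(theta, k=8, n=10, alpha=1.0, beta=1.0):
    """log of likelihood x prior for the coin; the evidence p(D) is never needed."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero density outside (0, 1)
    return (alpha + k - 1) * math.log(theta) + (beta + n - k - 1) * math.log(1 - theta)


def metropolis(log_density, start, steps, step_size, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    current = start
    current_log = log_density(current)
    trail = []
    for _ in range(steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_log = log_density(proposal)
        # Accept with probability min(1, density ratio), compared in log space.
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < proposal_log - current_log:
            current, current_log = proposal, proposal_log
        trail.append(current)  # record the position whether or not we moved
    return trail


rng = random.Random(0)  # fixed seed so the run is repeatable
trail = metropolis(log_unnormalised_posterior, start=0.5, steps=50_000, step_size=0.2, rng=rng)
samples = trail[5_000:]  # drop an initial burn-in stretch
sample_mean = sum(samples) / len(samples)
print(f"sample mean ~ {sample_mean:.3f}")
```

Working with log densities avoids underflow when the likelihood is a product of many small numbers, which is the usual reason samplers are written this way.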

09

Where it shows up in my work

10

Refresh in 60 seconds