Bayesian Statistics

There are two ways to think about probability, and they lead to two whole traditions of statistics. The frequentist view says a probability is a long-run frequency — and the parameter you're trying to estimate is a fixed, unknown number. The Bayesian view says a probability is a degree of belief — and so the unknown parameter itself has a probability distribution, describing how strongly you believe each possible value. That one shift changes everything downstream.

The appeal of the Bayesian approach is that it matches how we actually reason: you hold a belief, evidence arrives, and you update. This page builds directly on Bayes' rule from the probability page and turns it from a formula into a complete philosophy of learning from data. I've kept it slow and foundational — every step is spelled out, including the algebra.

Probability as belief

Suppose you want to know the true conversion rate of a new web page. A frequentist treats that rate as a single fixed number and asks "what data would this number produce?" A Bayesian treats it as uncertain and describes their belief about it with a distribution — maybe "probably around 10%, but could plausibly be anywhere from 5% to 20%."

That distribution is the whole point. Instead of collapsing to a single guess, the Bayesian carries the full shape of their uncertainty through every calculation. When new data arrives, the distribution gets sharper. You never stop having a distribution — you just get more confident about where the truth sits. Three pieces of vocabulary name the stages:

Prior — what you believe before seeing the data.
Likelihood — how probable the observed data is, for each possible value of the parameter.
Posterior — your updated belief after combining the two.

The updating engine

Bayes' rule is the machine that turns a prior into a posterior. Writing $\theta$ for the unknown parameter and $D$ for the observed data:

\underbrace{p(\theta \mid D)}_{\text{posterior}} = \frac{\overbrace{p(D \mid \theta)}^{\text{likelihood}}\;\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(D)}_{\text{evidence}}}

Posterior ∝ Likelihood × Prior. The denominator is just a normalising constant that makes the posterior integrate to one.

Read it as a sentence: your updated belief is your prior belief, re-weighted by how well each parameter value predicted the data you actually saw. Parameter values that made the data likely get their belief boosted; values that made it unlikely get suppressed. Because the denominator $p(D)$ doesn't depend on $\theta$, it's just a constant that rescales the posterior so it integrates to one, which is why the rule is most usefully remembered in its proportional form:

p(\theta \mid D) \;\propto\; p(D \mid \theta)\, p(\theta)
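
To see the proportional form at work, here is a minimal sketch that approximates a posterior on a grid of candidate values. The coin-style data (8 heads in 10 flips), the flat prior, and the 101-point grid are all illustrative assumptions, not part of the formula itself.

```python
# Grid approximation of posterior ∝ likelihood × prior for a probability theta.
# The grid size, prior and data below are illustrative assumptions.

grid = [i / 100 for i in range(101)]          # candidate values of theta
prior = [1.0 for _ in grid]                   # flat prior (unnormalised)

heads, flips = 8, 10                          # hypothetical data

def likelihood(theta):
    # Likelihood up to a constant that does not depend on theta
    return theta ** heads * (1 - theta) ** (flips - heads)

unnormalised = [likelihood(t) * p for t, p in zip(grid, prior)]
total = sum(unnormalised)                     # plays the role of p(D) on the grid
posterior = [u / total for u in unnormalised]

best = grid[posterior.index(max(posterior))]
print(f"posterior sums to {sum(posterior):.3f}, peak at theta = {best}")
```

The only job of `total` is to rescale; the shape of the posterior is already fixed by the product on the line above it.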

Where Bayes' rule comes from

Bayes' rule isn't an extra assumption — it falls straight out of the definition of conditional probability. Start from the fact that the joint probability of two events can be factored two equivalent ways:

p(A, B) = p(A \mid B)\,p(B) = p(B \mid A)\,p(A)

Both expressions equal the same joint probability, so set the right-hand sides equal and divide by $p(B)$ (assuming $p(B) > 0$):

p(A \mid B)\,p(B) = p(B \mid A)\,p(A) \quad\Longrightarrow\quad p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}

Substitute $\theta$ for $A$ and the data $D$ for $B$ and you have the Bayesian engine above. The maths is elementary; the interpretation — that $p(\theta)$ is a belief you're allowed to hold and update — is the bold part.
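
If you'd like to check the algebra with numbers, the sketch below builds a small joint distribution over two events and confirms that the definition of conditional probability and Bayes' rule give the same answer. The four joint probabilities are made up for illustration.

```python
# Check Bayes' rule numerically on a made-up joint distribution of two events.
# The four joint probabilities are illustrative assumptions and sum to one.
joint = {
    (True, True): 0.20,    # p(A, B)
    (True, False): 0.10,   # p(A, not B)
    (False, True): 0.20,   # p(not A, B)
    (False, False): 0.50,  # p(not A, not B)
}

p_a = sum(p for (a, _), p in joint.items() if a)   # marginal p(A)
p_b = sum(p for (_, b), p in joint.items() if b)   # marginal p(B)

p_a_given_b = joint[(True, True)] / p_b            # definition of p(A | B)
p_b_given_a = joint[(True, True)] / p_a            # definition of p(B | A)

via_bayes = p_b_given_a * p_a / p_b                # Bayes' rule
print(p_a_given_b, via_bayes)
```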

A worked example: the coin

Nothing makes this concrete like watching one update happen. Suppose you have a coin and want to learn its bias $\theta$ — the probability it lands heads. You flip it $n$ times and see $k$ heads.

The prior. Belief about a probability lives on the interval $[0, 1]$, and the natural distribution there is the Beta distribution, $\text{Beta}(\alpha, \beta)$, whose density is proportional to $\theta^{\alpha-1}(1-\theta)^{\beta-1}$. Its two parameters act like counts of imagined prior heads and tails (more precisely, $\alpha - 1$ heads and $\beta - 1$ tails on top of a flat start), so $\text{Beta}(1, 1)$ is flat: "I have no idea, any bias is equally plausible."

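The likelihood. The probability of seeing $k$ heads in $n$ flips, for a given bias, is the Binomial likelihood $\binom{n}{k}\theta^{k}(1-\theta)^{n-k}$. The coefficient $\binom{n}{k}$ doesn't depend on $\theta$, so in the proportional form only $\theta^{k}(1-\theta)^{n-k}$ matters.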

The update. Multiply prior by likelihood (the proportional form) and watch what happens to the exponents:

\begin{aligned} p(\theta \mid D) &\propto \underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{likelihood}} \cdot \underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}} \\[4pt] &= \theta^{\,\alpha + k - 1}\,(1-\theta)^{\,\beta + n - k - 1} \\[4pt] &\propto \text{Beta}(\alpha + k,\; \beta + n - k) \end{aligned}

The posterior is another Beta distribution: you just add your observed heads to $\alpha$ and your observed tails to $\beta$. When the posterior has the same form as the prior like this, the prior is called conjugate, and the update is pure arithmetic. Start at $\text{Beta}(1,1)$, flip 8 heads in 10, and your belief becomes $\text{Beta}(9, 3)$, with mean 0.75 and peak (mode) at 0.8, but still honestly uncertain.
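
The conjugate update is short enough to write out in full. This is a minimal sketch; the helper names (`update_beta`, `beta_mean`, `beta_mode`) are my own for illustration and don't come from any library.

```python
# Conjugate Beta-Binomial update: add heads to alpha and tails to beta.
# Function and variable names here are illustrative, not from any library.

def update_beta(alpha, beta, heads, tails):
    """Return the Beta posterior parameters after observing heads and tails."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_mode(alpha, beta):
    # Valid only when alpha > 1 and beta > 1
    return (alpha - 1) / (alpha + beta - 2)

alpha, beta = update_beta(1, 1, heads=8, tails=2)   # flat prior, 8 heads in 10 flips
print(alpha, beta)                                   # 9 3
print(beta_mean(alpha, beta), beta_mode(alpha, beta))
```

The mean and the mode differ because the Beta(9, 3) density is skewed: the mean sits at 0.75 while the highest point of the curve sits at 0.8.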

The Bayesian loop. Prior belief meets the likelihood of the observed data; Bayes' rule fuses them into a posterior — which becomes the prior for the next batch of evidence. Each cycle sharpens the distribution.
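
The loop can be sketched by feeding the coin data in batches. The batch sizes below are hypothetical; the point is that updating one batch at a time ends at the same posterior as a single update on all the data pooled together.

```python
# Sequential updating: each posterior is reused as the prior for the next batch.
batches = [(3, 1), (2, 1), (3, 0)]   # hypothetical (heads, tails) per batch; 8 heads, 2 tails in total

alpha, beta = 1, 1                    # flat Beta(1, 1) prior
for heads, tails in batches:
    alpha, beta = alpha + heads, beta + tails   # posterior becomes the next prior
    print(f"after batch: Beta({alpha}, {beta})")

# One update on the pooled data lands in the same place
pooled = (1 + sum(h for h, _ in batches), 1 + sum(t for _, t in batches))
print(pooled == (alpha, beta))
```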

Choosing a prior

The prior is the Bayesian's most powerful tool and the most common target of criticism. It lets you fold in genuine knowledge, but it also means two analysts can reach different conclusions from the same data. How you pick it matters:

Informative priors encode real prior knowledge ("past trials put this drug's success near 30%"). They help most when data is scarce, steadying an estimate that little data would otherwise leave wild.
Weak / uninformative priors stay deliberately vague (a flat $\text{Beta}(1,1)$), letting the data dominate. A common honest default.
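
A quick way to feel the difference is to run both kinds of prior against the same data. In this sketch the informative prior is a hypothetical Beta(30, 70), standing in for "success near 30%" with roughly the weight of 100 past observations, and both data sets are invented.

```python
# Compare an informative and a flat prior on the same data.
def posterior_mean(prior_alpha, prior_beta, successes, failures):
    alpha = prior_alpha + successes
    beta = prior_beta + failures
    return alpha / (alpha + beta)

# Hypothetical data sets
small = (2, 0)        # two successes, no failures
large = (200, 300)    # 40% success rate over 500 trials

for label, (s, f) in [("small", small), ("large", large)]:
    flat = posterior_mean(1, 1, s, f)
    informative = posterior_mean(30, 70, s, f)
    print(f"{label}: flat prior -> {flat:.3f}, informative prior -> {informative:.3f}")
```

With two observations the flat prior jumps to 0.75 while the informative prior barely moves from 0.30. With 500 observations the two posteriors land close together, because the data now outweighs either prior.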

Credible vs confidence intervals

Once you have a posterior distribution, summarising it is easy and — finally — intuitive. A 95% credible interval is any range containing 95% of the posterior probability, and it means exactly what people wish a confidence interval meant:

P(a \le \theta \le b \mid D) = 0.95

"Given the data, there's a 95% probability the parameter is in this range" — a direct statement about the parameter. Contrast the frequentist confidence interval, whose 95% is a property of the long-run procedure, not of any single interval. The Bayesian version is what most people incorrectly assume a confidence interval already says — and getting to say it honestly is a real selling point of the approach.

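For the coin posterior Beta(9, 3), a credible interval can be read straight off a pile of posterior draws. This sketch uses the standard library's `random.betavariate` and takes the equal-tailed interval (2.5% cut from each side), which is one common choice among the many ranges that contain 95% of the mass.

```python
import random

random.seed(0)   # fixed seed so the sketch is repeatable

# Draw from the Beta(9, 3) posterior and read the interval off the sorted samples.
samples = sorted(random.betavariate(9, 3) for _ in range(100_000))

lower = samples[int(0.025 * len(samples))]    # 2.5th percentile
upper = samples[int(0.975 * len(samples))]    # 97.5th percentile
inside = sum(lower <= s <= upper for s in samples) / len(samples)

print(f"95% credible interval: [{lower:.3f}, {upper:.3f}], mass inside = {inside:.3f}")
```
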
Why it gets hard

If Bayes' rule is so clean, why isn't everything Bayesian? The trouble is that denominator. The evidence $p(D)$ requires summing the likelihood × prior over every possible parameter value — an integral:

p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta

For the conjugate coin it has a tidy closed form. But for a realistic model with dozens or thousands of parameters, this is a high-dimensional integral with no analytic solution and far too many points to grid out. For decades that intractable integral was the wall that kept Bayesian methods mostly theoretical. The breakthrough was to stop trying to compute it.
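
For one parameter the integral is easy to brute-force, which makes the contrast with many parameters concrete. The sketch below assumes the coin data from earlier (8 heads in 10 flips) and a flat prior, for which the evidence is known to be $1/(n+1)$; the grid of 10,000 points is an arbitrary choice.

```python
import math

heads, flips = 8, 10   # hypothetical data, flat Beta(1, 1) prior

def integrand(theta):
    # likelihood × prior; the flat prior density is 1 everywhere on [0, 1]
    likelihood = math.comb(flips, heads) * theta ** heads * (1 - theta) ** (flips - heads)
    return likelihood * 1.0

# Midpoint rule on a one-dimensional grid
steps = 10_000
width = 1 / steps
evidence = sum(integrand((i + 0.5) * width) for i in range(steps)) * width

closed_form = 1 / (flips + 1)   # known result for a flat prior with a Binomial likelihood
print(evidence, closed_form)

# The same grid resolution in d dimensions needs steps ** d evaluations
print(f"points needed for 20 parameters: {steps ** 20:.1e}")
```

The last line is the whole problem in one number: a grid that is trivial for one parameter becomes impossible long before you reach a realistic model.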

MCMC: sampling the posterior

The insight that made Bayesian statistics practical: you rarely need the posterior's formula — you just need to be able to draw samples from it. With enough samples you can estimate any summary you want (the mean, a credible interval) by simply measuring the sample. And you can sample a distribution even when you only know it up to that pesky constant.

Markov Chain Monte Carlo (MCMC) does exactly this. It builds a random walk through parameter space whose rule is rigged so that it lingers in high-posterior regions in proportion to their probability. The classic Metropolis-Hastings recipe is intuitive:

Stand at the current parameter value, and propose a nearby random step.
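If the proposal has higher posterior density, move there. If lower, move there only sometimes, with probability equal to the ratio of the two densities (proposal over current). This simple rule assumes a symmetric proposal, where stepping from A to B is as likely as stepping from B to A; the general Metropolis-Hastings ratio adds a correction for asymmetric proposals.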
Record where you are, and repeat — for thousands of steps.

Crucially, that acceptance ratio cancels the intractable $p(D)$: it appears top and bottom and divides out, so you never have to compute the integral. Once the chain has settled, the collected trail of positions behaves like a (correlated) sample from the posterior. Other samplers such as Gibbs sampling and Hamiltonian Monte Carlo are refinements of this same idea, and libraries like Stan and PyMC package them; they're what make Bayesian modelling usable on real problems today.
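
Here is a minimal random-walk Metropolis sketch for the coin posterior, where the answer is already known to be Beta(9, 3) and so easy to check. It assumes a flat prior, 8 heads and 2 tails, and a symmetric Gaussian proposal; the step size, chain length and burn-in are arbitrary illustrative choices, and the function names are my own.

```python
import math
import random

def log_unnormalised_posterior(theta, heads=8, tails=2):
    # Flat prior times Binomial likelihood, without p(D); -inf outside (0, 1)
    if not 0 < theta < 1:
        return float("-inf")
    return heads * math.log(theta) + tails * math.log(1 - theta)

def metropolis(n_steps, step_size=0.1, start=0.5, seed=0):
    rng = random.Random(seed)
    current = start
    current_lp = log_unnormalised_posterior(current)
    trail = []
    for _ in range(n_steps):
        proposal = current + rng.gauss(0, step_size)     # symmetric random-walk step
        proposal_lp = log_unnormalised_posterior(proposal)
        # Accept with probability min(1, density ratio); p(D) would cancel in this ratio
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        trail.append(current)                            # record position either way
    return trail

trail = metropolis(50_000)
kept = trail[5_000:]                                     # drop early steps as burn-in
print(f"posterior mean estimate: {sum(kept) / len(kept):.3f}")
```

The estimate should land close to the exact posterior mean of 0.75. Working with log densities avoids numerical underflow, and notice that `log_unnormalised_posterior` never touches the evidence integral.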

Where it shows up in my work

Reasoning the way the world actually works

The Bayesian habit — start from a prior, update on evidence — is how good analysis under uncertainty actually feels, even when I'm not writing a formal model. It's the right frame whenever data is scarce and prior knowledge is genuinely worth something (early-stage experiments, rare events), and whenever a decision needs an honest probability of being right rather than a reject/accept verdict — a credible interval a stakeholder can act on beats a p-value they'll misread.

It also pairs naturally with the rest of the foundation: the probability page gave the rule, the statistics page gave the frequentist contrast, and the same prior-times-likelihood logic underlies the model likelihoods in machine learning. Knowing both schools, and when each fits, is the actual skill.

Refresh in 60 seconds

Bayesian probability is belief; the unknown parameter has a distribution you update as data arrives.
The engine: posterior ∝ likelihood × prior. It falls straight out of the definition of conditional probability.
Conjugate priors make the update arithmetic: a Beta prior + Binomial data → a Beta posterior (add heads to α, tails to β). Today's posterior is tomorrow's prior.
Priors encode knowledge (informative) or step back (weak). Answer the subjectivity critique with enough data + a sensitivity analysis.
A credible interval means what people wish a confidence interval did: P(parameter in range | data) = 0.95.
The evidence integral $p(D)$ is usually intractable, so use MCMC to sample the posterior — the acceptance ratio cancels the constant, so you never compute it.