Probability
The mathematics of uncertainty. Before you can model the world, you need a rigorous way to say how likely something is — and to update that belief when the evidence arrives.
- Studied: Probability, Bachelor of Science · Data Science core
- When: UniMelb, 2019–2022
- Applied in: Bayesian methods · A/B testing
- Read / Refreshed: ~15 min read · 2026-06-25
Every dataset is a sample, every model has error bars, and every prediction is really a statement about likelihood. Probability is the rigorous language for all of it — the foundation under statistics, the engine inside Bayesian methods, and the thing that lets you say not just "this will happen" but "this will happen, and here's how sure I am."
If linear algebra is the grammar of data's shape, probability is the grammar of its uncertainty. This page builds from the three axioms up to the two theorems that make statistics possible — and spends real time on Bayes' rule, because getting it wrong is the most expensive mistake in applied data work.
01
The language of uncertainty
There are two honest ways to read a probability, and good data scientists hold both. The frequentist view: a probability is the long-run frequency of an event if you repeated the experiment forever — a fair coin is "0.5 heads" because that's the limit of the proportion. The Bayesian view: a probability is a degree of belief, updated as evidence arrives — useful when you can't repeat the experiment ("what's the chance this customer churns?").
They rarely disagree on the maths; they frame different questions. The axioms below hold for both.
02
Sample spaces, events, axioms
Three pieces of vocabulary, then the whole edifice:
- Sample space (Ω): the set of all possible outcomes. For one die roll, {1,2,3,4,5,6}.
- Event: any subset of the sample space. "Roll an even number" is the event {2,4,6}.
- Probability: a number assigned to each event, obeying three rules.
Everything in probability follows from Kolmogorov's three axioms:
- Probabilities are never negative: P(A) ≥ 0.
- Something in the sample space happens for certain: P(Ω) = 1.
- For mutually exclusive events, probabilities add: P(A ∪ B) = P(A) + P(B).
That's it. The complement rule (P(not A) = 1 − P(A)) and the general addition rule (P(A ∪ B) = P(A) + P(B) − P(A ∩ B), which subtracts the double-counted overlap) are both consequences, not new assumptions.
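As a quick check, the complement and general addition rules can be verified by counting outcomes on the die example. This is a minimal sketch that assumes a fair die; the helper `prob` and the event `at_least_four` are illustrative names, not part of the original text.

```python
from fractions import Fraction

# Sample space for one roll of a fair die; every outcome is equally likely.
OMEGA = frozenset(range(1, 7))


def prob(event):
    """P(event) under the uniform measure on OMEGA (exact, as a Fraction)."""
    return Fraction(len(set(event) & OMEGA), len(OMEGA))


even = {2, 4, 6}
at_least_four = {4, 5, 6}

# Complement rule: P(not A) = 1 - P(A)
p_not_even = prob(OMEGA - even)

# General addition rule: the overlap {4, 6} would otherwise be counted twice.
union_direct = prob(even | at_least_four)
union_by_rule = prob(even) + prob(at_least_four) - prob(even & at_least_four)

print(p_not_even, union_direct, union_by_rule)  # 1/2 2/3 2/3
```

Using `Fraction` keeps the arithmetic exact, so the two sides of each rule compare equal without floating-point tolerance.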
03
Conditional probability & independence
Most real questions are conditional: not "what's the probability of rain?" but "what's the probability of rain given the sky is grey?" Conditional probability is the probability of A once you know B has happened:
P(A | B) = P(A ∩ B) / P(B), defined whenever P(B) > 0.
You're rescaling the world to the slice where B is true, then asking how much of that slice also has A. Rearranging gives the multiplication rule P(A ∩ B) = P(A | B) · P(B).
Two events are independent when knowing one tells you nothing about the other — P(A | B) = P(A), equivalently P(A ∩ B) = P(A) · P(B). Independence is an assumption you should earn, not assume: it's what lets you multiply probabilities, and wrongly assuming it (correlated features, repeated measurements on the same person) quietly corrupts a lot of models.
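To make the "rescale to the slice where B is true" idea concrete, here is a small sketch that enumerates two fair dice. The event names and the helpers `prob` and `cond_prob` are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first die, second die) of two fair dice.
OMEGA = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(event), where event is a predicate on an outcome."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))


def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B); only defined when P(B) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(lambda w: a(w) and b(w)) / p_b


def first_is_six(w):
    return w[0] == 6


def second_is_six(w):
    return w[1] == 6


def total_at_least_ten(w):
    return w[0] + w[1] >= 10


# Independent: knowing the first die says nothing about the second.
print(prob(second_is_six), cond_prob(second_is_six, first_is_six))  # 1/6 1/6

# Dependent: a six on the first die makes a high total far more likely.
print(prob(total_at_least_ten), cond_prob(total_at_least_ten, first_is_six))  # 1/6 1/2
```

The second pair shows why multiplying P(A) · P(B) is only safe after independence has been checked: the joint probability of "first is six" and "total at least ten" is 1/12, not 1/36.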
04
Bayes' rule
Bayes' rule is how you flip a conditional around, turning P(evidence | hypothesis), which you can often measure, into P(hypothesis | evidence), which is what you actually want:
P(H | E) = P(E | H) · P(H) / P(E)
Read it as belief-updating: P(H) is your prior (belief before evidence), P(E | H) is the likelihood (how well the hypothesis predicts the evidence), and P(H | E) is the posterior (belief after). The denominator just normalises so it's a valid probability.
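A worked example helps show how much the prior matters. The sketch below expands the denominator with the law of total probability, P(E) = P(E | H) · P(H) + P(E | not H) · P(not H). The numbers (a 1% base rate, a test that fires 95% of the time when the hypothesis is true and 5% of the time when it is false) are hypothetical, as is the function name `posterior`.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H | E) via Bayes' rule, with P(E) from the law of total probability."""
    evidence = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / evidence


# Hypothetical inputs: rare hypothesis, fairly accurate test.
p = posterior(prior=0.01, p_e_given_h=0.95, p_e_given_not_h=0.05)
print(round(p, 3))  # 0.161

# A second positive result updates again, using the first posterior as the
# new prior (this assumes the two tests are independent given H).
p2 = posterior(prior=p, p_e_given_h=0.95, p_e_given_not_h=0.05)
print(round(p2, 3))  # 0.785
```

Even with a 95% hit rate, one positive result only lifts the belief to about 16%, because true positives from the rare hypothesis are outnumbered by false positives from the common alternative.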
05
Random variables
A random variable is a number attached to a random outcome — the bridge from events to arithmetic. "Number of heads in 10 flips" or "tomorrow's temperature" are random variables. Two kinds:
- Discrete: countable values (a dice total, a click count). Described by a probability mass function P(X = x) that gives each value's probability.
- Continuous: values on a range (height, time). Described by a probability density function; here probability is area under the curve, so you ask for P(a ≤ X ≤ b), and the probability of any single exact value is zero.
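The difference between the two kinds can be sketched with the standard library. The discrete case builds the probability mass function of a two-dice total; the continuous case uses `statistics.NormalDist` with a hypothetical height model (mean 170, standard deviation 10), chosen purely for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from statistics import NormalDist

# Discrete: PMF of the total of two fair dice, P(X = x) for each x.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {total: Fraction(n, 36) for total, n in counts.items()}
print(pmf[7])  # 1/6

# Continuous: a hypothetical Normal(170, 10) model for height.
heights = NormalDist(mu=170, sigma=10)


def interval_prob(dist, a, b):
    """P(a <= X <= b) as area under the density, via the CDF."""
    return dist.cdf(b) - dist.cdf(a)


p_within = interval_prob(heights, 160, 180)  # roughly 0.68
p_exact = interval_prob(heights, 170, 170)   # a single point has zero area
print(round(p_within, 4), p_exact)
```

Note that `heights.pdf(170)` returns a density, not a probability; it only becomes a probability once integrated over an interval.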
06
Distributions worth knowing
A handful of distributions cover an enormous share of real problems. Recognising which one fits a situation is half of applied probability.
- Bernoulli: a single yes/no trial with probability p (one coin flip, one conversion).
- Binomial: the number of successes in n independent Bernoulli trials (conversions from 1,000 visitors).
- Poisson: the count of rare events in a fixed window (support tickets per hour, typos per page).
- Normal (Gaussian): the bell curve; the default model for measurements clustered around a mean, and, thanks to the theorem below, the distribution that sums and averages tend toward.
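The discrete distributions in this list have short closed-form mass functions. The sketch below writes them out with the standard library; the parameter values (100 visitors at a 3% conversion rate, 4 tickets per hour) are hypothetical and only there to exercise the functions.

```python
import math


def bernoulli_pmf(k, p):
    """P(X = k) for a single trial, k in {0, 1}."""
    return p if k == 1 else 1 - p


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success chance p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events in a window with average rate lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Hypothetical: 100 visitors, each converting with probability 0.03.
n, p = 100, 0.03
binom_mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
print(round(binom_mean, 6))  # 3.0, i.e. n * p

# Hypothetical: support tickets arriving at 4 per hour on average.
lam = 4.0
p_quiet_hour = poisson_pmf(0, lam)  # chance of no tickets at all
print(round(p_quiet_hour, 4))  # 0.0183
```

For the normal distribution, `statistics.NormalDist` already provides `pdf` and `cdf`, so there is no need to hand-roll it.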
07
Expectation and variance
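Two numbers summarise most of what you need from a distribution. The expectation (or mean) is the long-run average, with each value weighted by its probability:
E[X] = Σ x · P(X = x) for a discrete variable, or E[X] = ∫ x · f(x) dx for a continuous one with density f.
The variance measures spread: the average squared distance from the mean. Its square root, the standard deviation σ, is in the same units as the data, which is why it's the one you usually quote:
Var(X) = E[(X − μ)²] = E[X²] − μ², where μ = E[X], and σ = √Var(X).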
Mean tells you where the distribution sits; variance tells you how much you can trust any single draw to be near it. A forecast without a variance is half a forecast.
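Both summaries can be computed directly from a probability mass function. The following is a minimal sketch for a fair die; `expectation` and `variance` are illustrative helper names, and the PMF is represented as a plain dict from value to probability.

```python
import math
from fractions import Fraction


def expectation(pmf):
    """E[X]: each value weighted by its probability."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """Var(X): the average squared distance from the mean."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


# PMF of one fair die roll.
die = {x: Fraction(1, 6) for x in range(1, 7)}

mu = expectation(die)   # 7/2
var = variance(die)     # 35/12
sigma = math.sqrt(var)  # standard deviation, in the same units as the roll

# The shortcut form E[X^2] - (E[X])^2 gives the same variance.
var_shortcut = sum(x**2 * p for x, p in die.items()) - mu**2

print(mu, var, round(sigma, 3))  # 7/2 35/12 1.708
```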
08
The two limit theorems
Two results are why statistics works at all — they connect the messy single sample you actually have to the clean behaviour of the population.
The Law of Large Numbers: as you collect more independent samples from the same distribution, their average converges to the true mean (provided that mean exists). It's the formal promise that more data really does pin down the answer, and the licence behind every "we ran it 10,000 times" simulation.
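A short simulation makes the convergence visible. This sketch rolls a fair die (true mean 3.5) and reports how far the sample mean sits from 3.5 at a few sample sizes; the seed and the sizes are arbitrary choices.

```python
import random


def sample_mean(n, rng):
    """Average of n independent rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n


TRUE_MEAN = 3.5
rng = random.Random(0)  # fixed seed so the run is repeatable

# Absolute error of the sample mean at increasing sample sizes.
errors = {n: abs(sample_mean(n, rng) - TRUE_MEAN) for n in (10, 1_000, 100_000)}

for n, err in errors.items():
    print(f"n = {n:>7}: |sample mean - 3.5| = {err:.4f}")
```

The error is not guaranteed to shrink at every step, since any single run is random, but its typical size falls like 1/√n, which is why the largest sample lands very close to 3.5.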
The Central Limit Theorem is the deeper magic: the average of many independent, identically distributed random variables is approximately normal, whatever distribution the originals came from, as long as that distribution has finite variance. Skewed, lumpy, weird: average enough of them and you get a bell curve. This is why the normal distribution is everywhere, and why, with a large enough sample, you can put approximate confidence intervals around a sample mean without knowing the underlying distribution. It's the bridge from probability to inferential statistics.
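One way to see the theorem at work is to average draws from a strongly skewed distribution and check whether the averages behave like a normal variable. The sketch below uses an exponential distribution with rate 1 (mean 1, standard deviation 1); the sample size, trial count, and seed are arbitrary assumptions.

```python
import math
import random

MU, SIGMA = 1.0, 1.0  # mean and standard deviation of Exponential(rate=1)


def sample_means(n, trials, rng):
    """Means of `trials` independent samples, each of n exponential draws."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(trials)]


rng = random.Random(42)
n, trials = 50, 5_000
means = sample_means(n, trials, rng)

# If the sample mean is roughly Normal(MU, SIGMA / sqrt(n)), about 95% of
# the means should fall within 1.96 standard errors of MU.
std_error = SIGMA / math.sqrt(n)
coverage = sum(abs(m - MU) <= 1.96 * std_error for m in means) / trials

print(f"share within 1.96 standard errors: {coverage:.3f}")
```

The individual draws are far from bell-shaped, yet the share of sample means inside the normal-theory band comes out close to 0.95, which is the same reasoning that underlies a confidence interval for a mean.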
09
Where it shows up in my work
10