Probability

Every dataset is a sample, every model has error bars, and every prediction is really a statement about likelihood. Probability is the rigorous language for all of it — the foundation under statistics, the engine inside Bayesian methods, and the thing that lets you say not just "this will happen" but "this will happen, and here's how sure I am."

If linear algebra is the grammar of data's shape, probability is the grammar of its uncertainty. This page builds from the three axioms up to the two theorems that make statistics possible — and spends real time on Bayes' rule, because getting it wrong is the most expensive mistake in applied data work.

The language of uncertainty

There are two honest ways to read a probability, and good data scientists hold both. The frequentist view: a probability is the long-run frequency of an event if you repeated the experiment forever — a fair coin is "0.5 heads" because that's the limit of the proportion. The Bayesian view: a probability is a degree of belief, updated as evidence arrives — useful when you can't repeat the experiment ("what's the chance this customer churns?").

They rarely disagree on the maths; they frame different questions. The axioms below hold for both.

Sample spaces, events, axioms

Three pieces of vocabulary, then the whole edifice:

Sample space (Ω) — the set of all possible outcomes. For one die roll, {1,2,3,4,5,6}.
Event — any subset of the sample space. "Roll an even number" is the event {2,4,6}.
Probability — a number assigned to each event, obeying three rules.

Everything in probability follows from Kolmogorov's three axioms:

Probabilities are never negative: P(A) ≥ 0.
Something in the sample space happens for certain: P(Ω) = 1.
For mutually exclusive events, probabilities add: P(A ∪ B) = P(A) + P(B), and the same holds for any countable collection of pairwise exclusive events.

That's it. The complement rule (P(not A) = 1 − P(A)) and the general addition rule (P(A ∪ B) = P(A) + P(B) − P(A ∩ B), which subtracts the double-counted overlap) are both consequences, not new assumptions.
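As a quick sanity check, the complement and general addition rules can be verified by brute force on the die example. This is a minimal sketch that assumes a fair die (every outcome equally likely); the helper name `prob` and the event `at_least_five` are illustrative choices, not part of the text above.

```python
from fractions import Fraction

# Sample space for one die roll; fairness (equal likelihood) is an assumption.
omega = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """P(event) under equally likely outcomes: |event| / |omega|."""
    return Fraction(len(event & omega), len(omega))

even = {2, 4, 6}
at_least_five = {5, 6}

# Complement rule: P(not A) = 1 - P(A).
p_not_even = prob(omega - even)

# General addition rule: subtract the double-counted overlap {6}.
lhs = prob(even | at_least_five)
rhs = prob(even) + prob(at_least_five) - prob(even & at_least_five)

print(p_not_even, 1 - prob(even))  # 1/2 1/2
print(lhs, rhs)                    # 2/3 2/3
```

Using `Fraction` keeps the arithmetic exact, so the two sides can be compared with `==` rather than a floating-point tolerance.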

Conditional probability & independence

Most real questions are conditional: not "what's the probability of rain?" but "what's the probability of rain given the sky is grey?" Conditional probability is the probability of A once you know B has happened:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

You're rescaling the world to the slice where B is true, then asking how much of that slice also has A. Rearranging gives the multiplication rule P(A ∩ B) = P(A | B) · P(B).

Two events are independent when knowing one tells you nothing about the other — P(A | B) = P(A), equivalently P(A ∩ B) = P(A) · P(B). Independence is an assumption you should earn, not assume: it's what lets you multiply probabilities, and wrongly assuming it (correlated features, repeated measurements on the same person) quietly corrupts a lot of models.
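The definitions above can be checked by enumerating two dice. The sketch below assumes fair dice and uses illustrative helper names (`prob`, `cond_prob`); it contrasts a pair of independent events with a pair that is clearly dependent.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice (an assumption).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event), where event is a predicate on an outcome (d1, d2)."""
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(lambda o: a(o) and b(o)) / prob(b)

def first_is_six(o):
    return o[0] == 6

def second_is_six(o):
    return o[1] == 6

def total_at_least_ten(o):
    return o[0] + o[1] >= 10

# Independent: knowing the first die does not change the second.
p_second = prob(second_is_six)
p_second_given_first = cond_prob(second_is_six, first_is_six)

# Dependent: a six on the first die makes a high total more likely.
p_total = prob(total_at_least_ten)
p_total_given_first = cond_prob(total_at_least_ten, first_is_six)

print(p_second, p_second_given_first)  # 1/6 1/6
print(p_total, p_total_given_first)    # 1/6 1/2
```

Multiplying P(first is six) by P(total ≥ 10) would give 1/36, while the true joint probability is 1/12, which is the kind of error a wrongly assumed independence introduces.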

Bayes' rule

Bayes' rule is how you flip a conditional around — turning P(evidence | hypothesis), which you can often measure, into P(hypothesis | evidence), which is what you actually want:

P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}

Read it as belief-updating: P(H) is your prior (belief before evidence), P(E | H) is the likelihood (how well the hypothesis predicts the evidence), and P(H | E) is the posterior (belief after). The denominator P(E) is the total probability of seeing the evidence, P(E | H) · P(H) + P(E | not H) · P(not H); dividing by it normalises the result into a valid probability.

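The base-rate example, drawn as a tree. Suppose a condition affects 1 person in 1,000 and the test has a 1% false-positive rate. Of 1,000 people tested, the 999 healthy ones produce about 10 false alarms, far more than the single true positive (assuming the test catches the one real case). So P(sick | positive) ≈ 1 / 11 ≈ 9%.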
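The base-rate arithmetic can be reproduced with a few lines of code. This is a sketch under stated assumptions: a prevalence of 1 in 1,000 and a test that detects every true case (both read off the counts in the example), plus the 1% false-positive rate. The function name `posterior` and its parameter names are illustrative.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(H | E) via Bayes' rule for a binary hypothesis and a positive test."""
    true_pos = sensitivity * prior                  # P(E | H) * P(H)
    false_pos = false_positive_rate * (1 - prior)   # P(E | not H) * P(not H)
    return true_pos / (true_pos + false_pos)        # divide by P(E)

# Assumed inputs: 1 sick person per 1,000, a test that never misses a true
# case, and a 1% false-positive rate.
p_sick_given_positive = posterior(
    prior=0.001, sensitivity=1.0, false_positive_rate=0.01
)
print(f"{p_sick_given_positive:.1%}")  # 9.1%
```

Raising the prior to 10% with the same test pushes the posterior above 90%, which shows how strongly the answer depends on the base rate rather than on the test alone.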

Random variables

A random variable is a number attached to a random outcome — the bridge from events to arithmetic. "Number of heads in 10 flips" or "tomorrow's temperature" are random variables. Two kinds:

Discrete — countable values (a dice total, a click count). Described by a probability mass function P(X = x) that gives each value's probability.
Continuous — values on a range (height, time). Described by a probability density function; here probability is area under the curve, so you ask for P(a ≤ X ≤ b) — the probability of any single exact value is zero.
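The difference between the two kinds can be made concrete in a short sketch. The discrete part enumerates three fair coin flips; the continuous part assumes, purely for illustration, that heights follow a normal distribution with mean 170 and standard deviation 10 (hypothetical numbers), using `statistics.NormalDist` from the standard library.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from statistics import NormalDist

# Discrete: X = number of heads in 3 fair coin flips.
flips = list(product("HT", repeat=3))
counts = Counter(seq.count("H") for seq in flips)
pmf = {x: Fraction(c, len(flips)) for x, c in sorted(counts.items())}
print(pmf)  # each value of X mapped to its probability; the values sum to 1

# Continuous: probability is area under the density, so ask about intervals.
height = NormalDist(mu=170, sigma=10)  # hypothetical parameters

def interval_prob(dist, a, b):
    """P(a <= X <= b) as a difference of cumulative probabilities."""
    return dist.cdf(b) - dist.cdf(a)

p_range = interval_prob(height, 165, 180)
p_point = interval_prob(height, 170, 170)  # a single exact value has zero area
print(p_range, p_point)
```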

Distributions worth knowing

A handful of distributions cover an enormous share of real problems. Recognising which one fits a situation is half of applied probability.

Bernoulli — a single yes/no trial with probability p (one coin flip, one conversion).
Binomial — the number of successes in n independent Bernoulli trials (conversions from 1,000 visitors).
Poisson — the count of rare events in a fixed window (support tickets per hour, typos per page).
Normal (Gaussian) — the bell curve; the default model for measurements clustered around a mean, and — thanks to the theorem below — the distribution that sums and averages tend toward.
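The three discrete distributions in this list have short closed-form mass functions. The sketch below writes them with the standard library only; the function names and the example numbers (1,000 visitors, a 3% conversion rate) are assumptions made for illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def bernoulli_pmf(k, p):
    """A Bernoulli trial is the binomial with n = 1 (k is 0 or 1)."""
    return binomial_pmf(k, 1, p)

def poisson_pmf(k, lam):
    """P(exactly k events) when events arrive at an average rate lam per window."""
    return exp(-lam) * lam**k / factorial(k)

# Hypothetical example: 1,000 visitors, each converting with probability 0.03.
n, p = 1000, 0.03
exact = binomial_pmf(30, n, p)
approx = poisson_pmf(30, n * p)  # Poisson with the same mean, lam = n * p
print(exact, approx)
```

With many trials and a small p, the Poisson value lands close to the binomial one, which is why the Poisson is a common stand-in for counts of rare events.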

The normal distribution. About 68% of values fall within one standard deviation of the mean, 95% within two, 99.7% within three — the rule of thumb behind most confidence intervals.
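The 68 / 95 / 99.7 figures can be recovered from the normal cumulative distribution function. A minimal sketch using `statistics.NormalDist` with its default standard normal (mean 0, standard deviation 1); the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Share of values within k standard deviations of the mean.
shares = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} sd: {share:.1%}")
# within 1 sd: 68.3%
# within 2 sd: 95.4%
# within 3 sd: 99.7%
```

The familiar "95%" is a rounding: two standard deviations cover slightly more, and an exact 95% interval uses about 1.96 standard deviations.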

Expectation and variance

Two numbers summarise most of what you need from a distribution. The expectation (or mean) is the long-run average, with each value weighted by its probability. For a discrete variable (for a continuous one, the sum becomes an integral over the density):

\mathbb{E}[X] = \sum_{x} x\,P(X = x)

The variance measures spread: the average squared distance from the mean μ = E[X]. Its square root, the standard deviation σ, is in the same units as the data, which is why it's the one you usually quote:

\operatorname{Var}(X) = \mathbb{E}\!\left[(X - \mu)^2\right]

Mean tells you where the distribution sits; variance tells you how much you can trust any single draw to be near it. A forecast without a variance is half a forecast.
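Both formulas can be applied directly to a small probability mass function. The sketch below uses one roll of a fair die as the example (an assumption chosen for simplicity) and exact fractions so the results are easy to check by hand.

```python
from fractions import Fraction

# PMF of one fair die roll: each face has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum of x * P(X = x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = E[(X - mu)^2], again a probability-weighted sum
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Standard deviation, in the same units as the rolls themselves.
std_dev = float(variance) ** 0.5

print(mean, variance)     # 7/2 35/12
print(round(std_dev, 3))  # 1.708
```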

The two limit theorems

Two results are why statistics works at all — they connect the messy single sample you actually have to the clean behaviour of the population.

The Law of Large Numbers: as you collect more independent samples from the same distribution, their average converges to the true mean (provided that mean exists). It's the formal promise that more data really does pin down the answer, and the licence behind every "we ran it 10,000 times" simulation.
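A small simulation makes the Law of Large Numbers visible. This sketch rolls a simulated fair die (true mean 3.5) and reports how far the sample average sits from that mean at a few sample sizes; the seed, the sample sizes, and the function name `average_error` are arbitrary illustrative choices.

```python
import random

TRUE_MEAN = 3.5  # expectation of one fair die roll

def average_error(n, rng):
    """Absolute gap between the mean of n simulated rolls and the true mean."""
    rolls = [rng.randint(1, 6) for _ in range(n)]
    return abs(sum(rolls) / n - TRUE_MEAN)

rng = random.Random(0)  # fixed seed so the run is repeatable
errors = {n: average_error(n, rng) for n in (10, 1_000, 100_000)}

for n, err in errors.items():
    print(f"n = {n:>7}: |average - 3.5| = {err:.4f}")
```

The gap tends to shrink as n grows, though not necessarily at every step: the law describes the long-run trend, not a guarantee for each individual run.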

The Central Limit Theorem is the deeper magic: the average of many independent draws from the same distribution is approximately normal, whatever the shape of that distribution, as long as its variance is finite. Skewed, lumpy, weird: average enough of them and you get a bell curve. (Very heavy-tailed distributions without a finite variance, such as the Cauchy, are the exception.) This is why the normal distribution is everywhere, and why you can put approximate confidence intervals around a sample mean without knowing the underlying distribution. It's the bridge from probability to inferential statistics.
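The Central Limit Theorem can also be checked by simulation. The sketch below averages draws from an exponential distribution, which is strongly right-skewed but has a finite variance, then measures how many sample means fall within one and two standard errors of the true mean. The sample size, trial count, and seed are arbitrary assumptions; if the means are roughly normal, the shares should land near 68% and 95%.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

n = 50         # draws averaged into each sample mean
trials = 5000  # number of sample means to collect

# Exponential with rate 1: true mean 1, standard deviation 1, right-skewed.
true_mean, true_sd = 1.0, 1.0
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(trials)
]

# Standard error of the mean: sd / sqrt(n).
std_error = true_sd / n**0.5

def share_within(k):
    """Fraction of sample means within k standard errors of the true mean."""
    hits = sum(abs(m - true_mean) <= k * std_error for m in sample_means)
    return hits / trials

within_one, within_two = share_within(1), share_within(2)
print(f"within 1 SE: {within_one:.1%}, within 2 SE: {within_two:.1%}")
```

The individual draws look nothing like a bell curve, yet their averages behave much like one, which is what makes a normal-based confidence interval for a sample mean reasonable.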