Causal Inference & A/B Testing

Almost every decision worth making is a causal one. Will this policy reduce harm? Did that change improve the outcome? Would the result have been different if we'd acted? Yet the data we have is overwhelmingly correlational — it tells us what went together, not what caused what. Causal inference is the discipline of bridging that gap: getting from "these two things move together" to "this one made that one happen", and being honest about how much confidence the bridge can bear.

It's the question I care about most in government-analyst work, because the alternative — mistaking a coincidence for an effect — leads to acting on things that don't work and crediting interventions for changes they didn't cause. This page is the toolkit, from the gold-standard experiment to the methods you reach for when you can't run one.

Correlation isn't enough

The famous warning, that correlation does not imply causation, is true but usually under-explained. When two things, $X$ and $Y$, move together, there are several possible explanations, and only one is the one you want:

$X$ causes $Y$ (what you hope).
$Y$ causes $X$ (reverse causation).
Some third thing $Z$ causes both (a confounder — the classic "ice-cream sales and drownings both rise with temperature").
It's coincidence (especially with small samples or many comparisons).

The whole field is machinery for ruling out the second, third, and fourth so you're left with the first. The cleanest way to do that is to intervene — and that's where experiments come in.
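The confounder case can be made concrete with a small simulation. This is a minimal sketch using made-up numbers: temperature plays the role of $Z$, and ice-cream sales ($X$) and drownings ($Y$) each depend on temperature only, never on each other. The helper names `pearson` and `residuals` are illustrative, not from any library.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, zs):
    """What is left of y after removing a straight-line fit on z."""
    n = len(ys)
    my, mz = sum(ys) / n, sum(zs) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for y, z in zip(ys, zs)]


rng = random.Random(0)
n = 5000

# Hypothetical data: Z drives both X and Y; X has no effect on Y at all.
temperature = [rng.gauss(20, 5) for _ in range(n)]              # Z
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]    # X
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]    # Y

# Raw association: strongly positive, purely because of the shared cause.
raw_corr = pearson(ice_cream, drownings)

# Association after removing the part of X and Y explained by Z.
adjusted_corr = pearson(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))

print(f"raw correlation:      {raw_corr:.2f}")
print(f"adjusted correlation: {adjusted_corr:.2f}")
```

Under these assumptions the raw correlation is large while the adjusted one sits near zero, which is the signature of a confounder rather than an effect. The adjustment only works here because $Z$ was measured.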

The counterfactual: what would have happened

The modern way to define a causal effect is the potential outcomes framework. For a unit (a person, a region, a case), imagine two parallel worlds: one where it receives the treatment, with outcome $Y(1)$, and one where it doesn't, with outcome $Y(0)$. The causal effect for that unit is the difference:

$$\tau_i = Y_i(1) - Y_i(0)$$

Here's the catch, and it has a grand name: the fundamental problem of causal inference. For any single unit you only ever observe one of those two worlds: the person either got the treatment or didn't. The other outcome, the counterfactual, is forever missing, so you can never measure an individual effect directly. What you can estimate is an average effect, by comparing a treated group with a control group that is comparable in every other respect.
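It can help to write the group-level version in the same notation. The symbol $D_i$ (1 if unit $i$ was treated, 0 otherwise) is introduced here for illustration and is not used elsewhere on this page. The average treatment effect is

$$\tau_{\text{ATE}} = \mathbb{E}[\,Y_i(1) - Y_i(0)\,]$$

and the naive comparison of observed group averages splits into two pieces:

$$\mathbb{E}[Y_i \mid D_i = 1] - \mathbb{E}[Y_i \mid D_i = 0] = \underbrace{\mathbb{E}[\,Y_i(1) - Y_i(0) \mid D_i = 1\,]}_{\text{effect on the treated}} + \underbrace{\mathbb{E}[\,Y_i(0) \mid D_i = 1\,] - \mathbb{E}[\,Y_i(0) \mid D_i = 0\,]}_{\text{selection bias}}$$

The second term is the gap between the two groups that would exist even with no treatment. A comparison is only causal when that term is zero, which is what "comparable groups" means in practice.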

The gold standard: randomise

How do you make two groups comparable in everything — including things you didn't measure or never thought of? You can't match them by hand on infinite variables. But there is one almost magical trick: assign the treatment at random. This is the randomised controlled trial (RCT).

Randomisation works because, with enough units, it makes the treatment and control groups statistically identical on average — same age mix, same prior behaviour, same everything, measured or not. Any confounder is balanced across both groups by chance, so the only systematic difference left is the treatment itself. That's why the simple difference in group averages becomes a credible causal estimate:

$$\hat{\tau} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}$$

Randomisation is the only method that handles unknown confounders for free. Every observational method below is, in essence, an attempt to approximate what randomisation gives you automatically.
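A simulation makes the contrast visible. The sketch below invents a hidden trait (called `motivation` here, purely as an illustration) that raises the outcome and, in the self-selected case, also makes units more likely to take the treatment. The true effect is fixed at 2.0 for every unit, so any estimate far from 2.0 is bias.

```python
import random

rng = random.Random(42)
TRUE_EFFECT = 2.0
n = 20000


def mean(values):
    return sum(values) / len(values)


def diff_in_means(outcomes, treated):
    """Difference in average outcome: treated minus control."""
    t = [y for y, d in zip(outcomes, treated) if d]
    c = [y for y, d in zip(outcomes, treated) if not d]
    return mean(t) - mean(c)


# Unmeasured confounder and both potential outcomes for every unit.
motivation = [rng.gauss(0, 1) for _ in range(n)]
y0 = [10 + 3 * m + rng.gauss(0, 1) for m in motivation]   # Y(0)
y1 = [y + TRUE_EFFECT for y in y0]                        # Y(1)

# Two ways of deciding who gets treated.
self_selected = [m + rng.gauss(0, 1) > 0 for m in motivation]  # motivated units opt in
randomised = [rng.random() < 0.5 for _ in range(n)]            # coin flip


def observe(treated):
    """Only one potential outcome per unit is ever seen."""
    return [a if d else b for a, b, d in zip(y1, y0, treated)]


naive_estimate = diff_in_means(observe(self_selected), self_selected)
rct_estimate = diff_in_means(observe(randomised), randomised)

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"self-selected estimate: {naive_estimate:.2f}")
print(f"randomised estimate:    {rct_estimate:.2f}")
```

The analysis code is identical in both cases and never looks at `motivation`. Only the assignment rule changes, and that alone decides whether the difference in means lands near the true effect.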

A/B testing: the RCT in the wild

An A/B test is just an RCT run on a product or process: split users at random into A (control) and B (treatment), show each group a different version, and compare a chosen metric. It's the workhorse of evidence-based decisions — and getting it right is more subtle than "ship it and check":

Power and sample size first. Decide before you start how big an effect you care about and how many units you need to detect it (the statistical power calculation). Underpowered tests are likely to miss real effects, which wastes the experiment.
Don't peek. Repeatedly checking results and stopping the moment they look significant inflates false positives badly — every peek is another roll of the dice. Fix the sample size (or use a proper sequential-testing method) and wait.
One change, one metric. Define the primary metric up front. Testing twenty metrics and celebrating whichever turns significant is just multiple comparisons in disguise.
Check the randomisation held. Sanity-check that the groups really are balanced on known covariates, and watch for leakage (users in both arms, network spillover between them).
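Two of these points lend themselves to a quick sketch: the up-front sample-size calculation and the cost of peeking. The code below assumes a binary metric (for example, conversion yes/no) and uses the standard normal-approximation formulas for a two-proportion z-test. The function names, the 10% baseline rate, and the simulation settings are illustrative choices, not recommendations.

```python
import random
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate units per arm for a two-sided two-proportion z-test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for a difference in proportions (pooled standard error)."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation at all, so nothing to test
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (successes_b / n_b - successes_a / n_a) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def simulate_peeking(n_experiments=400, n_per_arm=1000, n_peeks=10,
                     base_rate=0.10, alpha=0.05, seed=1):
    """A/A simulation: both arms share base_rate, so every 'win' is a false positive."""
    rng = random.Random(seed)
    step = n_per_arm // n_peeks
    hits_any_peek = 0
    hits_final_only = 0
    for _ in range(n_experiments):
        conv_a = conv_b = 0
        significant_at_some_peek = False
        p_value = 1.0
        for peek in range(1, n_peeks + 1):
            conv_a += sum(rng.random() < base_rate for _ in range(step))
            conv_b += sum(rng.random() < base_rate for _ in range(step))
            seen = peek * step
            p_value = two_proportion_p_value(conv_a, seen, conv_b, seen)
            if p_value < alpha:
                significant_at_some_peek = True
        hits_any_peek += int(significant_at_some_peek)
        hits_final_only += int(p_value < alpha)  # p_value from the last look
    return hits_any_peek / n_experiments, hits_final_only / n_experiments


# Units per arm needed to detect a lift from 10% to 12% at 80% power.
n_needed = sample_size_per_arm(0.10, 0.12)

# False-positive rate when stopping at any significant peek vs. one final look.
peeking_rate, fixed_rate = simulate_peeking()

print(f"units needed per arm: {n_needed}")
print(f"false positives with peeking:     {peeking_rate:.1%}")
print(f"false positives, single analysis: {fixed_rate:.1%}")
```

Because the simulated arms are identical, the single-analysis rate should sit near the nominal 5%, while the stop-at-first-significance rate comes out noticeably higher. A proper sequential method exists to spend that error budget deliberately; this sketch only shows what happens without one.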

Confounders, colliders & DAGs

When you can't randomise, you have to reason explicitly about which variables to adjust for — and the surprise is that adjusting for the wrong one makes things worse. A causal diagram (a DAG — directed acyclic graph) draws each variable as a node and each causal arrow between them, making the structure visible.

A confounder ($Z$) is a common cause: it sits upstream of both treatment and outcome. Leave it unadjusted and it fakes an effect; adjust for it and the bias is removed. A collider ($C$) is a common effect: it sits downstream of both, and adjusting for it creates a spurious association that wasn't there. The two look similar and demand opposite handling, which is exactly why drawing the diagram first beats blindly "controlling for everything".
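The collider half of this is the less intuitive one, so here is a minimal simulation of it. Treatment and outcome are generated independently, and a third variable is built as their common effect. Conditioning on that variable (here by keeping only units where it is high, which is one form of "adjusting") manufactures an association. All names and numbers are made up for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(7)
n = 20000

# Treatment and outcome are independent by construction: no causal link.
treatment = [rng.gauss(0, 1) for _ in range(n)]
outcome = [rng.gauss(0, 1) for _ in range(n)]

# The collider is caused by both of them.
collider = [t + o + rng.gauss(0, 0.5) for t, o in zip(treatment, outcome)]

corr_all = pearson(treatment, outcome)

# Condition on the collider by keeping only units where it is high.
kept = [(t, o) for t, o, c in zip(treatment, outcome, collider) if c > 1.0]
corr_selected = pearson([t for t, _ in kept], [o for _, o in kept])

print(f"correlation, everyone:            {corr_all:.2f}")
print(f"correlation, collider > 1.0 only: {corr_selected:.2f}")
```

In the full data the correlation is near zero, as it should be. Within the selected slice it turns clearly negative: among units with a high collider value, a low treatment value has to be offset by a high outcome value, and vice versa.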

When you can't randomise

Often randomising is impossible or unethical: you can't randomly assign a policy, a major life event, or who gets investigated. Quasi-experimental methods exploit natural variation to mimic an experiment. The main ones, starting with matching, which rests on the strongest assumption (that every confounder was measured):

Matching / regression adjustment — build a comparison group that looks like the treated group on observed variables (propensity-score matching is the common flavour). Only as good as the confounders you measured.
Difference-in-differences — compare the change over time in a treated group against the change in an untreated group. If both groups would have moved in parallel without the treatment, the extra movement is the effect. Cancels out anything fixed about each group.
Instrumental variables — find a variable that nudges treatment but affects the outcome only through it, and use it to isolate causal variation.
Regression discontinuity — when treatment switches at a sharp threshold (a cutoff score, an age limit), units just either side are near-identical, so comparing them approximates a local experiment.
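Of these, difference-in-differences is the easiest to show as arithmetic. The sketch below uses four hypothetical group averages (invented for illustration) and compares the result with a plain before-after reading of the treated group.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical average outcomes for two areas, before and after a policy.
treated_before, treated_after = 50.0, 58.0   # area that got the policy
control_before, control_after = 40.0, 43.0   # comparable area that did not

# Before-after on the treated area alone credits the policy with everything.
before_after_estimate = treated_after - treated_before           # 8.0

# Difference-in-differences subtracts the background trend (3.0 here).
did_estimate = difference_in_differences(
    treated_before, treated_after, control_before, control_after
)                                                                 # 5.0

print(f"before-after estimate: {before_after_estimate}")
print(f"diff-in-diff estimate: {did_estimate}")
```

The two areas start at different levels (50 vs 40), and that is fine: the fixed gap drops out. What the estimate cannot survive is the areas trending differently for reasons unrelated to the policy, which is the parallel-trends assumption stated above.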

Traps that fake causation

Even careful analysts get fooled. The recurring traps:

Simpson's paradox — a trend that appears in every subgroup can reverse when the groups are combined (or vice versa). Aggregation can flip the sign of an effect, so always ask whether a lurking variable is splitting the data.
Selection bias — when who ends up in your data is related to the outcome (only successful cases get recorded, only certain people respond). The sample no longer represents the population, and effects get manufactured.
Regression to the mean — extreme values tend to be followed by less extreme ones for no causal reason. Act after a spike and the natural settling looks like your intervention worked.
p-hacking: slicing, re-testing, and trying specifications until something crosses significance. This is the multiple-comparisons problem again: run enough tests and a "finding" that's pure noise is all but guaranteed. Pre-register the question.
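Simpson's paradox is easier to believe once you have seen the numbers do it. The counts below are invented for illustration: two hypothetical programs, A and B, each applied to low-risk and high-risk cases, with (successes, total) recorded for each cell.

```python
# Hypothetical (successes, total) counts per program and risk group.
counts = {
    "A": {"low_risk": (90, 100), "high_risk": (120, 400)},
    "B": {"low_risk": (340, 400), "high_risk": (25, 100)},
}


def rate(successes, total):
    return successes / total


def overall_rate(groups):
    """Success rate after pooling all subgroups together."""
    successes = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return successes / total


for group in ("low_risk", "high_risk"):
    a = rate(*counts["A"][group])
    b = rate(*counts["B"][group])
    print(f"{group:>9}: A = {a:.2f}, B = {b:.2f}")
# low_risk:  A = 0.90, B = 0.85
# high_risk: A = 0.30, B = 0.25

print(f"  overall: A = {overall_rate(counts['A']):.2f}, "
      f"B = {overall_rate(counts['B']):.2f}")
# overall:   A = 0.42, B = 0.73

a_wins_every_subgroup = all(
    rate(*counts["A"][g]) > rate(*counts["B"][g]) for g in counts["A"]
)
b_wins_overall = overall_rate(counts["B"]) > overall_rate(counts["A"])
```

Program A does better within each risk group yet looks far worse overall, because it was mostly given the high-risk cases. Risk level is the lurking variable: it influences both which program a case gets and how the case turns out.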

Where it shows up in my work

Did the intervention move the needle?

In government-analyst work the causal question is the one that matters: did a policy, program, or intervention actually change the outcome — or would it have changed anyway? You rarely get to randomise a policy, so the craft is reaching honestly for the right quasi-experimental tool — a difference-in-differences against a comparable area, a regression discontinuity at an eligibility cutoff — and being clear about the assumption it rests on, rather than letting a before-after correlation masquerade as proof.

It also keeps me honest about the traps: a drop after an intervention might be regression to the mean, a subgroup pattern might be Simpson's paradox, and a confident effect might vanish once the confounder is drawn into the picture. Getting this right is the difference between advice that holds up and advice that just sounds data-driven.

Refresh in 60 seconds

Causal inference gets from "they move together" to "this caused that", ruling out reverse causation, confounding, and coincidence.
An effect is $Y(1) - Y(0)$ — but you only ever see one world per unit (the fundamental problem). So estimate an average from a comparable treated vs control group.
Randomisation (RCT / A/B test) is the gold standard — it balances unknown confounders for free. A/B tips: power up front, don't peek, one primary metric, check balance.
Draw a DAG: adjust for confounders (common causes), never for colliders (common effects — adjusting fakes an association).
Can't randomise? Matching, difference-in-differences, instrumental variables, regression discontinuity — weaker, assumption-dependent approximations of an experiment.
Watch the traps: Simpson's paradox, selection bias, regression to the mean, p-hacking.

The internal-validity spectrum (RCT → RDD/DiD → matching) and A/B pitfalls (peeking, power, multiple metrics) reflect current causal-inference and experimentation references alongside coursework.