Differential Privacy
You can publish useful statistics about a population without exposing any individual in it — but only with a precise, mathematical definition of privacy. Stripping names was never enough; this is what actually works.
- Studied
- Differential Privacy & Privacy-Preserving Analysis
- In practice · sharing data safely
- When
- Gov analysis · ongoing
- Applied in
- Releasing stats responsibly
- Read / Refreshed
- ~15 min read · 2026-06-26
A core tension runs through any work with data about people: you want to publish something useful — counts, averages, trends — without revealing anything about a single individual in the data. For decades the answer was "anonymise it": strip the names and release the rest. We now know, conclusively, that anonymisation doesn't work — and differential privacy (DP) is the rigorous, mathematical replacement, the first definition of privacy that actually holds up against a determined attacker.
It's a topic I care about directly, because publishing aggregate statistics responsibly — the bread and butter of a government analyst — is exactly what DP is built for. This page is the practical idea: why the old approach failed, what "differentially private" precisely means, the noise mechanism that delivers it, and the trade-off you can't escape.
01
Anonymous isn't anonymous
The fatal flaw in "just remove the identifiers" is re-identification by linkage. Even without names, the combination of a few seemingly innocuous fields — a quasi-identifier like postcode + birth date + sex — is often unique to one person, and can be matched against a public dataset to put the name back.
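A linkage attack is little more than a join on the quasi-identifiers. The sketch below is illustrative only: both tables, every name, and the field names are made up, and the `link` helper is a hypothetical function, not part of any library.

```python
from collections import Counter

# Hypothetical "anonymised" release: names removed, quasi-identifiers kept.
released = [
    {"postcode": "2600", "birth_date": "1975-03-02", "sex": "F", "diagnosis": "asthma"},
    {"postcode": "2600", "birth_date": "1981-11-19", "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "2612", "birth_date": "1975-03-02", "sex": "F", "diagnosis": "hypertension"},
]

# Hypothetical public register that still carries names.
public = [
    {"name": "A. Example", "postcode": "2600", "birth_date": "1975-03-02", "sex": "F"},
    {"name": "B. Sample", "postcode": "2600", "birth_date": "1981-11-19", "sex": "M"},
    {"name": "C. Person", "postcode": "2905", "birth_date": "1990-07-30", "sex": "F"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_date", "sex")


def qi_key(row):
    """Project a row onto its quasi-identifier combination."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def link(released_rows, public_rows):
    """Re-attach names where the quasi-identifier combination is unique in both tables."""
    released_counts = Counter(qi_key(r) for r in released_rows)
    public_counts = Counter(qi_key(p) for p in public_rows)
    names = {qi_key(p): p["name"] for p in public_rows if public_counts[qi_key(p)] == 1}
    matches = []
    for row in released_rows:
        key = qi_key(row)
        if released_counts[key] == 1 and key in names:
            matches.append((names[key], row["diagnosis"]))
    return matches


reidentified = link(released, public)
print(reidentified)
```

Two of the three "anonymous" records get their names back, even though no name was ever released.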
The cautionary cases are famous: Latanya Sweeney re-identified a state governor's medical record from "anonymised" hospital data using just those three fields; researchers de-anonymised Netflix's released ratings by matching them to public IMDb reviews; AOL's "anonymised" search logs were traced to real people. The lesson is brutal and general: you cannot anonymise rich data by redaction, because the data itself fingerprints people. A fundamentally different approach is needed.
02
Why k-anonymity falls short
The first serious attempt was k-anonymity: generalise or suppress quasi-identifiers until every record is indistinguishable from at least k−1 others (so no one stands alone). It's intuitive and it helps, but it has real holes. If everyone in a k-anonymous group shares the same sensitive value (say, all have the same diagnosis), you learn that value about everyone in the group without singling anyone out (the homogeneity attack). And its guarantee evaporates against an attacker with side information you didn't anticipate.
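To make the homogeneity hole concrete, here is a minimal sketch that checks a table for k-anonymity and then lists the groups that leak their sensitive value anyway. The table, the column names, and the helper functions are all hypothetical.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return groups


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier group contains at least k rows."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def homogeneous_groups(rows, quasi_identifiers, sensitive):
    """Groups where every member shares one sensitive value (homogeneity attack)."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return [
        key
        for key, group in groups.items()
        if len({row[sensitive] for row in group}) == 1
    ]


# Hypothetical generalised table: postcodes truncated, ages banded.
table = [
    {"postcode": "26**", "age_band": "40-49", "diagnosis": "diabetes"},
    {"postcode": "26**", "age_band": "40-49", "diagnosis": "diabetes"},
    {"postcode": "26**", "age_band": "40-49", "diagnosis": "diabetes"},
    {"postcode": "29**", "age_band": "30-39", "diagnosis": "asthma"},
    {"postcode": "29**", "age_band": "30-39", "diagnosis": "flu"},
    {"postcode": "29**", "age_band": "30-39", "diagnosis": "asthma"},
]
QI = ("postcode", "age_band")

print(is_k_anonymous(table, QI, k=3))
print(homogeneous_groups(table, QI, "diagnosis"))
```

The table passes the 3-anonymity check, yet anyone known to be in the first group is known to have diabetes.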
The deeper problem is that k-anonymity is a property of the released table, and reasons about the attacks you thought of. What you want instead is a guarantee about the process that holds against any attacker with any side knowledge — which is exactly what DP provides.
03
The differential idea
Differential privacy reframes the question entirely. Instead of "is this output anonymous?", it asks: does the output change perceptibly depending on whether any single person is in the dataset or not? If a query's result is essentially the same whether you're included or excluded, then the result can't be revealing much about you specifically — your presence is undetectable.
04
Epsilon & the privacy budget
The formal definition makes "barely changes" precise. A mechanism M is ε-differentially private if, for any two datasets D and D′ differing in a single person's record, and any set of possible outputs S:
$$\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S]$$
The parameter ε (epsilon) is the privacy budget, and it's the dial that governs everything. A small ε means the two probabilities must be nearly equal: strong privacy, because adding or removing a person barely changes the output distribution. A large ε permits bigger differences, which means weaker privacy. It's a genuine budget: every query you answer about the data spends some of it, and once it's gone, further queries would erode the guarantee, so you must ration it across everything you publish.
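The rationing can be pictured as a simple ledger. The sketch below assumes basic sequential composition, where the ε values of successive releases simply add. The `PrivacyBudget` class is a hypothetical illustration, not an API from any DP library, and the totals are arbitrary.

```python
class PrivacyBudget:
    """Toy ledger for a total epsilon, assuming spends add up (sequential composition)."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total - self.spent

    def spend(self, epsilon):
        """Record a release costing `epsilon`; refuse if it would exceed the total."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:  # small tolerance for float rounding
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.remaining


budget = PrivacyBudget(total_epsilon=1.0)  # arbitrary illustrative total
budget.spend(0.5)    # first published table
budget.spend(0.25)   # second published table
print(budget.remaining)
```

A further `budget.spend(0.5)` would raise, which is the point: the ledger, not the analyst's optimism, decides whether another release is allowed.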
05
How it's done: calibrated noise
How do you make a query satisfy that definition? You add carefully calibrated random noise to the answer. Want to release a count? Compute it, then add a random draw from a Laplace (or Gaussian) distribution before publishing.
The amount of noise is tuned to two things: the privacy budget ε, and the query's sensitivity, meaning how much one person could change the result (one person changes a count by at most 1, so a count needs little noise; a sum of incomes can swing a lot, so it needs more). For the Laplace mechanism the noise scale is the sensitivity divided by ε. The magic is that the noise is large enough to mask any single individual's contribution, yet small relative to an aggregate over a large dataset, so the published figure stays accurate. Crucially, DP also composes: the ε values of successive queries on the same data add up, which is what makes the budget bookkeeping work.
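A minimal sketch of the Laplace mechanism, using only the standard library. The function names are hypothetical, and the clipping bounds in `private_sum` are an added assumption: a sum only has finite sensitivity if each person's contribution is bounded first. Production systems also have to handle floating-point subtleties that this toy version ignores.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw from Laplace(0, scale): the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng=random):
    """Release a count. Adding or removing one person changes it by at most 1."""
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def private_sum(values, lower, upper, epsilon, rng=random):
    """Release a sum after clipping each value to [lower, upper] (assumed bounds)."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Adding or removing one clipped record moves the sum by at most this much.
    sensitivity = max(abs(lower), abs(upper))
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)  # seeded only so the illustration is repeatable
print(private_count(1000, epsilon=1.0, rng=rng))
print(private_sum([42_000, 58_000, 61_500], lower=0, upper=200_000, epsilon=1.0, rng=rng))
```

The count gets noise of scale 1, while the income sum gets noise of scale 200,000 at the same ε, which is the sensitivity difference described above.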
06
The privacy-utility trade-off
There's no free lunch, and DP is refreshingly honest about it: more privacy means more noise, which means less accuracy. Push ε down for strong privacy and your published numbers get noisier and less useful; raise it for accurate numbers and you weaken the protection. This privacy-utility trade-off is the central, unavoidable tension of the whole field.
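The trade-off can be put in numbers. Assuming Laplace noise on a count (sensitivity 1), the scale is 1/ε, the expected absolute error equals the scale, and 95% of draws fall within about three scales (scale × ln 20). The helper name and the ε values below are illustrative choices, not recommendations.

```python
import math


def laplace_error_profile(sensitivity, epsilon):
    """Summarise how much error Laplace noise adds for a given sensitivity and epsilon."""
    scale = sensitivity / epsilon
    return {
        "scale": scale,
        "expected_abs_error": scale,          # E|X| = scale for Laplace(0, scale)
        "std_dev": math.sqrt(2) * scale,      # variance is 2 * scale**2
        "error_bound_95": scale * math.log(20),  # P(|X| > t) = exp(-t / scale)
    }


for eps in (0.01, 0.1, 1.0, 10.0):
    profile = laplace_error_profile(sensitivity=1.0, epsilon=eps)
    print(f"epsilon={eps:<5} 95% of errors within +/-{profile['error_bound_95']:.1f}")
```

The same absolute error is negligible on a count of 100,000 and ruinous on a count of 12, so small cells are where the trade-off hurts most.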
What DP gives you is not an escape from the trade-off but the ability to quantify and choose it explicitly: to set ε as a deliberate, defensible policy decision rather than crossing your fingers. The US Census Bureau adopted DP for the 2020 census (with a sizeable epsilon, itself a public, debated choice), and Apple and Google use it to gather usage statistics without collecting individuals' raw behaviour.
07
Where the noise goes: local vs global
There are two places to add the noise, and the choice reflects who you trust:
- Global (central) DP — a trusted curator holds the real data, runs the query, and adds noise to the output. Less noise for the same privacy (more accurate), but you must trust the curator with the raw data.
- Local DP — each person's data is randomised before it ever leaves their device, so even the collector never sees the truth. The toy intuition is randomised response: to survey a sensitive yes/no question, each respondent secretly flips a coin and sometimes answers randomly — individuals are deniable, yet the true proportion is recoverable in aggregate. Stronger trust model, but it needs much more noise. (This is what Apple/Google use.)
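Randomised response is small enough to simulate. The sketch below assumes the classic two-coin variant: the first flip decides whether to answer truthfully, and otherwise a second flip supplies the answer. The population size and the 30% true rate are made up for the illustration.

```python
import math
import random


def randomised_response(truth, rng):
    """Answer truthfully on heads; on tails, answer with a second fair coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_proportion(reports):
    """Invert P(report yes) = 0.5 * p + 0.25 to recover the true proportion p."""
    observed = sum(reports) / len(reports)
    return 2.0 * (observed - 0.25)


rng = random.Random(7)  # seeded only so the illustration is repeatable
true_answers = [i < 3000 for i in range(10_000)]  # hypothetical: 30% true "yes"
reports = [randomised_response(t, rng) for t in true_answers]

estimate = estimate_true_proportion(reports)
# A "yes" report is 3x likelier from a true "yes" (0.75) than from a true "no" (0.25).
epsilon = math.log(0.75 / 0.25)

print(f"estimated proportion: {estimate:.3f}")
print(f"per-respondent epsilon: {epsilon:.3f}")
```

No single report can be taken at face value, yet the aggregate estimate lands close to the true rate. The price is extra variance compared with a central count, which is why local DP needs more respondents for the same accuracy.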
08
Where it shows up in my work
09
Refresh in 60 seconds
The re-identification cases, the ε/budget definition, the noise mechanism, and the local-vs-global distinction reflect current differential-privacy references alongside hands-on work.