Differential Privacy

You can publish useful statistics about a population without exposing any individual in it — but only with a precise, mathematical definition of privacy. Stripping names was never enough; this is what actually works.

Studied: Differential Privacy & Privacy-Preserving Analysis (in practice · sharing data safely)
When: Gov analysis · ongoing
Applied in: Releasing stats responsibly
Read / Refreshed: ~15 min read · 2026-06-26

A core tension runs through any work with data about people: you want to publish something useful (counts, averages, trends) without revealing anything about a single individual in the data. For decades the answer was "anonymise it": strip the names and release the rest. We now know that this kind of redaction does not protect rich data. Differential privacy (DP) is the rigorous, mathematical replacement: a definition of privacy whose guarantee holds against an attacker with arbitrary side knowledge.

It's a topic I care about directly, because publishing aggregate statistics responsibly — the bread and butter of a government analyst — is exactly what DP is built for. This page is the practical idea: why the old approach failed, what "differentially private" precisely means, the noise mechanism that delivers it, and the trade-off you can't escape.

01

Anonymous isn't anonymous

The fatal flaw in "just remove the identifiers" is re-identification by linkage. Even without names, the combination of a few seemingly innocuous fields — a quasi-identifier like postcode + birth date + sex — is often unique to one person, and can be matched against a public dataset to put the name back.
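The linkage step can be sketched in a few lines of Python. Everything here is made up for illustration: the records, the names, the `link` helper and the choice of a public register are assumptions, not data from any real release.

```python
from collections import Counter

# Hypothetical "anonymised" release: names removed, quasi-identifiers kept.
released = [
    {"postcode": "2600", "birth_date": "1961-07-31", "sex": "M", "diagnosis": "asthma"},
    {"postcode": "2600", "birth_date": "1984-02-11", "sex": "F", "diagnosis": "diabetes"},
    {"postcode": "2612", "birth_date": "1984-02-11", "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public register with names attached.
public = [
    {"name": "A. Example", "postcode": "2600", "birth_date": "1961-07-31", "sex": "M"},
    {"name": "B. Sample", "postcode": "2612", "birth_date": "1984-02-11", "sex": "F"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_date", "sex")


def key(record):
    """The quasi-identifier combination used as a join key."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(released_rows, public_rows):
    """Return (name, diagnosis) for each public record whose key is unique in the release."""
    counts = Counter(key(row) for row in released_rows)
    by_key = {key(row): row for row in released_rows}
    matches = []
    for person in public_rows:
        k = key(person)
        if counts[k] == 1:  # exactly one released record fits this person
            matches.append((person["name"], by_key[k]["diagnosis"]))
    return matches


reidentified = link(released, public)
print(reidentified)
```

Both public records match exactly one released row, so both names are reattached to a diagnosis even though the release contains no names.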

The cautionary cases are famous: Latanya Sweeney re-identified the Massachusetts governor's medical record in "anonymised" hospital data by linking those three fields to the public voter roll; researchers de-anonymised Netflix's released ratings by matching them to public IMDb reviews; AOL's "anonymised" search logs were traced to real people. The lesson is brutal and general: you cannot anonymise rich data by redaction, because the data itself fingerprints people. A fundamentally different approach is needed.

02

Why k-anonymity falls short

The first serious attempt was k-anonymity: generalise or suppress quasi-identifiers until every record is indistinguishable from at least $k-1$ others (so no one stands alone). It's intuitive and helps, but it has real holes. If everyone in a k-anonymous group shares the same sensitive value (say, all have the same diagnosis), you learn that value about everyone in the group without singling anyone out (the homogeneity attack). And its guarantee evaporates against an attacker with side information you didn't anticipate.
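As a rough illustration, the sketch below measures k for a small generalised table and then flags groups exposed to the homogeneity attack. The table, the column names and the helper functions are hypothetical.

```python
from collections import defaultdict


def group_by_quasi_identifiers(rows, quasi_identifiers):
    """Bucket rows by their (generalised) quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return groups


def k_of(rows, quasi_identifiers):
    """Largest k for which the table is k-anonymous: the smallest group size."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len(group) for group in groups.values())


def homogeneous_groups(rows, quasi_identifiers, sensitive):
    """Groups in which every member shares one sensitive value."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return [
        group_key
        for group_key, group in groups.items()
        if len({row[sensitive] for row in group}) == 1
    ]


# Hypothetical table, already generalised (postcode masked, age banded).
table = [
    {"postcode": "26**", "age_band": "30-39", "diagnosis": "diabetes"},
    {"postcode": "26**", "age_band": "30-39", "diagnosis": "diabetes"},
    {"postcode": "26**", "age_band": "30-39", "diagnosis": "diabetes"},
    {"postcode": "26**", "age_band": "40-49", "diagnosis": "asthma"},
    {"postcode": "26**", "age_band": "40-49", "diagnosis": "migraine"},
    {"postcode": "26**", "age_band": "40-49", "diagnosis": "asthma"},
]

qi = ("postcode", "age_band")
k = k_of(table, qi)
leaky = homogeneous_groups(table, qi, "diagnosis")
print(k, leaky)
```

The table is 3-anonymous, yet anyone known to be in the 30-39 group is revealed to have diabetes, which is exactly the hole described above.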

The deeper problem is that k-anonymity is a property of the released table, and reasons about the attacks you thought of. What you want instead is a guarantee about the process that holds against any attacker with any side knowledge — which is exactly what DP provides.

03

The differential idea

Differential privacy reframes the question entirely. Instead of "is this output anonymous?", it asks: does the output change perceptibly depending on whether any single person is in the dataset or not? If a query's result is essentially the same whether you're included or excluded, then the result can't be revealing much about you specifically — your presence is undetectable.

04

Epsilon & the privacy budget

The formal definition makes "barely changes" precise. A mechanism $M$ is $\varepsilon$-differentially private if, for any two datasets $D$ and $D'$ differing in a single person's record, and any set of possible outputs $S$:

$$\Pr[M(D) \in S] \;\leq\; e^{\varepsilon} \cdot \Pr[M(D') \in S]$$

The parameter $\varepsilon$ (epsilon) is the privacy budget, and it's the dial that governs everything. A small $\varepsilon$ means the two probabilities must be nearly equal: strong privacy, because adding or removing a person barely changes the output distribution. A large $\varepsilon$ permits bigger differences, which means weaker privacy. It's a genuine budget: every query you answer about the data spends some of it, and once it's gone, further queries would erode the guarantee, so you must ration it across everything you publish.
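The rationing can be pictured as simple bookkeeping. This minimal sketch assumes basic sequential composition, where the epsilons of successive queries on the same data add up. The `PrivacyBudget` class is a hypothetical helper, not part of any particular library.

```python
class PrivacyBudget:
    """Tracks epsilon spent across queries (assumes epsilons simply add)."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total - self.spent

    def spend(self, epsilon):
        """Reserve epsilon for one query, or refuse if it would overspend."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:  # small tolerance for float rounding
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.5)  # first published table
budget.spend(0.3)  # second published table

refused = False
try:
    budget.spend(0.5)  # would take the total past 1.0
except RuntimeError:
    refused = True

print(budget.remaining, refused)
```

The third request is refused rather than silently answered, which is the behaviour a publishing pipeline would want once the agreed budget is used up.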

05

How it's done: calibrated noise

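How do you make a query satisfy that definition? You add carefully calibrated random noise to the answer. Want to release a count? Compute it, then add a random draw from a Laplace distribution before publishing. (Gaussian noise is also common; it satisfies a slightly relaxed $(\varepsilon, \delta)$ variant of the definition rather than the pure $\varepsilon$ one.)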

[Figure: data → query → true answer → + noise(ε) → noisy release (safe to publish)]
The differential-privacy mechanism. The true answer is computed, then deliberately blurred with random noise calibrated to the privacy budget ε before release. The noisy answer stays useful in aggregate while hiding any one person's contribution.

The amount of noise is tuned to two things: the privacy budget $\varepsilon$, and the query's sensitivity, meaning how much one person could change the result. One person changes a count by at most 1, so a count needs little noise; a sum of incomes can swing a lot, so it needs more. For the Laplace mechanism the noise scale is the sensitivity divided by $\varepsilon$. That is large enough to mask any single individual's contribution, yet it does not grow with the number of records, so on a large dataset it is small relative to the true answer and the aggregate stays accurate. Crucially, DP also composes: the $\varepsilon$ values of successive queries add up predictably, which is what makes the budget bookkeeping work.
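A minimal sketch of the Laplace mechanism for a count, using only the standard library. The function names, the seed and the example numbers are assumptions for illustration. The sampler relies on the fact that the difference of two independent exponential draws is Laplace-distributed. Plain floating-point sampling like this is fine for building intuition but is not hardened for production releases.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_release(true_value, sensitivity, epsilon, rng):
    """Laplace mechanism: add noise with scale = sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
true_count = 10_000     # hypothetical true answer to a counting query

# Repeat the release many times purely to inspect the noise distribution;
# in real use each release would spend budget.
releases = [
    private_release(true_count, sensitivity=1, epsilon=0.5, rng=rng)
    for _ in range(20_000)
]
mean_release = sum(releases) / len(releases)
mean_abs_error = sum(abs(r - true_count) for r in releases) / len(releases)

print(round(mean_release, 1), round(mean_abs_error, 2))
```

With sensitivity 1 and epsilon 0.5 the noise scale is 2, so a typical release is off by about 2 on a count of 10,000: enough to hide one person, negligible for the aggregate.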

06

The privacy-utility trade-off

There's no free lunch, and DP is refreshingly honest about it: more privacy means more noise means less accuracy. Push $\varepsilon$ down for strong privacy and your published numbers get noisier and less useful; raise it for accurate numbers and you weaken the protection. This privacy-utility trade-off is the central, unavoidable tension of the whole field.
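To put rough numbers on the trade-off, the sketch below assumes Laplace noise on a counting query with sensitivity 1, for which the expected absolute error equals the noise scale, sensitivity divided by epsilon. The true count of 500 and the epsilon values are arbitrary illustrative choices.

```python
def expected_abs_error(sensitivity, epsilon):
    """Mean absolute error of Laplace noise: its scale, sensitivity / epsilon."""
    return sensitivity / epsilon


def tradeoff_table(true_count, epsilons, sensitivity=1.0):
    """Rows of (epsilon, expected absolute error, error relative to the count)."""
    rows = []
    for eps in epsilons:
        err = expected_abs_error(sensitivity, eps)
        rows.append((eps, err, err / true_count))
    return rows


table = tradeoff_table(true_count=500, epsilons=(0.01, 0.1, 1.0, 10.0))
for eps, err, rel in table:
    print(f"epsilon={eps:<5} expected |error|={err:>6.1f}  relative={rel:.2%}")
```

Each tenfold cut in epsilon buys stronger privacy at the price of tenfold more expected error, so the smallest epsilon here makes a count of 500 uncertain by about a fifth of its value.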

What DP gives you is not an escape from the trade-off but the ability to quantify and choose it explicitly: to set $\varepsilon$ as a deliberate, defensible policy decision rather than crossing your fingers. The US Census Bureau adopted DP for the 2020 census (with a sizeable epsilon, itself a public, debated choice), and Apple and Google use it to gather usage statistics without collecting individuals' raw behaviour.

07

Where the noise goes: local vs global

There are two places to add the noise, and the choice reflects who you trust:

  • Global (central) DP: a trusted curator holds the real data, runs the query, and adds noise to the output. Less noise for the same privacy (more accurate), but you must trust the curator with the raw data.
  • Local DP: each person's data is randomised before it ever leaves their device, so even the collector never sees the truth. The toy intuition is randomised response: to survey a sensitive yes/no question, each respondent secretly flips a coin and sometimes answers randomly, so individuals are deniable, yet the true proportion is recoverable in aggregate. It demands less trust (you need not trust the collector), but it needs much more noise for the same privacy. (This is what Apple/Google use.)
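Randomised response is small enough to simulate directly. The sketch below assumes the classic two-coin variant: on heads the respondent answers truthfully, on tails a second flip decides the answer. The population size, the 30% true rate and the function names are made up for illustration.

```python
import math
import random


def randomised_response(truth, rng):
    """First coin: heads means answer truthfully. Tails: a second coin decides."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_proportion(reports):
    """Invert P(report yes) = 0.5 * p + 0.25 to recover p."""
    observed = sum(reports) / len(reports)
    return 2.0 * (observed - 0.25)


rng = random.Random(42)  # fixed seed so the illustration is repeatable
n = 100_000
true_answers = [i < 30_000 for i in range(n)]  # hypothetical: 30% would say yes

reports = [randomised_response(answer, rng) for answer in true_answers]
estimate = estimate_true_proportion(reports)

# A "yes" report is three times likelier from a true yes (0.75) than from a
# true no (0.25), so this scheme satisfies the definition with epsilon = ln 3.
epsilon = math.log(0.75 / 0.25)

print(round(estimate, 3), round(epsilon, 3))
```

No single report can be taken at face value, yet the estimate lands close to the true 30% because the known coin bias can be subtracted out across many respondents.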

08

Where it shows up in my work

09

Refresh in 60 seconds

The re-identification cases, the ε/budget definition, the noise mechanism, and the local-vs-global distinction reflect current differential-privacy references alongside hands-on work.