Differential Privacy

A core tension runs through any work with data about people: you want to publish something useful (counts, averages, trends) without revealing anything about a single individual in the data. For decades the answer was "anonymise it": strip the names and release the rest. We now know that this kind of anonymisation fails on rich data. Differential privacy (DP) is the rigorous, mathematical replacement: a definition of privacy whose guarantee holds no matter what outside knowledge an attacker brings.

It's a topic I care about directly, because publishing aggregate statistics responsibly — the bread and butter of a government analyst — is exactly what DP is built for. This page is the practical idea: why the old approach failed, what "differentially private" precisely means, the noise mechanism that delivers it, and the trade-off you can't escape.

Anonymous isn't anonymous

The fatal flaw in "just remove the identifiers" is re-identification by linkage. Even without names, the combination of a few seemingly innocuous fields — a quasi-identifier like postcode + birth date + sex — is often unique to one person, and can be matched against a public dataset to put the name back.

The cautionary cases are famous: Latanya Sweeney re-identified a state governor's medical record from "anonymised" hospital data using just those three fields; researchers de-anonymised Netflix's released ratings by matching them to public IMDb reviews; AOL's "anonymised" search logs were traced to real people. The lesson is brutal and general: you cannot anonymise rich data by redaction, because the data itself fingerprints people. A fundamentally different approach is needed.
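To make the linkage attack concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the records, the field names, the public register and the `link` helper. It joins a "de-identified" release to a named public table on the quasi-identifier and reports what falls out.

```python
from collections import Counter

# Hypothetical "anonymised" release: names removed, quasi-identifiers kept.
released = [
    {"postcode": "2600", "dob": "1970-03-14", "sex": "F", "diagnosis": "asthma"},
    {"postcode": "2600", "dob": "1985-11-02", "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "2913", "dob": "1985-11-02", "sex": "M", "diagnosis": "migraine"},
]

# Hypothetical public dataset that still carries names.
public = [
    {"name": "Alex Example", "postcode": "2600", "dob": "1985-11-02", "sex": "M"},
    {"name": "Sam Sample", "postcode": "2913", "dob": "1990-01-01", "sex": "F"},
]

QUASI_ID = ("postcode", "dob", "sex")

def key(row):
    """The quasi-identifier tuple for one record."""
    return tuple(row[f] for f in QUASI_ID)

def link(released, public):
    """Return (name, diagnosis) pairs for quasi-identifiers that are unique in the release."""
    counts = Counter(key(r) for r in released)
    unique = {key(r): r for r in released if counts[key(r)] == 1}
    return [(p["name"], unique[key(p)]["diagnosis"]) for p in public if key(p) in unique]

matches = link(released, public)
print(matches)  # [('Alex Example', 'diabetes')]
```

No name was ever in the release, yet one join puts it back.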

Why k-anonymity falls short

The first serious attempt was k-anonymity: generalise or suppress quasi-identifiers until every record is indistinguishable from at least $k-1$ others (so no one stands alone). It's intuitive and helps — but it has real holes. If everyone in a k-anonymous group shares the same sensitive value (say, all have the same diagnosis), you learn that value about everyone in the group without singling anyone out (the homogeneity attack). And its guarantee evaporates against an attacker with side information you didn't anticipate.
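A small sketch of both ideas, assuming a toy table that has already been generalised. The table, the field names and the helpers `k_of` and `homogeneous_groups` are hypothetical, not from any library. It measures the table's k and then finds the groups exposed to the homogeneity attack.

```python
from collections import defaultdict

QUASI_ID = ("postcode", "age_band")

# Hypothetical generalised table: postcodes truncated, ages banded.
table = [
    {"postcode": "26**", "age_band": "30-39", "diagnosis": "flu"},
    {"postcode": "26**", "age_band": "30-39", "diagnosis": "asthma"},
    {"postcode": "29**", "age_band": "40-49", "diagnosis": "cancer"},
    {"postcode": "29**", "age_band": "40-49", "diagnosis": "cancer"},
]

def groups(rows):
    """Map each quasi-identifier combination to the sensitive values in that group."""
    out = defaultdict(list)
    for r in rows:
        out[tuple(r[f] for f in QUASI_ID)].append(r["diagnosis"])
    return out

def k_of(rows):
    """The largest k for which the table is k-anonymous: the smallest group size."""
    return min(len(vals) for vals in groups(rows).values())

def homogeneous_groups(rows):
    """Groups whose members all share a single sensitive value."""
    return [g for g, vals in groups(rows).items() if len(set(vals)) == 1]

print(k_of(table))                # 2
print(homogeneous_groups(table))  # [('29**', '40-49')]
```

The table is 2-anonymous, yet anyone known to be in the second group is known to have the listed diagnosis.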

The deeper problem is that k-anonymity is a property of the released table, so it only protects against the attacks you anticipated when building that table. What you want instead is a guarantee about the release process itself, one that holds against any attacker with any side knowledge, which is exactly what DP provides.

The differential idea

Differential privacy reframes the question entirely. Instead of "is this output anonymous?", it asks: does the output change perceptibly depending on whether any single person is in the dataset or not? If a query's result is essentially the same whether you're included or excluded, then the result can't be revealing much about you specifically — your presence is undetectable.

Epsilon & the privacy budget

The formal definition makes "barely changes" precise. A mechanism $M$ is $\varepsilon$-differentially private if, for any two datasets $D$ and $D'$ differing in a single person's record, and any set $S$ of possible outputs:

$$\Pr[M(D) \in S] \;\leq\; e^{\varepsilon} \cdot \Pr[M(D') \in S]$$

The parameter $\varepsilon$ (epsilon) is the privacy budget, and it's the dial that governs everything. A small $\varepsilon$ means the two probabilities must be nearly equal — strong privacy, because adding or removing a person barely changes the output distribution. A large $\varepsilon$ permits bigger differences — weaker privacy. It's a genuine budget: every query you answer about the data spends some of it, and once it's gone, further queries would erode the guarantee, so you must ration it across everything you publish.
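The rationing can be sketched as a tiny ledger. This assumes basic sequential composition, where the $\varepsilon$ values of successive queries simply add; tighter accounting methods exist and are not shown. The class name `PrivacyBudget` and its `spend` method are hypothetical, chosen for illustration.

```python
class PrivacyBudget:
    """Tracks a total epsilon under basic sequential composition (epsilons add)."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total - self.spent

    def spend(self, epsilon):
        """Reserve epsilon for one query, or refuse if it would overdraw the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:  # small tolerance for float rounding
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.5)    # first release
budget.spend(0.25)   # second release
print(budget.remaining)  # 0.25
```

A third release asking for 0.5 would be refused, which is the point: the ledger, not the analyst's optimism, decides when publishing stops.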

How it's done: calibrated noise

How do you make a query satisfy that definition? You add carefully calibrated random noise to the answer. Want to release a count? Compute it, then add a random draw from a Laplace (or Gaussian) distribution before publishing.

Figure: The differential-privacy mechanism. The true answer is computed, then deliberately blurred with random noise calibrated to the privacy budget ε before release. The noisy answer stays useful in aggregate while hiding any one person's contribution.

The amount of noise is tuned to two things: the privacy budget $\varepsilon$, and the query's sensitivity, meaning how much one person could change the result. One person changes a count by at most 1, so a count needs little noise; a sum of incomes can swing by as much as the largest possible income, so it needs more (in practice values are clipped to a fixed range first, so that the sensitivity is bounded). For the Laplace mechanism the noise scale is the sensitivity divided by $\varepsilon$. The noise is large enough to mask any single individual's contribution, yet its size does not grow with the dataset, so for large aggregates the relative error is small and the published figure stays accurate. Crucially, DP also composes: under basic composition the $\varepsilon$ values of multiple queries add up, which is what makes the budget bookkeeping work.
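Here is a minimal sketch of the Laplace mechanism for a count, using only the Python standard library. The function names and the data are hypothetical. The Laplace draw is built as the difference of two exponential draws, which is a standard identity. Treat it as an illustration rather than a production mechanism: naive floating-point sampling has known weaknesses, and a real release would normally go through a vetted DP library.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng=random):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    sensitivity = 1  # adding or removing one person changes a count by at most 1
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # seeded only so the sketch is repeatable
ages = [rng.randint(18, 90) for _ in range(10_000)]  # hypothetical data
true_count = sum(1 for a in ages if a >= 65)
noisy = private_count(ages, lambda a: a >= 65, epsilon=0.5, rng=rng)
print(true_count, round(noisy, 1))
```

With $\varepsilon = 0.5$ the noise scale is 2, so the released figure is typically within a few units of a true count in the thousands.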

The privacy-utility trade-off

There's no free lunch, and DP is refreshingly honest about it: more privacy means more noise means less accuracy. Push $\varepsilon$ down for strong privacy and your published numbers get noisier and less useful; raise it for accurate numbers and you weaken the protection. This privacy-utility trade-off is the central, unavoidable tension of the whole field.
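The trade-off can be put in numbers. The sketch below assumes the Laplace mechanism, for which the expected absolute error equals the noise scale (sensitivity divided by $\varepsilon$) and the standard deviation is $\sqrt{2}$ times that. The helper name `laplace_error` is hypothetical.

```python
import math

def laplace_error(epsilon, sensitivity=1.0):
    """Expected absolute error and standard deviation of Laplace noise at scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return scale, math.sqrt(2) * scale

for eps in (0.1, 0.5, 1.0, 5.0):
    mean_abs, std = laplace_error(eps)
    print(f"epsilon={eps:<4} expected |error|={mean_abs:6.2f}  std={std:6.2f}")
```

At $\varepsilon = 0.1$ a count is off by about 10 on average, which is negligible for a count of 50,000 and ruinous for a count of 12.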

What DP gives you is not an escape from the trade-off but the ability to quantify and choose it explicitly — to set $\varepsilon$ as a deliberate, defensible policy decision rather than crossing your fingers. The US Census Bureau adopted DP for the 2020 census (with a sizeable epsilon, itself a public, debated choice), and Apple and Google use it to gather usage statistics without collecting individuals' raw behaviour.

Where the noise goes: local vs global

There are two places to add the noise, and the choice reflects who you trust:

Global (central) DP: a trusted curator holds the real data, runs the query, and adds noise to the output. Less noise for the same privacy (more accurate), but you must trust the curator with the raw data.
Local DP: each person's data is randomised before it ever leaves their device, so even the collector never sees the truth. The toy intuition is randomised response: to survey a sensitive yes/no question, each respondent secretly flips a coin and sometimes answers randomly, so individuals are deniable, yet the true proportion is recoverable in aggregate. It demands less trust, because no one ever holds the raw data, but it needs much more noise for the same accuracy. (This is what Apple/Google use.)
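Randomised response is small enough to simulate. The sketch below assumes one common variant: report the truth on heads, and on tails report a second fair coin flip instead. The function names and the 30% "true" rate are made up for illustration. Under this variant a true "yes" is reported as "yes" with probability 0.75 and a true "no" with probability 0.25, so each report is $\ln 3$-differentially private.

```python
import math
import random

def randomised_response(truth, rng):
    """Report the truth on heads; on tails, report a second fair coin flip instead."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_proportion(reports):
    """Invert E[observed yes rate] = 0.5 * p + 0.25 to recover p."""
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5

rng = random.Random(42)  # seeded only so the sketch is repeatable
true_p = 0.30            # hypothetical true share of "yes"
truths = [rng.random() < true_p for _ in range(100_000)]
reports = [randomised_response(t, rng) for t in truths]

print(round(estimate_proportion(reports), 3))  # close to 0.30, but not exact
print(math.log(3))  # epsilon of this variant: ln(0.75 / 0.25)
```

No single report can be held against its author, since a "yes" may be the coin talking, yet the aggregate estimate lands near the true rate. The price is visible too: it takes a large sample to get that accuracy.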

Where it shows up in my work

Publishing stats without exposing people

Releasing aggregate statistics from sensitive data is a routine part of government-analyst work, and this page is the rigorous answer to "is it safe to publish?" The first thing it changes is the instinct: stripping identifiers is not enough — re-identification by linkage is real, so the safety has to come from the process, not from hoping the data is anonymous. DP is how you make a release that holds up against an attacker with outside knowledge.

And the privacy-utility trade-off reframes it as an explicit, defensible choice: setting $\varepsilon$ is a policy decision about how much accuracy to trade for how much protection, made openly rather than by accident. It's the technical complement to data governance (the policy) and fairness (the other responsibility owed to the people in the data) — together, the toolkit for handling data about humans without harming them.

Refresh in 60 seconds

Anonymisation fails — re-identification by linkage (quasi-identifiers like postcode+DOB+sex; Sweeney, Netflix, AOL). You can't redact your way to privacy.
k-anonymity helps but breaks (homogeneity attack, unknown side info). It reasons about the table, not the process.
Differential privacy: does the output change if any one person is in or out? If not, you're protected — privacy is a property of the algorithm.
$\varepsilon$ is the privacy budget: small ε = strong privacy + more noise; it's spent across queries (composition).
The mechanism: add calibrated noise (Laplace/Gaussian), tuned to ε and query sensitivity. Masks individuals, averages out in aggregate.
Unavoidable privacy-utility trade-off (Census 2020, Apple/Google). Global DP (trusted curator, less noise) vs local DP (randomise on-device, more noise).

The re-identification cases, the ε/budget definition, the noise mechanism, and the local-vs-global distinction reflect current differential-privacy references alongside hands-on work.