Sampling & Survey Methodology

There's something almost paradoxical about a poll of 1,500 people claiming to represent a nation of millions — and yet, done properly, it does, to a precision you can quantify. That near-magic is the achievement of sampling theory: the rules for choosing a small group so that what's true of them is reliably true of everyone. Get the sampling right and a few thousand answers tell you about millions. Get it wrong and no sample size in the world saves you — the famous polling disasters were all sampling failures, not arithmetic ones.

This is foundational to trusting any number computed from part of a population, which is nearly every number in practice. This page covers how sampling works, why it works, and, most importantly, the quiet ways it fails, because a biased sample produces a confident, precise, completely wrong answer.

Why a sample can stand in for everyone

The premise is that you rarely need to measure everyone (a census) to know about everyone. A well-chosen subset carries the information of the whole, at a fraction of the cost and time. The key word is well-chosen — and the engine that makes it work is a single idea borrowed from the probability page: randomness.

If every member of the population has a known, non-zero chance of being picked, then the laws of probability let you generalise from the sample to the population and put honest error bars on the generalisation. Remove the randomness and that bridge collapses — you have a pile of answers from whoever happened to respond, representing nobody in particular.
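To make "honest error bars" concrete, here is a minimal simulation sketch. It assumes a made-up population of 100,000 with a known 40% "yes" share, draws many simple random samples of 1,500, and counts how often the usual 95% interval for a proportion contains the truth. The population, the seed, and the function name are illustrative assumptions, not anything taken from a real survey.

```python
import math
import random

def interval_coverage(true_share=0.40, pop_size=100_000, n=1_500, reps=500, seed=1):
    """Share of repeated simple random samples whose 95% interval contains the truth."""
    rng = random.Random(seed)
    # Hypothetical population: 1 = "yes", 0 = "no", with a known true share.
    n_yes = round(true_share * pop_size)
    population = [1] * n_yes + [0] * (pop_size - n_yes)
    truth = n_yes / pop_size

    hits = 0
    for _ in range(reps):
        sample = rng.sample(population, n)   # every unit equally likely
        p_hat = sum(sample) / n
        moe = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - moe <= truth <= p_hat + moe:
            hits += 1
    return hits / reps

print(f"intervals containing the truth: {interval_coverage():.1%}")
```

Under these assumptions the printed share should land close to 95%. That is the sense in which the error bars are honest: the interval's stated confidence matches how often it actually captures the truth, and that only holds because selection was random.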

Population, frame & sample

Three distinct things get muddled constantly, and the gaps between them are where bias lives:

The population — everyone you want to draw conclusions about (all eligible voters, all residents).
The sampling frame — the actual list you draw from (the electoral roll, a phone directory). It's your operational stand-in for the population.
The sample — those actually selected from the frame and measured.

The crucial, easy-to-miss point: the frame is almost never the population. Anyone in the population but not on the list — people without phones, the unlisted, the unreachable — cannot be sampled, no matter how good your method. That gap is coverage, and it's the first place a study quietly goes wrong.
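The size of the damage from a coverage gap can be written down with simple arithmetic: the frame's average differs from the population's average by the uncovered share times the difference between covered and uncovered people. The sketch below works that through with invented numbers; the function name and every figure in it are assumptions for illustration.

```python
def coverage_bias(covered_share, mean_covered, mean_not_covered):
    """Gap between the frame's mean and the full population's mean."""
    population_mean = (covered_share * mean_covered
                       + (1 - covered_share) * mean_not_covered)
    bias = mean_covered - population_mean
    # Equivalent form: (1 - covered_share) * (mean_covered - mean_not_covered)
    return population_mean, bias

# Hypothetical numbers: 80% of the population is on the list; support for a
# policy is 55% among those on the list and 35% among those missing from it.
pop_mean, bias = coverage_bias(0.80, 0.55, 0.35)
print(f"population: {pop_mean:.3f}  frame: 0.550  bias: {bias:+.3f}")
```

With these made-up inputs, even a perfect random sample from the list would overstate support by about four points. The bias vanishes only if nobody is missing or if the missing people happen to answer like the covered ones.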

Probability sampling: the methods that generalise

Probability sampling (every unit has a known chance of selection) is what lets you generalise. The main designs:

Simple random sampling — every unit equally likely; the clean baseline.
Stratified sampling — split the population into groups (strata: age, region) and sample within each, guaranteeing every group is represented in proportion. More efficient and precise when the strata differ from each other.
Cluster sampling — randomly pick whole groups (schools, suburbs) and survey everyone in the chosen ones. Cheaper for spread-out populations, at some cost to precision.
Systematic sampling — take every k-th unit from an ordered list. Simple, but beware hidden periodicity in the list.
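The four designs above can be sketched in a few lines each. This is a minimal illustration using Python's standard `random` module, not a production sampler: the toy frame of `(person_id, region, school)` tuples is invented, and the stratified version assumes proportional allocation, which is one choice among several.

```python
import random
from collections import defaultdict

def simple_random(units, n, rng):
    """Every unit equally likely."""
    return rng.sample(units, n)

def stratified(units, stratum_of, n, rng):
    """Proportional allocation: each stratum gets roughly its share of n."""
    strata = defaultdict(list)
    for u in units:
        strata[stratum_of(u)].append(u)
    sample = []
    for members in strata.values():
        k = max(1, round(n * len(members) / len(units)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

def cluster(units, cluster_of, n_clusters, rng):
    """Pick whole clusters at random, then take everyone in them."""
    clusters = defaultdict(list)
    for u in units:
        clusters[cluster_of(u)].append(u)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for c in chosen for u in clusters[c]]

def systematic(units, k, rng):
    """Random start, then every k-th unit of the ordered list."""
    start = rng.randrange(k)
    return units[start::k]

# Toy frame (hypothetical): 1,000 people as (person_id, region, school).
rng = random.Random(0)
frame = [(i, "north" if i % 4 == 0 else "south", i % 20) for i in range(1000)]

srs = simple_random(frame, 100, rng)
strat = stratified(frame, lambda u: u[1], 100, rng)
clus = cluster(frame, lambda u: u[2], 2, rng)
syst = systematic(frame, 10, rng)
print(len(srs), len(strat), len(clus), len(syst))
```

The toy frame also shows the periodicity warning in action. School ids repeat every 20 rows, so taking every 10th row only ever reaches two of the twenty schools, whatever the random start.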

Two kinds of error — and the worse one is invisible

This is the distinction that separates people who understand surveys from those who don't:

Sampling error — the random variation from measuring a sample rather than everyone. It's quantifiable (the margin of error), shrinks predictably as the sample grows, and is the error everyone reports.
Non-sampling error — everything else: coverage gaps, non-response, badly worded questions, lying respondents, data-entry mistakes. It does not shrink with sample size, it's usually not quantified, and it's where the big, embarrassing failures come from.
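One way to see why the second kind of error is worse is to combine the two into a total error. The sketch below treats non-sampling error as a fixed bias of 3 percentage points, which is a deliberate simplification and an invented figure, and uses the standard decomposition of mean squared error into variance plus squared bias.

```python
import math

def rmse(n, p=0.5, bias=0.03):
    """Root-mean-square error of a sample proportion carrying a fixed bias."""
    sampling_variance = p * (1 - p) / n               # shrinks as n grows
    return math.sqrt(sampling_variance + bias ** 2)   # the bias term does not

for n in (100, 1_000, 10_000, 1_000_000):
    sampling_sd = math.sqrt(0.25 / n)
    print(f"n={n:>9,}  sampling sd={sampling_sd:.4f}  total rmse={rmse(n):.4f}")
```

Under this assumption the sampling component falls toward zero while the total error flattens out just above 0.03. Past a modest sample size, nearly all the remaining error is the part nobody reports.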

The biases that bite

Two non-sampling biases account for most real-world disasters:

Selection / coverage bias — when the frame systematically misses part of the population, in a way that's related to what you're measuring. The textbook case: the 1936 Literary Digest poll sampled millions from car and telephone registries — wealthier than average in the Depression — and confidently predicted the wrong election winner. Huge sample, fatal coverage bias.
Non-response bias — when those who don't respond differ from those who do. If busy people, or unhappy people, or private people systematically skip the survey, the respondents stop representing the population — and with response rates falling for years, this is the dominant modern worry.

Both share a signature: they're invisible in the data you collected (the respondents look fine on their own), and they don't go away with more responses. You have to reason about who is missing and why.
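A short simulation makes the "more responses don't help" point visible. It assumes a population split 50/50 on some question, where people who would say "no" respond twice as often as people who would say "yes". The response rates, the seed, and the function name are invented for illustration.

```python
import random

def survey(n_invited, rng, true_share=0.50, respond_if_yes=0.10, respond_if_no=0.20):
    """Invite people at random; 'no' people respond twice as often as 'yes' people."""
    responses = []
    for _ in range(n_invited):
        says_yes = rng.random() < true_share
        p_respond = respond_if_yes if says_yes else respond_if_no
        if rng.random() < p_respond:
            responses.append(says_yes)
    return sum(responses) / len(responses), len(responses)

rng = random.Random(42)
for n_invited in (2_000, 20_000, 200_000):
    estimate, n_resp = survey(n_invited, rng)
    print(f"invited={n_invited:>7,}  responded={n_resp:>6,}  "
          f"estimate={estimate:.3f}  truth=0.500")
```

Under these assumed response rates the estimate settles near one third rather than one half, and inviting a hundred times as many people only makes the wrong answer more stable. Nothing in the collected responses flags the problem; it shows up only by comparing against who was invited.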

Weighting it back into shape

When a sample is imbalanced — too few young people, too many from one city — you can partly repair it with weighting. Each respondent is given a weight so that under-represented groups count for more and over-represented groups count for less, pulling the weighted sample's profile back to match known population totals.
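In its simplest form, the weight for a group is its population share divided by its sample share. Here is a minimal sketch with a made-up sample of ten people and assumed population shares; the helper names are hypothetical.

```python
from collections import Counter

def cell_weights(sample_groups, population_shares):
    """Weight per group = population share / sample share."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {g: population_shares[g] / (counts[g] / n) for g in counts}

def weighted_mean(values, groups, weights):
    """Average of values, with each respondent counted by their group's weight."""
    total = sum(weights[g] * v for v, g in zip(values, groups))
    return total / sum(weights[g] for g in groups)

# Hypothetical sample: young people are 20% of it but 40% of the population.
groups = ["young"] * 2 + ["old"] * 8
answers = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]   # 1 = supports the policy
weights = cell_weights(groups, {"young": 0.40, "old": 0.60})

print("weights:", weights)
print("unweighted:", sum(answers) / len(answers))
print("weighted:", weighted_mean(answers, groups, weights))
```

In this toy case each young respondent counts double and each older respondent counts for three quarters, which moves the estimate from 0.30 to roughly 0.475. Note how much of the weighted answer now rests on just two people.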

The common techniques are post-stratification and raking (iterative proportional fitting), which nudge the weighted margins to match census figures for age, sex, region, and so on. It's a powerful correction — and an honest one only up to a point: weighting can fix imbalance on variables you can measure and know the population totals for. It cannot fix bias on the things you didn't measure, and aggressive weighting inflates the variance (a few heavily-weighted respondents swing the estimate). It's a patch, not a substitute for good sampling.
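Raking and its variance cost can both be sketched briefly. The code below is a bare-bones iterative proportional fitting loop over respondent-level weights, followed by Kish's approximate effective sample size, (Σw)² / Σw², as a rough gauge of how much the weights inflate variance. The sample, the targets, and the function names are invented, and the loop assumes every target category has at least one respondent.

```python
def rake(respondents, targets, iterations=50):
    """Iterative proportional fitting over respondent-level weights.

    respondents: list of dicts, e.g. {"age": "young", "region": "city"}
    targets: {variable: {category: population share}}
    """
    weights = [1.0] * len(respondents)
    for _ in range(iterations):
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted share of each category of this variable.
            current = {cat: 0.0 for cat in shares}
            for w, r in zip(weights, respondents):
                current[r[var]] += w / total
            # Rescale so this variable's margin matches its target.
            weights = [w * shares[r[var]] / current[r[var]]
                       for w, r in zip(weights, respondents)]
    return weights

def kish_effective_n(weights):
    """Kish's approximation: (sum of weights)^2 / sum of squared weights."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

# Hypothetical sample of 100 with both margins off target.
respondents = (
    [{"age": "young", "region": "city"}] * 10
    + [{"age": "young", "region": "rural"}] * 10
    + [{"age": "old", "region": "city"}] * 50
    + [{"age": "old", "region": "rural"}] * 30
)
targets = {"age": {"young": 0.40, "old": 0.60},
           "region": {"city": 0.50, "rural": 0.50}}

weights = rake(respondents, targets)
print(f"effective sample size: {kish_effective_n(weights):.1f} of {len(respondents)}")
```

The weighted margins end up matching the targets, but the effective sample size drops below the nominal 100. That shortfall is the price of the repair: the further the sample was from the targets, the larger the weights and the bigger the loss.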

How big a sample?

A pleasant surprise from the statistics page: the precision of an estimate depends mostly on the absolute sample size, not the fraction of the population. The margin of error for a proportion shrinks with the square root of the sample size:

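$$\text{MoE} \approx z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Here $\hat{p}$ is the sample proportion, $n$ is the sample size, and $z$ is the critical value for the chosen confidence level (about 1.96 for 95%).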

The $\sqrt{n}$ is the catch: to halve the margin of error you must quadruple the sample. That's why national polls cluster around 1,000–2,000 people (≈ ±2–3%) — beyond that, the sampling error is already small and you get diminishing returns, while the non-sampling error you should actually worry about doesn't budge. Spending on a bigger sample to fix a biased one is the classic misallocation.
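The arithmetic is easy to check directly. This sketch evaluates the margin-of-error formula at the worst case $\hat{p} = 0.5$ with $z = 1.96$ (95% confidence), and inverts it to get the sample size needed for a target margin. It assumes a simple random sample from a large population; the function names are illustrative.

```python
import math

def margin_of_error(n, p_hat=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def sample_size_for(moe, p_hat=0.5, z=1.96):
    """Smallest n whose margin of error is at most `moe`."""
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / moe ** 2)

for n in (500, 1_000, 2_000, 4_000, 16_000):
    print(f"n={n:>6,}  MoE = ±{margin_of_error(n):.1%}")

print("n needed for ±3%:", sample_size_for(0.03))
```

Going from 1,000 to 4,000 respondents halves the margin from roughly ±3.1% to roughly ±1.55%, and halving it again takes 16,000. Each extra point of precision costs far more than the last.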

Where it shows up in my work

Is this number representative?

As a government analyst, I find that a great deal of my work rests on numbers computed from samples or surveys, and the most valuable habit this gives me is asking "representative of what?" before trusting any of them. What was the frame, and who does it miss? Was selection a probability design, or an opt-in that can't be generalised? Most of all, what's the non-response picture? The quiet, unquantified bias is the one that turns a confident statistic into a misleading one.

It's also the lens for reading others' figures critically: a precise-looking "± 2%" on a self-selected sample is precisely wrong, and a number "weighted to be representative" is only as good as the variables it was weighted on. Knowing where sampling fails is what separates a figure you can brief on from one you should push back on — and it ties straight to the selection-bias and inference ideas elsewhere in this section.

Refresh in 60 seconds

A small random sample can represent a huge population — randomness is the bridge that lets you generalise and put honest error bars on it.
Mind the gaps: population ≠ frame (the list you draw from) ≠ sample. The frame misses people — that's coverage.
Probability sampling (simple / stratified / cluster / systematic) lets you generalise; non-probability sampling (convenience, opt-in) does not. A million opt-in responses are better than nothing but lose to a good random thousand.
Sampling error is small, quantified, and shrinks in proportion to $1/\sqrt{n}$. Non-sampling error (coverage, non-response, bad questions) is often bigger, usually unquantified, and doesn't shrink with size.
The killers: coverage bias (Literary Digest) and non-response bias. Weighting/raking repairs known imbalances — not the ones you didn't measure.
To halve the margin of error, quadruple the sample. A bigger sample around a biased estimate is precisely wrong.

The probability-vs-non-probability distinction, non-response/weighting practice, and the sampling-vs-non-sampling-error framing reflect current survey-methodology references alongside statistics coursework.