Geospatial Analysis & GIS

Almost every dataset has a location attached — an address, a postcode, a set of coordinates — and once you put it on a map, patterns appear that no table would reveal: clusters, gaps, hotspots, corridors. Geospatial analysis is the discipline of working with that "where", and a GIS (Geographic Information System) is the software for it. It's genuinely its own field, because spatial data breaks some of the assumptions the rest of statistics quietly relies on.

It's also a core part of my government work — SA's social and operational data is deeply spatial. This page is the practical foundation: the data types, the few concepts that matter, and — importantly — the specific ways spatial analysis can mislead you if you're not careful.

Why location is different

Spatial data isn't just data with two extra columns. It comes with a deep principle, often called Tobler's first law of geography: everything is related to everything else, but near things are more related than distant things. Crime, disease, income, house prices: all of them cluster in space, and that clustering is usually the signal you care about.

That same clustering is also why ordinary statistics can mislead on spatial data: standard methods assume observations are independent, but neighbouring places aren't — they influence and resemble each other. So spatial analysis needs its own tools, and a healthy wariness of applying non-spatial ones blindly.

Vector and raster

Geographic data comes in two fundamental forms, and knowing which you have shapes everything:

Vector — discrete shapes: points (an incident, an address), lines (roads, rivers), and polygons (suburbs, council areas, states). Each has attributes attached, like rows in a table. Best for distinct features and boundaries.
Raster — a continuous grid of cells, each holding a value: a satellite image, an elevation surface, a heat map. Best for things that vary continuously across space (temperature, rainfall, density).

Most analytical work — joining records to areas, mapping rates by region — is vector. Rasters come in for imagery, terrain, and continuous surfaces. Many real projects use both together.
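To make the two forms concrete, here is a minimal sketch of how each might look as plain Python data. The GeoJSON-style layout, the feature names, and every coordinate and value are made up for illustration; a real project would load these through a GIS library rather than type them by hand.

```python
# Vector: discrete features, each a geometry plus a row of attributes.
# (GeoJSON-style dicts; names and coordinates are hypothetical.)
incident = {
    "geometry": {"type": "Point", "coordinates": (3.0, 4.0)},
    "properties": {"category": "noise complaint"},
}
suburb = {
    "geometry": {
        "type": "Polygon",
        # One exterior ring, closed by repeating the first vertex.
        "coordinates": [[(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]],
    },
    "properties": {"name": "Example Suburb", "population": 4200},
}

# Raster: a regular grid of cells, each holding one value
# (hypothetical rainfall in millimetres).
rainfall_mm = [
    [12.0, 14.5, 15.1],
    [11.2, 13.8, 16.0],
]
rows, cols = len(rainfall_mm), len(rainfall_mm[0])
```

The vector features carry their own attributes, while the raster's meaning lives entirely in the cell values and the grid's position on the ground.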

Coordinates and projections

The Earth is roughly a sphere (more precisely, a slightly flattened ellipsoid); a map is flat. Flattening one onto the other is a projection, and no projection can avoid distorting something: area, shape, distance, or direction. Every spatial dataset carries a coordinate reference system (CRS) that says how its coordinates map to real positions, and the most common source of silent spatial bugs is combining datasets whose CRSs differ without first reprojecting them to a common one.
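One way to see why the CRS matters is to measure the same "one degree of longitude" at two latitudes. The sketch below uses the haversine formula on a spherical Earth with a mean radius of 6,371 km, which is an approximation I am assuming for illustration; it is not how a GIS would reproject data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation (assumption)

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of longitude, measured at the equator and at 60 degrees latitude.
at_equator = haversine_km(0, 0, 1, 0)    # roughly 111 km
at_60 = haversine_km(0, 60, 1, 60)       # roughly half of that
```

Because a degree of longitude shrinks toward the poles, distances or areas computed directly on latitude/longitude values are not comparable across a map. That is one reason to reproject to a suitable projected CRS before measuring anything.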

The spatial join

The workhorse operation of spatial analysis is the spatial join: combining datasets by location rather than by a shared key. Instead of "match where the IDs are equal" (the database join), it's "match where the geometries relate". Which suburb does this point fall inside? Which incidents are within 500 metres of this site? How many addresses sit in each council area?

This is how you connect a list of events to the regions you want to analyse them by — and it's the bridge from raw points to the area-level rates you can map and compare. It's the spatial equivalent of the join that ties the whole relational world together, and just as central.
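A point-in-polygon join can be sketched in a few lines of plain Python. This version uses the ray-casting test and loops over every area for every point. The area names, square boundaries, and points are hypothetical, and a real GIS library would normally do this with a spatial index and proper edge handling.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: is (x, y) inside the polygon described by `ring`?

    `ring` is a list of (x, y) vertices. Points lying exactly on an edge
    are not treated specially in this sketch.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def spatial_join_counts(points, areas):
    """Count points per area; `areas` maps an area name to its vertex ring."""
    counts = {name: 0 for name in areas}
    for x, y in points:
        for name, ring in areas.items():
            if point_in_polygon(x, y, ring):
                counts[name] += 1
                break  # assumes areas do not overlap
    return counts

# Hypothetical data: two adjacent square "suburbs" and four incident points.
areas = {
    "West": [(0, 0), (5, 0), (5, 5), (0, 5)],
    "East": [(5, 0), (10, 0), (10, 5), (5, 5)],
}
points = [(1, 1), (2, 4), (7, 3), (12, 1)]  # the last point is outside both
counts = spatial_join_counts(points, areas)
```

Here `counts` holds two points for "West" and one for "East"; the point outside both squares is silently dropped, which is worth checking for in real joins.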

Choropleth maps done right

The choropleth — regions shaded by a value — is the most common thematic map, and the most commonly done wrong. Two rules make the difference between an honest map and a misleading one.

First, normalise: shade rates, not raw counts. A map of "number of cases per suburb" mostly shows where the people are: large or populous areas light up simply because they contain more people. Convert to a rate (cases per 1,000 people, incidents per square kilometre) so you're comparing like with like. Choropleths are for normalised numeric data, not raw totals and not categories.
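The arithmetic is simple, but it can flip the ranking. A small sketch with two hypothetical suburbs (the names and figures are invented):

```python
# Hypothetical suburbs: (name, cases, population).
suburbs = [
    ("Large Suburb", 120, 40_000),
    ("Small Suburb", 30, 5_000),
]

def rate_per_1000(cases, population):
    """Cases per 1,000 residents."""
    return cases * 1000 / population

counts = {name: cases for name, cases, _ in suburbs}
rates = {name: rate_per_1000(cases, pop) for name, cases, pop in suburbs}

highest_by_count = max(counts, key=counts.get)  # "Large Suburb"
highest_by_rate = max(rates, key=rates.get)     # "Small Suburb"
```

The large suburb has four times the cases but half the rate (3 per 1,000 against 6 per 1,000), so a map of counts and a map of rates would highlight different places.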

Second, choose the classification deliberately. How you bucket the values into colour bands — equal interval, quantiles, or natural breaks (Jenks) — changes which regions look high or low, and can completely alter the story. There's no single right choice, but there is a responsibility to pick one that reflects the real distribution rather than the one that flatters your point.
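To show how much the classification matters, the sketch below bands the same skewed values two ways. The rates are invented, the helper functions are my own simplified versions (natural breaks is omitted because it needs an optimisation step), and the quantile helper assumes there are at least as many values as bands.

```python
def equal_interval_breaks(values, k):
    """Upper bounds of k bands of equal width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]

def quantile_breaks(values, k):
    """Upper bounds of k bands holding roughly equal numbers of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // k - 1] for i in range(1, k + 1)]

def classify(value, breaks):
    """Index of the first band whose upper bound is >= value."""
    for band, upper in enumerate(breaks):
        if value <= upper:
            return band
    return len(breaks) - 1  # guard against floating-point rounding at the top

# Hypothetical skewed rates: most areas low, one extreme outlier.
rates = [1, 2, 2, 3, 3, 4, 5, 6, 40]

equal_bands = [classify(r, equal_interval_breaks(rates, 3)) for r in rates]
quantile_bands = [classify(r, quantile_breaks(rates, 3)) for r in rates]
```

With equal intervals the outlier takes the top band alone and the other eight areas all share the bottom colour; with quantiles the areas split three per band. Neither is wrong, but they tell visibly different stories from identical data.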

The MAUP trap

The deepest trap in spatial analysis has an unglamorous name: the Modifiable Areal Unit Problem (MAUP). It says that when you aggregate point data into areas, the boundaries you choose can change — even reverse — your results. The same underlying data can tell different stories depending on how you carve up the map. It has two faces:

Scale effect — the size of the units. Aggregate to states, to council areas, or to small census blocks and the same data shows different patterns, even though nothing real changed.
Zone effect — the shape of the units. Redraw the boundaries at the same scale (different districts, different groupings) and the result shifts — the mechanism behind gerrymandering.

The zone effect, concretely. The exact same points appear in both panels — only the boundary is redrawn. With a vertical split the left region is the 'hotspot' (7 vs 5); redraw it as a horizontal split and the top region wins (8 vs 4). Same data, opposite conclusion — purely from the boundary choice.
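The same effect can be reproduced in code. The twelve coordinates below are hypothetical, chosen only so that the counts match the 7 vs 5 and 8 vs 4 splits described above; they are not the points from the original illustration.

```python
# Twelve hypothetical points on a 10 x 10 square.
points = [
    (1, 9), (2, 7), (3, 8), (4, 6),   # top-left
    (6, 9), (7, 6), (8, 8), (9, 7),   # top-right
    (1, 2), (2, 4), (4, 1),           # bottom-left
    (8, 3),                           # bottom-right
]

def count_by_zone(points, zone_of):
    """Aggregate points into zones using the supplied boundary rule."""
    counts = {}
    for p in points:
        zone = zone_of(p)
        counts[zone] = counts.get(zone, 0) + 1
    return counts

# Same points, two different ways of drawing one boundary through the middle.
vertical = count_by_zone(points, lambda p: "left" if p[0] < 5 else "right")
horizontal = count_by_zone(points, lambda p: "top" if p[1] > 5 else "bottom")
```

The vertical split gives 7 on the left and 5 on the right; the horizontal split gives 8 on top and 4 on the bottom. Nothing about the points changed, only the rule passed as `zone_of`.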

The lesson isn't that spatial analysis is hopeless — it's that the choice of geographic unit is a real analytical decision with real consequences, not a neutral given. Be explicit about why you chose the units you did, and check whether your conclusion survives a different choice.

Spatial autocorrelation

Because near things resemble each other, spatial data exhibits spatial autocorrelation — neighbouring areas tend to have similar values. Measures like Moran's I quantify it: is the pattern clustered (high values next to high), dispersed, or random? Detecting and locating clusters — hotspots of high values and cold-spots of low — is often the entire point of the analysis.
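For reference, global Moran's I for values $x_i$ with spatial weights $w_{ij}$ is $I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2}$. The sketch below computes it directly for a toy case of six areas in a row, using binary "shares a border" weights. The weights scheme and the values are assumptions for illustration, and the sketch does not include the permutation test you would need to judge significance.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` given a square spatial weights matrix.

    weights[i][j] > 0 means areas i and j are neighbours; the diagonal is 0.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # total of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

def chain_weights(n):
    """Binary contiguity for n areas in a row: each touches its neighbours."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

w = chain_weights(6)
clustered = morans_i([1, 1, 1, 5, 5, 5], w)    # about 0.6: like next to like
alternating = morans_i([1, 5, 1, 5, 1, 5], w)  # about -1.0: dispersed
```

Positive values indicate clustering, negative values indicate dispersion, and values near the expected value under randomness, $-1/(n-1)$, indicate no spatial pattern.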

It also matters for honesty: spatial autocorrelation violates the independence assumption behind ordinary regression, so a naïve model on spatial data understates its uncertainty and can manufacture significance that isn't there. The fix is spatial models that build the neighbour-relationships in — the spatial cousins of the methods on the modelling pages.

Common pitfalls

A quick field guide to the mistakes that bite hardest:

Mismatched CRS — layers that don't line up; always check first.
Mapping raw counts — you've drawn a population map; normalise to rates.
Cherry-picked classification — bands chosen to flatter the story.
Ignoring MAUP — treating one set of boundaries as the truth.
Non-spatial stats on spatial data — ignoring autocorrelation and overstating significance.

Where it shows up in my work

Refresh in 60 seconds

Most data has a "where", and near things are more related than distant ones — so spatial data clusters and breaks the independence ordinary stats assume.
Two data types: vector (points/lines/polygons) and raster (a value grid). Always check the CRS / projection — mismatches are the #1 silent bug.
The spatial join matches by location, not key (which suburb is this point in?) — the bridge from points to area rates.
Choropleths: shade rates not raw counts, and choose the classification (equal interval / quantile / natural breaks) deliberately.
MAUP: the boundaries you pick (scale + zone) can change or reverse the result — the unit choice is a real decision.
Spatial autocorrelation (Moran's I) finds hotspots — and means you need spatial models, not naïve regression, for honest inference.

The choropleth, classification, and MAUP guidance on this page reflects current cartography and spatial-analysis references alongside hands-on government work.