Database Systems

Models and dashboards get the glory, but they all sit on top of a database. It's the unglamorous foundation that keeps millions of facts correct, consistent, and fast to query — and an analyst who understands it pulls clean data in seconds where others wrestle with it for hours. This is the systems half of the foundation: not the maths of data, but the engineering of storing and retrieving it at scale.

Every analyst role I've held has run on SQL, so this is a page I use the practice of constantly. Here's the whole picture — from why databases exist to how they answer a query in milliseconds.

Why databases exist

You could keep data in spreadsheets or files. For anything real, that breaks fast. A database managed by a database management system (DBMS) exists to solve the problems files can't:

Concurrency — hundreds of users reading and writing at once without corrupting each other's work.
Integrity — rules that keep the data valid (no order pointing at a customer who doesn't exist).
Scale and speed — querying billions of rows in milliseconds, not minutes.
Durability — committed data survives a crash or power loss.

A spreadsheet gives you none of these guarantees. A database is built from the ground up to provide all four at once.

The relational model

The dominant design, unchanged in its essentials since the 1970s, is the relational model: store data in tables (relations), where each row is a record and each column an attribute. Tables are linked by keys:

A primary key uniquely identifies each row (a customer ID).
A foreign key in one table points at the primary key of another, encoding a relationship (an order's customer_idreferencing the customers table).

This simple idea — facts in tables, relationships as key references — is enough to model almost any domain, and its discipline is what prevents the contradictions that plague spreadsheets.

Two related tables. Each Order carries a customer_id foreign key pointing at the Customers table's primary key — so the relationship is stored once, with no duplicated customer details.

SQL: asking for data

You talk to a relational database in SQL (Structured Query Language). Its defining trait is that it's declarative: you describe what you want, not how to get it. You say "give me the total sales per region, sorted high to low" and the database's query planner works out the most efficient way to compute it.

SELECT region, SUM(total) AS sales
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY region
ORDER BY sales DESC;

Four clauses carry most of the work everywhere: SELECT (which columns), FROM (which table), WHERE (filter rows), and — for summaries — GROUP BY with aggregate functions like SUM and COUNT. Add the four write operations (INSERT, UPDATE, DELETE, and CREATE for structure) and you can do nearly everything. Fluency here is the single highest-leverage skill for an analyst.

Normalisation

How you split data across tables matters enormously. Normalisation is the process of organising tables to eliminate redundancy, so each fact is stored exactly once. The motivation is the update anomaly: if a customer's email is copied into all 500 of their order rows, changing it means 500 updates — and miss one and the data now contradicts itself.

The fix is to store the email once in a customers table and reference it by key. The normal forms (1NF, 2NF, 3NF) are a progression of rules for doing this systematically — roughly: one value per cell, every column depending on the whole key, and no column depending on another non-key column.

Joins

Because normalised data lives in separate tables, you reassemble it with a JOIN — matching rows across tables on a key. To list each order with its customer's name, you join orders to customers on orders.customer_id = customers.id. The main kinds:

INNER JOIN — only rows with a match in both tables (orders that have a valid customer).
LEFT JOIN — every row from the left table, with matches from the right where they exist and blanks where they don't (all customers, even those with no orders).

Joins are where SQL gets genuinely expressive — and where beginners trip, usually by accidentally multiplying rows (a join on a non-unique key) or dropping rows (an inner join when they meant a left). Picturing which rows survive the match is the whole game.

Indexing

How does a database find a needle in a billion-row haystack in milliseconds? The same way you find a word in a dictionary — you don't read every page. An index is a separate, sorted data structure (almost always a B-tree) that lets the database jump near-instantly to matching rows instead of scanning the whole table.

A full scan of a billion rows is linear — time proportional to the table size. A B-tree index turns that into roughly log(n) steps: a handful of hops down the tree, even for billions of rows. That's the difference between a query taking minutes and taking milliseconds.

A B-tree index. The query starts at the root and follows a few branches down to the target — a handful of steps for billions of rows, instead of scanning every one.

Transactions and ACID

Some operations must happen all-or-nothing. Transferring money debits one account and credits another — if the system crashes between the two, you can't leave the money vanished. A transaction groups statements into one indivisible unit, and relational databases guarantee them with the ACID properties:

Atomicity — all of it happens, or none of it does. No partial transfers.
Consistency — the database moves from one valid state to another, never breaking its rules.
Isolation — concurrent transactions don't see each other's half-finished work.
Durability — once committed, it survives a crash.

ACID is the bedrock of trust in any system handling money, records, or anything where "mostly correct" isn't good enough — which is most of the government and finance data I've worked with.

OLTP, OLAP, and NoSQL

Not all databases are tuned for the same job, and matching the store to the workload is a real design decision:

OLTP (transactional) — many small, fast reads and writes; normalised; the live system behind an app. Optimised for ACID throughput.
OLAP (analytical) — big aggregate queries over history; often de-normalised into a data warehouse with columnar storage. This is where most analytics and BI actually run.
NoSQL — a family (document, key-value, graph, wide-column) that trades some relational guarantees for scale or flexibility, for data that doesn't fit neat tables. Useful, but not a default replacement for the relational model.

Where it shows up in my work

The skill under every analysis

SQL is the most-used tool across every analyst role I've held — the step before any chart, model, or Power BI report exists. Knowing the systems underneath is what makes it fast: understanding indexes is why I write queries that filter on indexed columns instead of triggering a full scan; understanding joins is how I assemble data from a normalised OLTP source without silently dropping or duplicating rows; and understanding the OLTP/OLAP split is why I model a Power BI data layer for fast aggregate reads rather than querying the live transactional system.

In government work the ACID guarantees and integrity constraints aren't academic — when the data feeds a minister's brief or a compliance decision, "the numbers are consistent and complete" is the whole job.