Advanced Database Systems

The database systems page covered the relational model, SQL, and why databases beat files. This is the sequel: what's actually going on inside the engine. How does it turn your SQL into a fast plan? How do thousands of simultaneous users not trample each other? How does it survive a crash, or split across a hundred machines? These are the questions that separate "I can write a query" from "I understand the system I'm betting my data on."

It builds on the foundation page and ties to cluster & cloud computing once we go distributed. The recurring theme: every powerful feature is a trade-off, and knowing which one a system made tells you how it behaves when things get hard.

Under the hood

When you run a SQL query, you say what you want, never how to get it — that declarative gap is the whole point. Behind the scenes the database turns that request into an executable plan and runs it as efficiently as it can. The components that make this work — the optimiser, the transaction manager, the storage engine — are what we unpack here. Understanding them is what lets you diagnose a slow query or a mysterious deadlock instead of guessing.

Query optimisation

The same query can be executed many ways — which table to read first, which index to use, which join algorithm — and they can differ in speed by orders of magnitude. The query optimiser is the brain that chooses, and it's one of the most sophisticated pieces of software in any database.

It works in stages: parse the SQL, generate candidate execution plans, estimate the cost of each using statistics about the data (how many rows, how values are distributed), and pick the cheapest. This is cost-based optimisation — the optimiser is effectively predicting which plan will touch the fewest rows and do the least I/O. It's why keeping table statistics up to date matters, and why the same query can suddenly turn slow when the optimiser's estimates drift from reality.

The query optimiser's pipeline. SQL is parsed, candidate plans are generated, each is costed using data statistics, and the cheapest plan is executed. The same query, many possible plans — the optimiser picks one.

Transactions and concurrency

The foundation page introduced transactions and the ACID guarantees. The hard part is delivering them when thousands of transactions run at once — that's concurrency control, and it's where databases earn their keep. Run transactions naively in parallel and they corrupt each other (one reads a value another is half-way through changing); run them strictly one-at-a-time and the system grinds to a halt. The job is to allow as much parallelism as possible while preserving the illusion that each transaction ran alone.

Databases offer this on a dial of isolation levels, and there are two broad strategies to enforce it:

Locking — a transaction locks the data it touches so others must wait. Safe, but contention-prone, and it can produce deadlocks (two transactions each waiting on a lock the other holds), which the database detects and breaks by aborting one.
MVCC (Multi-Version Concurrency Control) — instead of locking, keep multiple versions of each row, so readers see a consistent snapshot while writers create new versions. Readers never block writers and vice-versa. It's how Postgres and most modern databases get high concurrency, and it's the better default for read-heavy analytical work.

Storage internals

How data physically sits on disk decides performance, and there are two dominant designs:

B-trees — the classic structure behind most relational indexes (from the foundation page). Balanced for fast reads and good for read-heavy, update-in-place workloads. The default for decades.
LSM-trees (Log-Structured Merge trees) — batch writes in memory and flush them sequentially to disk, merging in the background. They make writes extremely fast, which is why write-heavy systems like Cassandra and many NoSQL stores use them — at some cost to read speed.

Underneath both sits the feature that makes durability real: the write-ahead log (WAL). Before changing the actual data, the database records the change in an append-only log. If it crashes mid-operation, it replays the log on restart to recover to a consistent state — nothing committed is ever lost. It's the unglamorous mechanism behind the "D" in ACID.

Going distributed

When data or traffic outgrows one machine, the database must spread across many — and the same scaling reality from the cluster & cloud page applies. Two techniques:

Partitioning / sharding — split the data across nodes (users A–M here, N–Z there) so each holds a slice. This scales capacity and write throughput, but cross-shard queries get harder.
Replication — keep copies of the same data on several nodes, for fault tolerance (a node can die) and read scaling (serve reads from any copy). But now you must keep the copies in sync — which is where it gets deep.

Consistency models

The instant you replicate data, you confront the CAP theorem(from the cluster & cloud page): when the network between nodes fails, you must choose between consistency (every read sees the latest write) and availability (every request still gets an answer). You can't have both during a partition. This forces a choice of consistency model:

Strong consistency — every read returns the most recent write, always. Simple to reason about, but slower and less available, since nodes must coordinate before answering. The right call for a bank balance.
Eventual consistency — reads might briefly return stale data, but all copies converge given time. Fast and highly available — the right call for a social feed or a like count, where a moment of staleness is harmless.

Neither is "correct"; each suits different needs. Recognising which model a system chose tells you exactly how it will behave when a node or network fails — and that's the question that matters in production.

The NoSQL families

To get that scale and flexibility, NoSQL databases relax the rigid relational model. They're not one thing but a family, each shaped for a kind of data:

Document (MongoDB) — store flexible JSON-like documents; great when the schema varies.
Key-value (Redis) — a giant fast dictionary; ideal for caching and sessions.
Wide-column (Cassandra) — huge tables spread across many nodes, write-optimised.
Graph (Neo4j) — model entities and relationships directly; built for connected data like networks.

The trade-off is the recurring one: most NoSQL stores drop some of the relational guarantees (rich joins, strict schemas, full ACID) in exchange for scale, flexibility, or speed on a particular shape of data. They complement the relational database rather than replace it — you pick the store that fits the job.

Analytical databases

A final, important split — the OLTP vs OLAP distinction from the foundation page, taken to its hardware conclusion. Transactional databases store data row by row (fast to read or write a whole record). Analytical databases — data warehouses like BigQuery, Snowflake, Redshift — store it column by column. Columnar storage is transformational for analytics: a query that sums one column reads only that column off disk, not every row, and similar values packed together compress beautifully. It's why the same "big query over history" runs in seconds on a warehouse and minutes on a transactional database — and why serious analytics lives in a separate, columnar store.

Where it shows up in my work

Refresh in 60 seconds

The query optimiser turns declarative SQL into the cheapest execution plan using data statistics (cost-based) — keep stats fresh.
Concurrency control preserves the illusion each transaction ran alone: locking (can deadlock) vs MVCC (versioned snapshots, readers don't block writers). Isolation levels trade safety for speed.
Storage: B-trees (read-optimised) vs LSM-trees (write-optimised); the write-ahead log delivers durability.
Scale out with partitioning/sharding and replication — which forces the CAP choice: strong vs eventual consistency.
NoSQL families (document/key-value/wide-column/graph) drop some relational guarantees for scale/flexibility — complements, not replaces.
Analytical/columnar stores (warehouses) read only the columns a query needs — seconds vs minutes for big aggregates. Every feature is a trade-off.