Cluster & Cloud Computing

For a long time, computers got faster every year and you could just wait for the hardware to catch up to your data. That free lunch ended — single processors stopped getting dramatically faster — and datasets kept growing. The answer is distributed computing: instead of one bigger machine, use many machines working together. Simple to say, genuinely hard to do well.

This is the engineering that lets analysis run at real-world scale; the database and ML work on these pages all hit it eventually. I learned it on Melbourne's SPARTAN supercomputer and on Spark. Here's the whole picture, including the catch you can't buy your way past with any amount of hardware.

The scaling wall

There are two ways to get more computing power, and the difference matters:

Scaling up (vertical) — buy a bigger machine: more RAM, more cores, faster disks. Simple, but there's a hard ceiling and the price climbs steeply.
Scaling out (horizontal) — add more ordinary machines and split the work across them. Near-limitless and cheap per unit, but now your program has to be written to run in pieces across a network.

Big data lives in the scaling-out world. The moment a dataset won't fit in one machine's memory — or one machine would take a week to process it — you need a cluster: a group of networked computers (nodes) coordinated to act as one. That shift solves the size problem and creates three new ones — splitting the work, coordinating the pieces, and surviving the failures that become inevitable once you have hundreds of machines.

Parallelism and Amdahl's law

Running work in parallel comes in two flavours: data parallelism (split the data and run the same operation on each chunk, the dominant pattern in analytics) and task parallelism (different machines do different jobs at once). Either way, you hit a fundamental limit that every distributed-systems engineer must respect.
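The two flavours can be sketched on a single machine with Python's standard `concurrent.futures`. Threads stand in for cluster nodes here, which is an assumption made purely for illustration; the helper names (`chunked`, `data_parallel_sum`, `task_parallel`, `total_sales`, `longest_name`) and the toy records are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, n_chunks):
    """Split a list into at most n_chunks roughly equal contiguous pieces."""
    if not items:
        return []
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def data_parallel_sum(values, workers=4):
    """Data parallelism: the same operation (sum) runs on every chunk."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunked(values, workers)))
    return sum(partial_sums)  # combining the partial results is a serial step


def task_parallel(records):
    """Task parallelism: two different jobs run at the same time."""
    def total_sales(rows):
        return sum(amount for _, amount in rows)

    def longest_name(rows):
        return max((name for name, _ in rows), key=len)

    with ThreadPoolExecutor(max_workers=2) as pool:
        sales_future = pool.submit(total_sales, records)
        name_future = pool.submit(longest_name, records)
        return sales_future.result(), name_future.result()
```

Note the last line of `data_parallel_sum`: even in this tiny sketch, the partial results have to be combined in one place, and that step does not get faster when workers are added.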

Amdahl's law says your speed-up is capped by the part of the work that can't be parallelised. If a fraction $p$ of a job is parallelisable and you throw $N$ processors at it, the best speed-up you can get is:

$$\text{speed-up} = \frac{1}{(1 - p) + \dfrac{p}{N}}$$

$p$: the parallelisable fraction. $N$: the number of processors. The serial part $(1 - p)$ sets a hard ceiling that no amount of hardware beats.

The lesson is sobering: if 10% of your job is inherently serial ($p = 0.9$), then even with infinite processors you can never go more than 10× faster, because as $N \to \infty$ the speed-up approaches $1/(1-p) = 10$. Adding machines yields sharply diminishing returns, and the serial bottleneck, not the hardware, is what you must attack. It's why "just add more nodes" so often disappoints.
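To see the diminishing returns in numbers, here is a minimal Python sketch of the formula. The function names `amdahl_speedup` and `amdahl_limit` and the sample processor counts are illustrative choices, not part of any library.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Best-case speed-up for parallelisable fraction p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Speed-up ceiling as n grows without bound: 1 / (1 - p)."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"N={n:>5}: {amdahl_speedup(0.9, n):.2f}x")
    print(f"ceiling: {amdahl_limit(0.9):.2f}x")
```

With $p = 0.9$, ten processors give roughly 5.3×, a hundred roughly 9.2×, and a thousand still fall short of 10×. Going from 100 to 1000 nodes buys less than one extra multiple of speed.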

Clusters, HPC, and MPI

High-Performance Computing (HPC) is the classic cluster world: a supercomputer is really hundreds or thousands of nodes wired together with a very fast network and shared by many researchers. You mostly don't run things interactively; you submit a job to a scheduler (like Slurm), which queues it and allocates nodes when they're free. Melbourne's SPARTAN is exactly this.

To make many nodes cooperate on one computation, the traditional tool is MPI (Message Passing Interface). Because the nodes don't share memory, they coordinate by explicitly sending messages to each other — "here's my piece of the result, combine it with yours." It's powerful and fast but low-level: you manage the communication by hand, which is precise but error-prone. The big-data frameworks that followed exist largely to hide this complexity.
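The message-passing idea can be illustrated without a cluster. The sketch below is not MPI: it uses Python threads as pretend nodes and a `queue.Queue` as a stand-in for the network, and the names `worker` and `gather_sum` are hypothetical. Real MPI programs run as separate processes, usually on separate machines, but the shape is the same: each worker computes on its own data and sends an explicit message to whoever combines the results.

```python
import queue
import threading


def worker(rank, chunk, outbox):
    """Each pretend node sums its own chunk, then sends (rank, partial sum)."""
    outbox.put((rank, sum(chunk)))


def gather_sum(chunks):
    """Root role: receive one message per worker and combine them."""
    inbox = queue.Queue()  # stands in for the network link to the root node
    threads = [
        threading.Thread(target=worker, args=(rank, chunk, inbox))
        for rank, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    # Blocks until one message per worker has arrived, in whatever order.
    partials = dict(inbox.get() for _ in chunks)
    for t in threads:
        t.join()
    return sum(partials.values()), partials
```

Even at this size, the hand-managed parts are visible: the root has to know how many messages to wait for, and a worker that never sends would leave it blocked forever.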

MapReduce

MapReduce, popularised by Google, was the breakthrough that made distributed data processing accessible. Its insight: express your computation as just two functions, and let the framework handle all the hard distributed plumbing — splitting data, scheduling, moving results, and recovering from failures.

Map — applied to each chunk of data in parallel across the cluster, emitting key-value pairs (e.g. for word count, emit (word, 1) for every word).
Shuffle — the framework groups all values by key and moves them so each key's data lands on one node.
Reduce — combines the values for each key into the final result (sum the 1s → the count per word).
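The three phases above can be sketched for word count in plain Python. This runs on one machine and in a single thread, so it only shows the shape of the computation; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative and do not belong to any MapReduce framework.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word, 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values, here by summing the 1s."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # A real framework would run map_phase on many nodes in parallel.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle_phase(mapped))


counts = word_count(["the cat sat", "the dog sat down"])
```

Only `map_phase` and `reduce_phase` carry problem-specific logic; the grouping in between is generic, which is the part a framework takes over.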

You write two simple functions; the framework turns them into a fault-tolerant job across a thousand machines. The cost is rigidity — many problems are awkward to force into map-then-reduce, and chaining steps means writing slow intermediate results to disk each time. That last weakness is exactly what Spark fixed.

Figure: MapReduce. Input is split across nodes; map runs in parallel emitting key-value pairs; the shuffle groups them by key; reduce combines each key's values into the output. The framework handles the distribution and failures.

Spark

Apache Spark is the modern successor, and its key advance is in-memory computing. Where MapReduce wrote intermediate results to disk between every step, Spark keeps data in the cluster's RAM across steps where it fits. That makes multi-stage jobs (and especially iterative ones like machine learning) dramatically faster, with speed-ups of 10–100× commonly cited for workloads that stay in memory.

Two ideas make it work:

RDDs / DataFrames — a distributed collection spread across the cluster that you manipulate as if it were a single object, while Spark runs the operations in parallel underneath.
Lazy evaluation — Spark doesn't run your transformations as you write them; it builds a plan (a graph of operations) and only executes when you ask for a result, letting it optimise the whole pipeline and recompute lost pieces after a failure.
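Lazy evaluation is easier to believe once you watch it not run. The toy class below is a sketch in plain Python, not Spark's API: `LazyDataset`, its methods, and the `double` helper are hypothetical names chosen to show transformations being recorded as a plan and executed only when a result is requested.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated collection (not Spark's API)."""

    def __init__(self, source, plan=()):
        self._source = source  # the original data, never modified
        self._plan = plan      # recorded transformations, not yet run

    def map(self, fn):
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._plan + (("filter", predicate),))

    def collect(self):
        """Action: only now does the recorded plan actually execute."""
        rows = iter(self._source)
        for kind, fn in self._plan:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every time double() really runs


def double(x):
    calls.append(x)
    return x * 2


pipeline = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
ran_before_collect = len(calls)  # 0: building the pipeline ran nothing
result = pipeline.collect()      # [6, 8]
```

Because the dataset keeps its source and its plan rather than any intermediate results, calling `collect()` again rebuilds the answer from scratch. That loosely mirrors how a recorded plan lets lost pieces be recomputed after a failure.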

The result is a tool that feels like writing ordinary data code (Spark even speaks SQL and a pandas-like API) but runs across a cluster — which is why it's the default for large-scale analytics today.

The cloud

A cluster used to mean buying and racking your own machines. The cloud changed the economics: rent computing from AWS, Azure, or Google on demand and pay only for what you use. Its defining feature is elasticity — spin up 100 machines for an hour to crunch a job, then shut them down — turning a huge capital purchase into a small operating cost. Providers sell it at three levels of abstraction:

IaaS (Infrastructure) — raw virtual machines and storage; you manage the rest. Maximum control.
PaaS (Platform) — a managed environment to run your code; the provider handles the servers and scaling.
SaaS (Software) — finished applications you just use (Gmail, this site's analytics).

For data work, the cloud's managed services are the real draw: a Spark cluster, a data warehouse, or a model-training rig that you rent for an afternoon instead of owning. The trade-offs are ongoing cost, vendor lock-in, and putting your data on someone else's infrastructure — which is a live concern for the government and health data I work with.

Storage and the CAP trade-off

Data too big for one machine can't sit on one disk either, so it's spread across the cluster with a distributed file system (HDFS) or object storage (S3) — and replicated, so a dead drive doesn't lose anything. But distributing data forces a deep trade-off, captured by the CAP theorem: when the network between nodes fails (a partition, which will happen), a system can guarantee consistency (everyone sees the same data) or availability (every request still gets an answer) — but not both.

So distributed databases pick a side: a bank's ledger favours consistency (better to refuse than to show a wrong balance); a social feed favours availability (a slightly stale post beats an error). There's no free lunch — and recognising which guarantee a system chose tells you exactly how it will behave when something breaks. It's the distributed echo of the ACID guarantees from the single-machine database page.
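The bank-versus-feed choice can be made concrete with a toy model. This is a sketch under heavy simplification, not how any real database implements replication: `Replica`, `Unavailable`, `replicate`, and the `"CP"`/`"AP"` mode strings are hypothetical names, and a partition is modelled as a simple flag.

```python
class Unavailable(Exception):
    """Raised when a replica refuses to answer rather than risk stale data."""


class Replica:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True when cut off from the other replicas

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse instead of returning possibly stale data.
            raise Unavailable("cannot confirm latest value during partition")
        # Availability first (or no partition): answer with the local copy.
        return self.value


def replicate(value, replicas):
    """Copy a write to every replica that is still reachable."""
    for replica in replicas:
        if not replica.partitioned:
            replica.value = value


ledger = Replica("CP")  # bank-style: prefers refusing to being wrong
feed = Replica("AP")    # feed-style: prefers answering to being current
replicate(100, [ledger, feed])

ledger.partitioned = feed.partitioned = True
replicate(250, [ledger, feed])  # the new write cannot reach either replica

stale_answer = feed.read()      # 100: available, but out of date
```

Calling `ledger.read()` at this point raises `Unavailable`. Both replicas hold the same stale value; the only difference is which failure each one chooses to show.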

Choosing the right tool

The most important skill here is also the most under-rated: knowing when you don't need any of this. Distribution adds enormous complexity — network failures, coordination overhead, harder debugging, Amdahl's ceiling — so the right default is to push a single machine first. Modern servers have hundreds of gigabytes of RAM; a great deal of "big data" fits comfortably on one, and runs faster there than on a cluster whose coordination overhead eats the gains.

Reach for a cluster only when the data genuinely won't fit or the job genuinely won't finish in time — and then prefer a managed framework (Spark on a cloud service) over hand-rolled MPI unless you truly need the low-level control. The rule of thumb: the simplest thing that fits the problem, scaled up before scaled out.

Where it shows up in my work

Refresh in 60 seconds

When data outgrows one machine, scale out (many ordinary nodes) rather than up — a cluster. It solves size but adds splitting, coordination, and failure.
Amdahl's law $\frac{1}{(1-p)+p/N}$: the serial fraction caps your speed-up, so more nodes have diminishing returns.
HPC + MPI: nodes coordinate by explicit message passing (low-level, fast). MapReduce: write map + reduce, the framework distributes it (rigid, disk-heavy).
Spark: in-memory, RDDs/DataFrames + lazy evaluation; speed-ups of 10–100× are commonly cited for multi-step/iterative jobs that stay in memory. The modern default.
The cloud rents elastic compute (IaaS/PaaS/SaaS); managed services are the draw, with cost/lock-in/sovereignty trade-offs.
Distributed storage replicates data; the CAP theorem forces a choice between consistency and availability under a network partition. Scale up before out — distribution isn't free.