Clustering

Most machine learning is supervised — you have labelled examples to learn from. Clustering is the opposite: there are no labels, no correct answer, just data, and the task is to discover the natural groupings hidden inside it. Which customers behave alike? Which documents are about the same thing? Which regions share a pattern? Clustering answers those without anyone ever defining the groups in advance.

It's the headline example of unsupervised learning, and it pairs directly with the PCA page: reduce dimensions first, then cluster in the cleaner low-D space. This page builds the two workhorses — k-means and hierarchical — from first principles, and is honest about when each one lies to you.

Finding groups without labels

A cluster is a set of points that are more similar to each other than to points outside it. That's the whole goal: maximise similarity within a group and difference between groups. Because there's no ground truth, clustering is genuinely exploratory — you're forming hypotheses about structure, not predicting a known target.

That freedom is also the catch. There's no single correct clustering of a dataset — the "right" answer depends on what you mean by similar, how many groups you ask for, and which algorithm's assumptions match your data's shape. So the craft is less about running the algorithm and more about choosing those things well, then sanity-checking the result.

Distance and similarity

Everything in clustering rests on a notion of how far apart two points are. The default is Euclidean distance — ordinary straight-line distance:

d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}

But it isn't always the right one. Cosine similarity (the angle between vectors) is better when direction matters more than magnitude — the same measure that compares text embeddings on the NLP page. And there's a trap carried straight over from PCA: in high dimensions, distances concentrate — every pair of points ends up roughly equidistant, and "nearest" stops meaning anything.

k-means

k-means is the most-used clustering algorithm, and its appeal is simplicity. You tell it how many clusters you want ( $k$ ), and it finds $k$ centre points (centroids) and assigns every point to its nearest one. Formally it minimises the total squared distance from points to their cluster's centroid — the within-cluster sum of squares (also called inertia):

J = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} \lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert^2

J measures how tight the clusters are. k-means searches for the centroids that make it smallest.

You can't minimise that directly, but a beautifully simple loop — Lloyd's algorithm — does it by alternating two steps until nothing moves:

Assign — put each point in the cluster of its nearest centroid.
Update — move each centroid to the mean of the points now assigned to it: $\boldsymbol{\mu}_j = \frac{1}{|C_j|}\sum_{\mathbf{x}\in C_j}\mathbf{x}$ .

Each round can only lower $J$ , so it always converges. The catch: it converges to a local minimum that depends on the random starting centroids, so in practice you run it several times (the k-means++ initialisation spreads the starts out) and keep the best. It's fast and scales well — which is why, despite its flaws, it's everywhere.

Lloyd's algorithm. Starting from random centroids, k-means alternates assigning points to the nearest centroid and moving each centroid to its points' mean — repeating until the centroids stop moving.

Choosing k

k-means makes you pick the number of clusters up front, which feels like cheating — if you knew the groups, you wouldn't need to cluster. Two standard tools help you choose:

The elbow method — run k-means for a range of $k$ , plot the inertia $J$ against $k$ . It always falls (more clusters fit tighter), but the rate of improvement bends sharply at a point — the "elbow" — beyond which extra clusters barely help. The same elbow logic as the PCA scree plot.
The silhouette score — measures how well each point sits in its cluster versus the next-nearest one; you pick the $k$ that maximises the average. More principled than the elbow, and it doubles as a quality check (see below).

Where k-means fails

k-means quietly assumes your clusters are round, similarly sized, and equally dense — because it carves space into straight-edged regions around centroids. When that assumption is wrong, it confidently returns the wrong answer:

Non-spherical shapes — two crescent moons or concentric rings get sliced straight through, because k-means can only draw round blobs.
Unequal sizes or densities — a big sparse cluster gets eaten by a small dense one nearby.
Outliers — because it uses means, a few extreme points drag centroids away from the real centre.
You must pre-specify k — and it will always find exactly that many clusters, even if the data has none.

Each failure points to a different tool — which is why you need more than one clustering method in your kit.

Hierarchical clustering

Hierarchical clustering takes a completely different angle: it builds a whole tree of clusters instead of a single flat grouping, and you don't have to choose $k$ in advance. The common agglomerative (bottom-up) version is intuitive:

Start with every point as its own cluster.
Repeatedly merge the two closest clusters.
Continue until everything is one cluster.

What "closest" means between clusters (not points) is the linkage choice — single (nearest pair), complete (farthest pair), or average — and it strongly shapes the result. The output is a dendrogram: a tree showing every merge and the distance at which it happened. You "cut" the tree at a height to get whatever number of clusters you want, reading the structure off afterwards rather than committing to it first.

A dendrogram. Points merge into clusters from the bottom up; the height of each merge is how far apart the groups were. Cut horizontally at any height to read off that many clusters.

Density-based clustering

A third family fixes k-means' shape problem directly. DBSCAN defines clusters as dense regions separated by sparse ones: a cluster grows by chaining together points that each have enough neighbours within a small radius. Because it follows density rather than distance-to-a-centre, it can trace clusters of any shape — those crescent moons k-means mangles — and it does two things k-means can't: it finds the number of clusters itself, and it labels low-density points as noise rather than forcing every point into a group. The trade-off is that it struggles when clusters have very different densities, and it has its own parameters to tune.

Evaluating clusters

With no labels, how do you know if a clustering is any good? You measure whether points sit comfortably in their assigned group. The silhouette score does this per point: let $a$ be its average distance to others in its own cluster and $b$ its average distance to the nearest other cluster. Then

s = \frac{b - a}{\max(a, b)} \quad \in [-1, 1]

A value near +1 means the point is snug in its cluster and far from others (good); near 0 means it's on a boundary; and negative means it's probably in the wrong cluster. Average it across all points and you have a single, label-free quality number — useful both for judging a clustering and for choosing $k$ . But no metric replaces the real test: do the clusters mean something you can act on?

Where it shows up in my work

The map you draw before you know the territory

Clustering is how I find structure in data nobody has organised yet. For segmentation — grouping people, regions, or behaviours into cohorts that warrant different treatment — it's the natural first move, and in intelligence work those cohorts often are the finding. As EDA it earns its keep early: run it on a fresh dataset and the clusters, and especially the outliers DBSCAN flags as noise, point straight at what's worth a closer look.

The discipline the page describes is the part that matters in practice: scale and reduce first (the PCA pairing), never trust a single $k$ or a single algorithm, and always ask whether a statistically tidy cluster is a real, actionable group — because clustering will always return something, whether or not it means anything.

Refresh in 60 seconds

Clustering = unsupervised grouping: maximise similarity within groups, difference between them. No ground truth, so it's exploratory.
Everything rests on distance (Euclidean / cosine). Standardise and reduce dimensions first or distances mislead.
k-means minimises within-cluster sum of squares via assign→update (Lloyd's); fast, but you pick $k$ , it finds local optima, and it assumes round, equal clusters.
Choose $k$ with the elbow or silhouette. k-means fails on non-spherical / unequal / outlier-heavy data.
Hierarchical builds a dendrogram (no k upfront; cut to taste); DBSCAN follows density (any shape, finds k itself, labels noise).
Evaluate label-free with the silhouette $s=\frac{b-a}{\max(a,b)}$ — but the real test is whether the clusters are actionable.