Topic Modelling

Imagine a folder of ten thousand documents — survey responses, reports, articles, intelligence notes — that nobody has time to read. What are they about? What themes run through them, and which documents share which? Topic modelling answers exactly this: it's a family of unsupervised methods that automatically discover the latent themes ("topics") in a collection of text, without any labels and without being told in advance what to look for.

It's a genuinely distinct tool: not classification (there are no categories to predict) and not general-purpose NLP, but the clustering idea applied to documents. This page builds up the classic method (LDA), how to read and judge its output, and the modern embedding-based successors, along with the honest warning that the topics it finds aren't always meaningful.

Finding themes without labels

The defining feature is that it's unsupervised: you don't tell it the topics, it discovers them from patterns of word co-occurrence across the documents. If a set of words ("budget", "deficit", "spending", "tax") keeps appearing together across many documents, that recurring cluster of words is a topic, which a human can then recognise as "fiscal policy". The model finds the statistical structure; the person reading the word group supplies the meaning.

This makes it the natural first move on any large, unread text collection: before you can analyse a corpus, you need to know what's in it, and topic modelling gives you that map. The output is two linked things — what topics exist (as word groups), and how much of each topic each document contains.

Documents as bags of words

Classic topic modelling starts from the bag-of-words representation (the same starting point as information retrieval): a document is reduced to the multiset of words it contains, ignoring order entirely. "The cat sat" and "sat the cat" look identical. That sounds lossy — and it is — but for finding themes it works surprisingly well, because a document's subject matter is carried mostly by which words appear and how often, not the order they're in. Topic modelling exploits exactly the co-occurrence patterns this representation preserves.
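A minimal sketch of the bag-of-words idea, using only the Python standard library. The tokeniser here (lower-casing and splitting on anything that isn't a letter or apostrophe) is an assumption made for illustration; real pipelines usually also remove stop words and normalise word forms.

```python
from collections import Counter
import re

def bag_of_words(text: str) -> Counter:
    """Lower-case the text, pull out word tokens, and count them.

    The Counter is a multiset: it keeps how often each word occurs
    and throws away the order the words appeared in.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

a = bag_of_words("The cat sat")
b = bag_of_words("sat the cat")
print(a == b)                              # same words, different order
print(bag_of_words("tax tax budget"))      # repeated words are counted
```

The two example sentences from the text compare as equal because only the counts survive, which is exactly the information co-occurrence methods work from.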

LDA: the generative story

The canonical method is Latent Dirichlet Allocation (LDA). Its clever move is to imagine a generative story for how documents get written, then run it backwards. The story has two simple ideas:

Each topic is a distribution over words (the "fiscal policy" topic puts high probability on "budget", "tax", "deficit").
Each document is a mixture of topics (a news article might be 70% fiscal policy, 20% politics, 10% economy).

[Figure: LDA's two-level structure. Each document is a mixture of topics; each topic is a distribution over words.]

LDA observes only the words; the topics and mixtures are latent (hidden). It works backwards through inference to find the set of topics, and the per-document mixtures, that best explain the words actually seen. The "Dirichlet" part refers to the priors placed on these distributions: with small concentration parameters, the prior on the mixtures encourages each document to be about a few topics rather than all of them, which keeps the result interpretable. The intuition is what matters: documents = mixtures of topics, topics = distributions over words, inferred from co-occurrence alone.
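To make the generative story concrete, here is a rough sketch that runs it forwards for a single document. It only shows the forward direction; actual LDA fitting is the reverse (inference), which this does not attempt. The toy topics, the `alpha` values, and the function names are all hypothetical, chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Run the generative story forwards for one document."""
    names = list(topics)
    theta = sample_dirichlet(alpha, rng)            # this document's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(names, weights=theta)[0]    # choose a topic for this word slot
        vocab = list(topics[z])
        probs = list(topics[z].values())
        words.append(rng.choices(vocab, weights=probs)[0])  # choose a word from that topic
    return dict(zip(names, theta)), words

# Hypothetical toy topics: each one is a probability distribution over words.
TOPICS = {
    "fiscal": {"budget": 0.4, "tax": 0.3, "deficit": 0.3},
    "health": {"patient": 0.4, "hospital": 0.3, "treatment": 0.3},
}

rng = random.Random(0)
# alpha below 1 tends to produce mixtures dominated by a few topics.
mixture, doc = generate_document(TOPICS, alpha=[0.5, 0.5], n_words=12, rng=rng)
print(mixture)
print(doc)
```

Inference starts from word lists like `doc` alone and tries to recover plausible values for both `TOPICS` and each document's `mixture`.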

Reading the topics

LDA's output for each topic is a ranked list of its most probable words — topic 4 might be {patient, hospital, treatment, clinical, care}. The crucial, often-missed point: the model does not name the topics. It hands you word groups; a human reads "patient, hospital, treatment…" and labels it "healthcare". Topic modelling is a tool for assisting human interpretation, not replacing it — its value is surfacing the structure fast, and the analyst supplies the meaning. Alongside the topics you get each document's mixture, which lets you tag, filter, and trace themes across the whole collection.
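As an illustration of working with the two outputs, the sketch below ranks each topic's words and filters documents by their share of a topic. The `topic_word` and `doc_topic` tables are made-up stand-ins for a fitted model's output, and the 0.3 threshold is an arbitrary assumption.

```python
def top_words(topic_word, n=5):
    """Return each topic's n most probable words, highest first."""
    return {
        topic: [w for w, _ in sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:n]]
        for topic, dist in topic_word.items()
    }

def documents_about(doc_topic, topic, threshold=0.3):
    """List documents whose share of `topic` is at least `threshold`."""
    return [doc for doc, mix in doc_topic.items() if mix.get(topic, 0.0) >= threshold]

# Hypothetical fitted output: word probabilities per topic, topic shares per document.
topic_word = {
    4: {"budget": 0.001, "patient": 0.09, "hospital": 0.07,
        "treatment": 0.06, "clinical": 0.05, "care": 0.04},
    7: {"budget": 0.08, "tax": 0.07, "deficit": 0.05, "spending": 0.04, "care": 0.002},
}
doc_topic = {
    "report_a": {4: 0.80, 7: 0.20},
    "report_b": {4: 0.10, 7: 0.90},
    "report_c": {4: 0.45, 7: 0.55},
}

# The labels come from the analyst reading the word lists, not from the model.
labels = {4: "healthcare", 7: "fiscal policy"}

for topic, words in top_words(topic_word).items():
    print(topic, labels[topic], words)
print(documents_about(doc_topic, 4))
```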

How many topics? The hard choice

LDA needs you to specify the number of topics up front — and there's no objectively correct answer, exactly the "choosing k" problem from clustering. Too few and distinct themes get mashed together; too many and topics fragment into noise and near-duplicates.

The standard guide is a coherence score, which measures how semantically related a topic's top words are — do they genuinely "go together" to a human? You compute coherence across a range of topic counts and look for where it peaks. But it's a guide, not an oracle: coherence typically rises, plateaus, then declines, and the final call still rests on human judgement about whether the topics are useful. As with clustering, the number of topics is a modelling decision you have to own, not a parameter the data hands you.
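The text doesn't say which coherence measure is meant, so the sketch below assumes one common choice, UMass coherence, which scores a topic by how often its top words share a document. The toy corpus and the two word lists are invented for illustration.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass coherence for one topic.

    Sums log((D(w_hi, w_lo) + 1) / D(w_hi)) over every pair of top words,
    where w_hi is ranked above w_lo and D counts the documents containing
    the given word(s). Higher scores mean the words co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for higher, lower in combinations(top_words, 2):   # higher-ranked word comes first
        d_higher = doc_freq(higher)
        if d_higher == 0:
            continue  # word absent from the reference corpus; skip the pair
        score += math.log((doc_freq(higher, lower) + 1) / d_higher)
    return score

# Hypothetical tokenised corpus.
corpus = [
    ["budget", "tax", "deficit", "spending"],
    ["budget", "tax", "spending"],
    ["tax", "deficit", "budget"],
    ["patient", "hospital", "treatment"],
    ["patient", "hospital", "care"],
    ["hospital", "treatment", "care", "patient"],
]

coherent_topic = ["budget", "tax", "deficit"]
mixed_topic = ["budget", "patient", "tax"]
print(umass_coherence(coherent_topic, corpus))
print(umass_coherence(mixed_topic, corpus))
```

To compare topic counts, one would fit a model for each candidate number, average this score over that model's topics, and look at how the average moves, then still read the topics before settling on a number.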

NMF & the neural successors

LDA isn't the only option. Non-negative matrix factorisation (NMF) reaches similar results by a different route: it factorises the document-word matrix into two non-negative matrices, one giving each document's weights over topics and one giving each topic's weights over words (the same matrix-factorisation family as PCA and recommenders). It is often faster and sometimes crisper on short text.
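One common way to write the factorisation, with symbols that are assumptions of this sketch rather than notation from the text: V is the document-word matrix for n documents and m vocabulary words, and k is the chosen number of topics.

```latex
V \approx W H,
\qquad V \in \mathbb{R}_{\ge 0}^{n \times m},
\quad W \in \mathbb{R}_{\ge 0}^{n \times k},
\quad H \in \mathbb{R}_{\ge 0}^{k \times m}

\min_{W \ge 0,\; H \ge 0} \; \lVert V - W H \rVert_F^2
```

Row i of W holds document i's weights over the k topics, and row j of H holds topic j's weights over the vocabulary, so reading the largest entries in a row of H plays the same role as reading an LDA topic's top words. The squared Frobenius norm shown here is one frequently used objective; other loss functions are also used.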

The bigger shift is the modern, embedding-based approach (BERTopic and kin), which embeds documents as dense vectors that capture meaning, clusters those vectors, and derives topics from the clusters. Because the embeddings encode meaning rather than just word counts, it handles synonyms and short text far better and usually yields more coherent topics, at higher computational cost. It's the same bag-of-words → embeddings progression that runs through retrieval and NLP generally.
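The embed, cluster, describe pipeline can be sketched in miniature with the standard library. Everything here is a stand-in: the 2-D `embeddings` are hand-written rather than produced by a language model, plain k-means with fixed starting centroids is swapped in for whatever clustering a real tool uses, and the word weighting is a simplified class-based TF-IDF-style score, not any library's exact formula.

```python
from collections import Counter
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's k-means from given starting centroids; returns cluster labels."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return [nearest(p, centroids) for p in points]

def cluster_top_words(docs, labels, n=2):
    """Describe each cluster by words frequent inside it and rare across the corpus."""
    per_cluster = {}
    for text, label in zip(docs, labels):
        per_cluster.setdefault(label, Counter()).update(text.split())
    overall = sum(per_cluster.values(), Counter())
    avg_len = sum(overall.values()) / len(per_cluster)
    topics = {}
    for label, counts in per_cluster.items():
        ranked = sorted(
            counts,
            key=lambda w: counts[w] * math.log(1 + avg_len / overall[w]),
            reverse=True,
        )
        topics[label] = ranked[:n]
    return topics

docs = [
    "tax budget deficit",
    "budget spending tax",
    "hospital patient care",
    "patient treatment hospital",
]
# Hypothetical 2-D stand-ins for document embeddings.
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

labels = kmeans(embeddings, centroids=[[1.0, 0.0], [0.0, 1.0]])
print(labels)
print(cluster_top_words(docs, labels))
```

The point of the design is that clustering happens in embedding space, where similar documents sit close together even when they share few words, and words only come back in at the end to describe each cluster.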

When topics are junk

The honest caveat, because topic modelling can flatter to deceive: the topics it finds aren't always meaningful, so they need validating before anyone relies on them.

Where it shows up in my work

Mapping a corpus you can't read

The recurring problem topic modelling solves is a real one in analytical and intelligence work: a large pile of unstructured text — reports, free-text responses, document collections — that's too big to read but needs to be understood. Topic modelling gives a fast map of what's in there and which documents cluster around which themes, turning an unreadable corpus into something navigable.

What keeps it honest is holding two things together: it's a human-in-the-loop tool (the model finds word groups, I supply the meaning and the labels), and the topics can be junk until validated — so it generates hypotheses to check, not conclusions to report. It ties straight to clustering (the same unsupervised idea), NLP (the text processing), and the embedding methods shared with retrieval.

Refresh in 60 seconds

The LDA generative framing, coherence-based topic-count selection, and the BERTopic comparison reflect current topic-modelling references alongside NLP coursework.