Topic Modelling
You have ten thousand documents and no time to read them. Topic modelling reads them for you — discovering the themes running through the collection, unsupervised, without anyone ever telling it what to look for.
- Studied: Topic Modelling (Advanced · themes in text)
- When: NLP coursework
- Applied in: Making sense of document piles
- Read / Refreshed: ~14 min read · 2026-06-26
Imagine a folder of ten thousand documents — survey responses, reports, articles, intelligence notes — that nobody has time to read. What are they about? What themes run through them, and which documents share which? Topic modelling answers exactly this: it's a family of unsupervised methods that automatically discover the latent themes ("topics") in a collection of text, without any labels and without being told in advance what to look for.
It's a distinct tool. Unlike classification, there are no predefined categories to predict; it is closest in spirit to clustering, applied to documents, except that a document can belong to several topics at once. This page builds up the classic method (LDA), how to read and judge its output, and the modern embedding-based successors, along with the honest warning that the topics it finds aren't always meaningful.
01
Finding themes without labels
The defining feature is that it's unsupervised: you don't tell it the topics, it discovers them from patterns of word co-occurrence across the documents. If a set of words — "budget", "deficit", "spending", "tax" — keeps appearing together across many documents, that recurring cluster of words is a topic, which a human can then recognise as "fiscal policy". The model finds the statistical structure; the meaning is in how the words group.
This makes it the natural first move on any large, unread text collection: before you can analyse a corpus, you need to know what's in it, and topic modelling gives you that map. The output is two linked things — what topics exist (as word groups), and how much of each topic each document contains.
02
Documents as bags of words
Classic topic modelling starts from the bag-of-words representation (the same starting point as information retrieval): a document is reduced to the multiset of words it contains, ignoring order entirely. "The cat sat" and "sat the cat" look identical. That sounds lossy — and it is — but for finding themes it works surprisingly well, because a document's subject matter is carried mostly by which words appear and how often, not the order they're in. Topic modelling exploits exactly the co-occurrence patterns this representation preserves.
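To make the representation concrete, here is a minimal sketch using only the standard library. The tokenisation (lowercase, split on whitespace) is a simplifying assumption; real pipelines usually also strip punctuation and stopwords.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("The cat sat")
b = bag_of_words("sat the cat")
print(a == b)  # True: both reduce to the same multiset of words
```

The counts are all that later stages see, so any theme that depends on word order is invisible to them.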
03
LDA: the generative story
The canonical method is Latent Dirichlet Allocation (LDA). Its clever move is to imagine a generative story for how documents get written, then run it backwards. The story has two simple ideas:
- Each topic is a distribution over words (the "fiscal policy" topic puts high probability on "budget", "tax", "deficit").
- Each document is a mixture of topics (a news article might be 70% fiscal policy, 20% politics, 10% economy).
LDA observes only the words; the topics and mixtures are latent (hidden). It works backwards through inference to find the set of topics, and the per-document mixtures, that best explain the words actually seen. The "Dirichlet" part names the prior placed on those mixtures (and on the topics' word distributions): with a small concentration parameter it encourages each document to be about a few topics rather than all of them, which keeps the result interpretable. The intuition is what matters: documents = mixtures of topics, topics = distributions over words, inferred from co-occurrence alone.
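The generative story can be sketched in its forward direction, which is the part LDA then inverts. This is an illustration only, not an inference algorithm: the two toy topics, the `alpha` values, and the function names are all hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Forward (generative) direction of the LDA story for one document."""
    # 1. Draw this document's topic mixture from the Dirichlet prior.
    mixture = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word slot according to the mixture.
        k = rng.choices(range(len(topics)), weights=mixture)[0]
        # 3. Pick a word from that topic's distribution over words.
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return mixture, words

# Hypothetical toy topics: each one is a distribution over words.
topics = [
    {"budget": 0.4, "tax": 0.3, "deficit": 0.3},
    {"patient": 0.4, "hospital": 0.3, "treatment": 0.3},
]
rng = random.Random(0)
mixture, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(p, 2) for p in mixture], words)
```

Values of `alpha` below 1 tend to produce lopsided mixtures, which is the "few topics per document" preference described above. Fitting LDA means recovering `topics` and each `mixture` when only `words` is observed.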
04
Reading the topics
LDA's output for each topic is a ranked list of its most probable words — topic 4 might be {patient, hospital, treatment, clinical, care}. The crucial, often-missed point: the model does not name the topics. It hands you word groups; a human reads "patient, hospital, treatment…" and labels it "healthcare". Topic modelling is a tool for assisting human interpretation, not replacing it — its value is surfacing the structure fast, and the analyst supplies the meaning. Alongside the topics you get each document's mixture, which lets you tag, filter, and trace themes across the whole collection.
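As a rough sketch of what reading the output looks like in code, assume a fitted model has already been turned into two plain dictionaries. The data, document names, and helper names below are hypothetical; the label is typed in by a person, not produced by the model.

```python
def top_words(topic, n=5):
    """Return the n most probable words of a topic, highest probability first."""
    ranked = sorted(topic.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

def docs_about(doc_mixtures, topic_id, min_share=0.5):
    """Names of documents whose mixture gives topic_id at least min_share."""
    return [name for name, mix in doc_mixtures.items()
            if mix.get(topic_id, 0.0) >= min_share]

# Hypothetical model output: topic id -> word probabilities (truncated).
topic_word = {
    4: {"patient": 0.09, "hospital": 0.07, "treatment": 0.06,
        "clinical": 0.05, "care": 0.04, "tax": 0.001},
}
# Hypothetical per-document mixtures: document -> {topic id: share}.
doc_mixtures = {
    "doc_17": {4: 0.7, 1: 0.2, 9: 0.1},
    "doc_42": {4: 0.1, 1: 0.1, 9: 0.8},
}
labels = {4: "healthcare"}  # supplied by the analyst after reading the words

print(top_words(topic_word[4]))     # ['patient', 'hospital', 'treatment', 'clinical', 'care']
print(docs_about(doc_mixtures, 4))  # ['doc_17']
```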
05
How many topics? The hard choice
LDA needs you to specify the number of topics up front — and there's no objectively correct answer, exactly the "choosing k" problem from clustering. Too few and distinct themes get mashed together; too many and topics fragment into noise and near-duplicates.
The standard guide is a coherence score, which measures how semantically related a topic's top words are: do they genuinely "go together" to a human? You compute coherence across a range of topic counts and look for where it peaks. But it's a guide, not an oracle: coherence often rises, plateaus, then declines as topics are added, the curve can be noisy, and the final call still rests on human judgement about whether the topics are useful. As with clustering, the number of topics is a modelling decision you have to own, not a parameter the data hands you.
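The text above doesn't commit to a particular coherence measure, so the sketch below assumes one common choice, a UMass-style score based on document co-occurrence. The toy corpus and word lists are hypothetical, and the function assumes every top word occurs in at least one document.

```python
import math

def umass_coherence(top_words, documents):
    """UMass-style coherence: sum of log((D(w_i, w_j) + 1) / D(w_j)) over pairs
    where w_j is ranked above w_i and D counts documents containing the word(s)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            joint = doc_freq(top_words[i], top_words[j])
            score += math.log((joint + 1) / doc_freq(top_words[j]))
    return score

# Hypothetical toy corpus, already tokenised.
docs = [
    ["budget", "tax", "deficit"],
    ["budget", "tax", "spending"],
    ["patient", "hospital", "care"],
    ["patient", "hospital", "treatment"],
]
coherent = umass_coherence(["budget", "tax"], docs)    # words that co-occur
mixed = umass_coherence(["budget", "patient"], docs)   # words that never co-occur
print(coherent > mixed)  # True
```

To compare topic counts, one would fit a model for each candidate count, average this score over its topics, and inspect the resulting curve alongside the topics themselves.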
06
NMF & the neural successors
LDA isn't the only option. Non-negative matrix factorisation (NMF) reaches similar results by a different route — factorising the document-word matrix into topic components (the same matrix-factorisation family as PCA and recommenders), often faster and sometimes crisper on short text.
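For intuition about the factorisation, here is a small pure-Python sketch using the classic multiplicative update rules for squared error. It is an assumption that this variant is a fair illustration; library implementations use faster solvers and regularisation. The count matrix and vocabulary are hypothetical.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error between V and W @ H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factorise non-negative V (docs x words) into W (docs x k) and H (k x words)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Hypothetical document-word counts: two fiscal documents, two healthcare ones.
vocab = ["budget", "tax", "deficit", "patient", "hospital", "care"]
V = [[2, 1, 1, 0, 0, 0],
     [1, 2, 1, 0, 0, 0],
     [0, 0, 0, 2, 1, 1],
     [0, 0, 0, 1, 1, 2]]
W, H = nmf(V, k=2)
for row in H:
    ranked = sorted(zip(vocab, row), key=lambda p: p[1], reverse=True)
    print([word for word, _ in ranked[:3]])
```

Each row of `H` plays the role of a topic (weights over words) and each row of `W` gives a document's weights over topics. Unlike LDA's output, these weights are not probabilities unless you normalise them.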
The bigger shift is the modern, embedding-based approach (BERTopic and kin), which embeds documents as dense vectors that capture meaning, clusters those vectors, and derives topics from the clusters. Because the embeddings capture semantic similarity rather than just word counts, it handles synonyms and short text far better and often yields more coherent topics, at higher computational cost. It's the same bag-of-words → embeddings progression that runs through retrieval and NLP generally.
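The embed, cluster, describe pipeline can be sketched in shape with toy stand-ins. Everything here is an assumption made for illustration: the 2-D vectors are hand-made substitutes for real sentence embeddings, the greedy cosine clustering replaces the density-based clustering such libraries typically use, and the word score is a simplified TF-IDF-style weighting, not BERTopic's own formula.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def cluster_by_similarity(vectors, threshold=0.8):
    """Greedy clustering: join the first cluster whose seed vector is similar enough."""
    seeds, labels = [], []
    for vec in vectors:
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(idx)
                break
        else:
            # No existing cluster is close enough, so this vector seeds a new one.
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels

def cluster_top_words(docs, labels, n=2):
    """Rank words per cluster by frequency, down-weighting words spread across clusters."""
    per_cluster = {}
    for doc, label in zip(docs, labels):
        per_cluster.setdefault(label, Counter()).update(doc.lower().split())
    n_clusters = len(per_cluster)

    def score(word, counts):
        spread = sum(1 for c in per_cluster.values() if word in c)
        return counts[word] * math.log(1 + n_clusters / spread)

    return {label: sorted(counts, key=lambda w: score(w, counts), reverse=True)[:n]
            for label, counts in per_cluster.items()}

docs = ["tax cuts widen deficit", "budget raises tax",
        "hospital treats patient", "patient leaves hospital"]
# Hypothetical stand-ins for embeddings; a real pipeline would compute these
# with a sentence-embedding model.
vectors = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]

labels = cluster_by_similarity(vectors)
print(labels)  # [0, 0, 1, 1]
print(cluster_top_words(docs, labels))
```

The key design difference from LDA is that documents are grouped by vector similarity first and words are only used afterwards to describe each group, so two documents can land in the same topic without sharing a single word.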
07
When topics are junk
The honest caveats, because topic modelling can flatter to deceive:
08
Where it shows up in my work
09
Refresh in 60 seconds
The LDA generative framing, coherence-based topic-count selection, and the BERTopic comparison reflect current topic-modelling references alongside NLP coursework.