
Natural Language Processing

How you get a machine to read. From counting words to attention — the path from a bag of tokens to a model that holds a sentence in its head.

Studied
Natural Language Processing · COMP90042 · Master of Data Science
When
UniMelb, 2024
Applied in
Climate Fact-Checker
Read / Refreshed
~14 min read · 2026-06-24

Language is the messiest data we routinely ask computers to handle. A spreadsheet column is already a number; a sentence is a sequence of symbols whose meaning depends on order, context, tone, and a mountain of shared assumptions the writer never states. Natural Language Processing (NLP) is the field that bridges that gap — turning text into something a model can compute over, and turning a model's output back into language a person can use.

This page walks the whole arc, the same one I learned at the University of Melbourne: from the oldest trick in the book (count the words) to the architecture behind every modern language model (pay attention to the right words). Each step exists to fix a specific weakness in the step before it.

01

What NLP is, and why it's hard

NLP covers any task where the input or output is human language: classifying a review as positive or negative, pulling the names of companies out of a contract, translating Mandarin to English, answering a question, summarising a report, or generating the next word in a sentence. What unites them is that the raw material — text — resists the tidy assumptions most statistics rely on.

Four difficulties show up again and again:

  • Ambiguity. "I saw her duck" is two different sentences depending on whether duck is a bird or an action. Humans resolve this without noticing; a model has to be given enough context to do the same.
  • Sparsity. The number of possible sentences is effectively infinite, so most word combinations you'll ever meet were never in your training data. Good methods generalise from what they've seen to what they haven't.
  • Order and long-range dependence. "The dog that chased the cat that ran across the road was fast" — the verb agrees with a noun ten words back. Meaning lives in structure, not just in the bag of words present.
  • The symbol grounding gap. Words are discrete symbols with no built-in notion of similarity. Nothing about the strings cat and kitten tells a computer they're related. Much of NLP's progress is really about manufacturing a useful notion of similarity.

Keep those four in mind — every technique below is an answer to one or more of them.

02

The classic pipeline

Before any modelling, raw text is cleaned and chopped into units. This preprocessing is unglamorous but it sets the ceiling on everything downstream — a model can only be as good as the tokens you feed it.

Raw text → Tokenise → Normalise → Represent → Model → Output
The traditional NLP pipeline. Modern end-to-end models fold several of these steps inside the network, but the conceptual stages still hold.

Tokenisation

Tokenisation splits a string into units — usually words, but increasingly subwords. Splitting on spaces seems obvious until you hit "don't", "U.S.A.", hyphenates, emoji, or Chinese, which has no spaces between words at all. Modern systems mostly use subword schemes like Byte-Pair Encoding that learn a vocabulary of frequent fragments, so a rare word like tokenisation becomes token + isation. This keeps the vocabulary small while still representing any word, and it's a direct answer to the sparsity problem.
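To make the subword idea concrete, here is a minimal sketch of Byte-Pair Encoding training in plain Python. It is a toy under stated assumptions: the function names (learn_bpe, merge_pair, segment) and the tiny corpus are mine, and production tokenisers add byte-level fallbacks, word-boundary markers, and far larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Every word starts out as a sequence of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the most frequent pair fused.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical mini-corpus, only for illustration.
corpus = ["token", "tokens", "tokenise", "tokenisation", "organisation"]
merges = learn_bpe(corpus, num_merges=10)
print(segment("tokenisation", merges))
```

Because segmentation only ever fuses characters that were already there, any word can still be represented, even one never seen in training: in the worst case it falls back to single characters.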

Normalisation

Once you have tokens you usually shrink the variation that doesn't matter for your task:

  • Case folding: Apple → apple (careful: it loses the company-vs-fruit distinction).
  • Stemming chops suffixes crudely (running → run, studies → studi); lemmatisation uses a dictionary to map to the real root (better → good). Lemmatisation is slower but generally more accurate.
  • Stop-word removal drops high-frequency, low-information words (the, of, is) — helpful for keyword methods, harmful for anything where grammar carries meaning.
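The three normalisation steps above can be sketched in a few lines. This is an illustration, not a recommended pipeline: the stop-word set and the suffix list are tiny assumptions of mine, and the stemmer here is even cruder than a real one such as Porter's.

```python
# Tiny illustrative stop-word list; real lists have hundreds of entries.
STOP_WORDS = {"the", "of", "is", "a", "and"}


def crude_stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalise(tokens, remove_stop_words=True):
    """Case-fold, optionally drop stop words, then stem each token."""
    out = []
    for tok in tokens:
        tok = tok.casefold()
        if remove_stop_words and tok in STOP_WORDS:
            continue
        out.append(crude_stem(tok))
    return out


print(normalise(["The", "price", "of", "Apples", "is", "rising"]))
# ['price', 'appl', 'ris']
```

The output shows both trade-offs at once: Apples has lost its capital (and with it any hint of the company), and the stems are not real words.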

03

Turning words into numbers

Models need vectors, not strings. The first family of answers treats a document as a bag of words — a count of which terms appear, ignoring order entirely.
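A quick sketch of what "ignoring order entirely" means in practice. The whitespace tokeniser and the example sentences are assumptions for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count terms after naive lower-casing and whitespace splitting."""
    return Counter(text.lower().split())


a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)  # True: identical counts, opposite meanings
```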

Raw counts over-reward common words, so the standard fix is TF-IDF (term frequency × inverse document frequency). It scores a term highly when it's frequent in this document but rare across the collection — exactly the words that make a document distinctive.

tf-idf(t, d) = tf(t, d) · log( N / df(t) )

Here tf(t, d) is how often term t appears in document d, N is the total number of documents, and df(t) is how many documents contain t. A word in every document (like the) gets log(N/N) = 0 and is automatically ignored; a word in one document out of thousands gets a large weight. n-grams (pairs or triples of adjacent words, like not_good) claw back a little of the word order that the bag-of-words threw away.
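The formula above translates almost directly into code. This is a minimal sketch using raw counts for tf and the unsmoothed logarithm exactly as written; the function name and toy documents are mine, and library implementations usually add smoothing and normalisation on top.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # df(t): number of documents that contain term t at least once.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf(t, d): raw count of t in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "purred"],
]
for scores in tf_idf(docs):
    print(scores)
```

In this toy collection the appears in all three documents and scores exactly zero, while purred appears in only one and gets the largest weight.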

TF-IDF is fast, transparent, and still a genuinely strong baseline for document classification and search. Its weakness is the symbol grounding gap: car and automobile are as unrelated as car and banana, because each word is its own independent dimension.

04

Word embeddings

The breakthrough that fixed grounding was the distributional hypothesis: a word's meaning is captured by the company it keeps. Words that appear in similar contexts — tea and coffee — should have similar representations.

Word embeddings turn this into geometry. Each word becomes a dense vector of a few hundred numbers, learned so that words used in similar contexts land near each other. word2vec learns these by training a tiny network to predict a word from its neighbours (CBOW) or its neighbours from the word (skip-gram); GloVe factorises a global co-occurrence matrix to the same end. The famous result is that meaning becomes arithmetic:

vec("king") − vec("man") + vec("woman") ≈ vec("queen")

The gender relationship is encoded as a consistent direction in the space. Cosine similarity — the angle between two vectors — becomes a usable measure of how related two words are, which is precisely the similarity notion TF-IDF lacked.
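Here is a small sketch of cosine similarity and the analogy arithmetic. The three-dimensional vectors are hand-made by me so the geometry is easy to follow (roughly: royalty, maleness, femaleness); real embeddings are learned from text and have hundreds of dimensions with no such readable axes.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, NOT learned embeddings.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "banana": [0.0, 0.1, 0.1],
}


def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the three query words from the candidates is the usual convention for this test; without it the nearest vector is often one of the inputs.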

The catch: classic embeddings are static. bank has one vector whether it's a river bank or a savings bank. Fixing that needs a model that reads the whole sentence — which brings us to sequences.

05

Sequence models: RNNs and LSTMs

To respect word order, a recurrent neural network (RNN) reads one token at a time and carries a hidden state forward — a running summary of everything seen so far. In principle that lets the network condition each word on all the words before it.

In practice, plain RNNs forget. Training them means multiplying gradients through every time step, and those products shrink toward zero over long distances — the vanishing gradient problem. The network can't learn that a verb agrees with a subject twenty words back.
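The shrinking product is easy to see in a one-dimensional RNN. The sketch below assumes a scalar hidden state with a tanh activation and made-up weights (w, u); it tracks how much the state at each step still depends on the very first state.

```python
import math


def gradient_to_initial_state(w, u, inputs, h0=0.0):
    """For h_t = tanh(w * h_{t-1} + u * x_t), return d h_t / d h_0 at every step."""
    h = h0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w * h + u * x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2) for tanh.
        grad *= w * (1.0 - h * h)
        history.append(grad)
    return history


# Hypothetical weights and a constant input, for illustration only.
history = gradient_to_initial_state(w=0.9, u=0.5, inputs=[1.0] * 30)
print(history[0], history[9], history[29])
```

Each factor here is at most 0.9 in magnitude, so after 30 steps the gradient is bounded by 0.9 to the power 30, already well below 0.05, and the first token has almost no influence on what the network learns.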

The Long Short-Term Memory (LSTM) network fixes this with a separate cell state and a set of gates — small learned valves that decide what to forget, what to add, and what to read out at each step. Information can now flow along the cell state almost untouched across long spans, so LSTMs capture much longer dependencies. For years they were the default for translation, speech, and tagging.
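A scalar LSTM cell makes the gates tangible. This is a sketch of the standard update equations with a one-dimensional state; the parameter layout and the specific numbers are assumptions chosen so the forget gate stays open and the input gate stays nearly shut.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM. params maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    o = sigmoid(pre("output"))       # how much of the cell to expose
    g = math.tanh(pre("candidate"))  # the new information itself
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters: forget gate ~1, input gate ~0.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for _ in range(50):
    h, c = lstm_step(0.5, h, c, params)
print(c)  # still close to the starting value of 1.0
```

With these settings the cell state is multiplied by a number just under one at each step, so a value written at the start survives fifty steps almost intact. That is the "flow almost untouched" behaviour a plain RNN cannot reproduce.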

But they still have two structural limits: they process tokens strictly one after another (so each step waits for the last, making them slow to train), and even with gates, a single fixed-size state is a bottleneck for very long inputs. Both fall to the next idea.

06

Attention and the Transformer

Attention is the insight that you don't need to cram a whole sentence into one running state. Instead, when processing a given word, let it look directly at every other word and pull in the ones that matter. For "it" in "the trophy didn't fit in the suitcase because it was too big", attention lets it reach back and weight trophy heavily.

The 2017 paper Attention Is All You Need threw out recurrence entirely and built a model — the Transformer — from attention alone. Each word emits three vectors: a query (what am I looking for?), a key (what do I offer?), and a value (what do I pass on?). A word's new representation is a weighted sum of all values, where the weights come from how well its query matches each key:

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V

The Q·Kᵀ term scores every word against every other word; dividing by √dₖ stops those scores growing with the vector size, which would otherwise saturate the softmax and shrink its gradients; the softmax turns them into weights that sum to one; multiplying by V mixes the values accordingly. Because this compares all positions at once, the whole sequence is processed in parallel rather than one step at a time. Two more pieces make it work:

  • Multi-head attention. Several attention mechanisms run in parallel, each free to focus on a different kind of relationship (for example, one head might track syntax while another tracks coreference), and their outputs are combined.
  • Positional encoding. Attention alone is order-blind, so a signal encoding each token's position is added to its embedding, restoring word order.
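The attention formula above fits in a short function. This is a single-head sketch in plain Python with lists standing in for matrices; the inputs are made-up numbers, and a real implementation would use a tensor library plus learned projections to produce Q, K, and V.

```python
import math


def softmax(row):
    """Turn a list of scores into weights that sum to one."""
    m = max(row)  # subtract the max for numerical safety
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention. Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        all_weights.append(weights)
        # New representation: weighted sum of all value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs, all_weights


# Hypothetical inputs: one query that matches the first key strongly.
Q = [[10.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(Q, K, V)
print(weights[0])  # almost all weight on the first position
print(out[0])      # therefore close to the first value, [1.0, 2.0]
```

Nothing in the loop over queries depends on the previous iteration, which is why the real thing can be computed for all positions at once as a matrix product.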

This solves the static-embedding problem too: in a Transformer, bank gets a different representation in "river bank" than in "central bank", because its vector is built from the surrounding context every time. These are contextual embeddings, and they're why the architecture took over the field.
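The positional signal mentioned in the list above can be sketched with the fixed sinusoidal scheme from Attention Is All You Need, where even dimensions use a sine and odd dimensions a cosine of the position at different frequencies. The function names are mine, and many later models learn their position vectors instead.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal position vectors: sin on even dimensions, cos on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            i = j // 2  # each sin/cos pair shares one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the position vector for each slot to the token embedding in it."""
    table = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(erow, prow)] for erow, prow in zip(embeddings, table)]


# The same (hypothetical) token embedding at two different positions.
same_word_twice = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
with_positions = add_positions(same_word_twice)
print(with_positions[0] != with_positions[1])  # True: position now distinguishes them
```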

07

Pretraining and large language models

Transformers unlocked a training recipe that now dominates NLP: pretrain then fine-tune. First train a large model on a mountain of unlabelled text with a self-supervised objective — predict a masked-out word, or predict the next word. No human labels needed, so it can learn from essentially the whole web. Then adapt that general model to a specific task with a comparatively tiny labelled dataset.

Two families came out of this:

  • Encoders (BERT-style) read the whole sentence at once, left and right, and are trained by masking words. They're built for understanding — classification, named-entity recognition, retrieval.
  • Decoders (GPT-style) read left-to-right and are trained to predict the next token. They're built for generation, and scaling them up — more parameters, more data — is what produced today's large language models.
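The two objectives differ mainly in how training examples are carved out of raw text, which a short sketch can show. The function names are mine, and the masking is simplified: it hides each token with a fixed probability (15% is the rate commonly associated with BERT) and skips the refinements real recipes add.

```python
import random


def next_token_examples(tokens):
    """Decoder-style: every prefix is an input, the following token is the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Encoder-style: hide random tokens; the labels are the hidden originals."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets[i] = tok
        else:
            inputs.append(tok)
    return inputs, targets


tokens = "the cat sat on the mat".split()

for context, label in next_token_examples(tokens):
    print(context, "->", label)

inputs, targets = masked_example(tokens, mask_prob=0.5, seed=1)
print(inputs, targets)
```

In both cases the labels come from the text itself, which is what makes the objective self-supervised: the decoder only ever sees what came before the label, while the encoder sees context on both sides of each hidden slot.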

The headline lesson of the last few years is that much of what looks like reasoning emerges from this one simple objective — predict the next token — once the model and its training data are large enough. The plumbing underneath is still tokens, embeddings, and attention.

08

How you measure it

A model is only as trustworthy as its evaluation. The right metric depends on the task.

For classification (spam / not-spam, claim supported / refuted), accuracy misleads whenever classes are imbalanced — a detector that always says "not spam" scores 99% if only 1% is spam. So you report precision (of what I flagged, how much was right), recall (of what was actually there, how much I caught), and their harmonic mean, the F1 score:

F1 = 2 · (precision · recall) / (precision + recall)
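The spam example can be checked directly. This sketch computes the three metrics from label lists; the function name and the convention of returning 0.0 when a denominator is empty are my assumptions (libraries differ on that edge case).

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 100 messages, 1 spam (label 1); the "detector" always says not-spam (label 0).
y_true = [1] + [0] * 99
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                             # 0.99
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)
```

Accuracy rewards the detector for the 99 easy negatives; F1 exposes that it never caught a single spam message.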

For language modelling, perplexity measures how surprised the model is by held-out text — lower is better, and it's roughly the average number of equally-likely words the model was choosing between. For generation tasks like translation or summarisation, metrics such as BLEU and ROUGE compare the output's overlapping word sequences against human references — useful but blunt, which is why human evaluation never fully goes away.
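Perplexity is the exponential of the average negative log-probability the model assigned to each held-out token. The sketch below assumes you already have those per-token probabilities as a list; the numbers are invented to show the "equally-likely choices" reading.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that gives every correct token probability 1/4 is, in effect,
# choosing between four equally likely words at each step.
print(perplexity([0.25] * 8))           # approximately 4
print(perplexity([0.9, 0.8, 0.95]))     # confident model: close to 1
print(perplexity([0.01, 0.02, 0.005]))  # surprised model: very large
```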

09

Where I used it

The same shape recurs in production work: a transparent baseline to set the bar and sanity-check the data, then a contextual model where the ambiguity genuinely needs resolving — and an evaluation honest enough to tell the two apart.

10

Refresh in 60 seconds