Large Language Models

Large language models, the technology behind ChatGPT, Claude, and the current wave of AI, can feel like magic or like a threat, and both reactions get in the way of using them well. The clearest way to understand them is to start from the deflating truth at their core: an LLM is a next-token predictor. Given some text, it assigns a probability to every possible next chunk (a token), picks one, appends it, and repeats. That's it. Every remarkable and every frustrating thing about these models follows from that one mechanism, carried out at a scale that's genuinely hard to picture.

This page builds LLMs up from that mechanism — the transformer that powers it, the training that shapes it, the prompting that steers it, and the honest limits (especially hallucination) that you have to respect to use them responsibly. It draws together the deep-learning, NLP, RL, and retrieval threads from across this section.

A next-word predictor, at scale

At its heart, an LLM models the probability of the next token (a word or word-piece) given everything before it:

$$P(\text{token}_t \mid \text{token}_1, \text{token}_2, \dots, \text{token}_{t-1})$$

To generate text, it samples a token from that distribution, adds it to the input, and predicts again — one token at a time, autoregressively. The astonishing finding of the last few years is that when you train this simple objective on a large enough model with enough text, abilities you never explicitly programmed — translation, summarisation, arithmetic, apparent reasoning — emerge as a by-product. To predict the next word well across all of human writing, the model is forced to learn a great deal about the world the writing describes. The simplicity of the goal hides the depth of what achieving it requires.
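The sample-append-repeat loop can be sketched in a few lines. This is a toy under stated assumptions: `TOY_MODEL` is a made-up probability table that conditions only on the previous token, standing in for a real network that conditions on the whole context, and the `<s>` and `</s>` markers are hypothetical start and end tokens.

```python
import random

# Hypothetical toy "model": made-up next-token probabilities keyed by the
# previous token. A real LLM computes these from the entire context.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}

def next_token_distribution(tokens):
    """Return P(next token | context). This toy only looks at the last token."""
    return TOY_MODEL[tokens[-1]]

def generate(max_new_tokens=10, seed=0):
    """Autoregressive generation: sample one token, append it, predict again."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist)
        weights = [dist[t] for t in candidates]
        # Sample from the distribution rather than always taking the top token.
        token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == "</s>":
            break
        tokens.append(token)  # the sampled token becomes part of the context
    return tokens[1:]  # drop the start marker

print(" ".join(generate()))
```

Changing `seed` can change the output, which is the same reason a real model may answer the same prompt differently on different runs.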

The transformer engine

The architecture that made this work is the transformer, from the deep-learning page, and its key piece is self-attention. When processing each token, attention lets the model look at every other token in the context and weigh how relevant each is to the current one — so it can resolve "it" to the right noun, connect a question to its answer, and track meaning across long passages.

Two properties of attention explain why transformers, and not the older sequence models, powered this revolution: it captures long-range relationships directly (any token can attend to any other, however far apart), and it parallelises beautifully across a sequence, which is what made training on internet-scale data computationally feasible. The transformer is the engine; scale is the fuel.
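To make "weigh how relevant each token is" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes the query, key, and value vectors are already given; a real transformer produces them with learned projections, uses many heads in parallel, and runs on matrix hardware. The `causal` flag is an illustrative parameter: when set, each token sees only itself and earlier tokens, which is how a next-token predictor is normally run.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values, causal=True):
    """Scaled dot-product attention over one sequence of vectors."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # Positions this token is allowed to look at.
        visible = list(range(i + 1)) if causal else list(range(len(keys)))
        # Relevance of each visible token to the current one.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in visible
        ]
        weights = softmax(scores)
        # Output is a relevance-weighted average of the value vectors.
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, visible))
            for d in range(d_v)
        ])
    return outputs

# Three toy 2-dimensional token vectors, used as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(x, x, x):
    print([round(v, 3) for v in row])
```

Each output row depends only on the inputs, not on the other output rows, so all positions can be computed at once. That independence is the parallelism the paragraph above refers to.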

Pretraining: reading the internet

Pretraining is where the model learns. It's shown a vast corpus — much of the public internet, books, code — and trained, by gradient descent, to predict the next token at every position. No labels are needed; the text is its own supervision (the next word is the answer), which is why it can consume trillions of words of raw text.
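A small sketch shows why no labels are needed: every prefix of the text is an input, and the token that follows it is the answer. Whitespace splitting here is a stand-in for a real tokenizer, and the `probs` dictionary is a hypothetical model output, not a measured one.

```python
import math

def next_token_examples(tokens):
    """Every prefix is an input; the token that follows it is the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, target):
    """Loss at one position: -log of the probability given to the true next token."""
    return -math.log(predicted_probs[target])

tokens = "the cat sat on the mat".split()  # stand-in for real tokenization
for context, target in next_token_examples(tokens):
    print(context, "->", target)

# Hypothetical model output for the context ["the", "cat"].
probs = {"sat": 0.5, "ran": 0.3, "mat": 0.2}
print(round(cross_entropy(probs, "sat"), 3))  # 0.693
```

Gradient descent adjusts the weights to lower this loss averaged over every position in the corpus, which pushes probability toward the tokens that actually came next.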

The empirical engine behind the recent leaps is a set of scaling laws: performance, measured as next-token prediction loss, improves smoothly and predictably as you increase model size, data, and compute together. Push all three far enough and new capabilities appear, sometimes abruptly. This is also why these models are so expensive and concentrated among a few labs: pretraining a frontier model costs enormous compute. What you get out of it is a base model: fluent, knowledgeable, but raw and not yet useful as an assistant. That takes a second phase.
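As an illustration of what "predictably" means, one parametric form used in the scaling-law literature writes the pretraining loss as a function of parameter count $N$ and training tokens $D$. It is shown as an example of the shape of such a law, not as the definitive one, and no fitted values are implied:

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here $E$ is a floor the loss cannot drop below, and $A$, $B$, $\alpha$, $\beta$ are positive constants fitted to experiments. Growing only $N$ or only $D$ shrinks just one term, which is why size and data have to be scaled together.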

From base model to assistant

A raw base model just continues text — ask it a question and it might reply with more questions, because that's a plausible continuation. Turning it into a helpful, safe assistant takes two further stages:

[Figure: the three training stages. Pretraining yields a fluent but raw next-token predictor; instruction tuning teaches it to follow requests; RLHF aligns it with human preferences for helpfulness and safety. Only after all three is it the assistant you interact with.]

Instruction tuning (supervised fine-tuning): further training on examples of instructions paired with good responses, teaching the model to answer rather than just continue.
RLHF (reinforcement learning from human feedback): humans rank model outputs, a reward model learns those preferences, and the LLM is tuned to maximise that reward, nudging it toward helpful, honest, harmless answers. Because the target is what raters prefer, it can also reward sycophancy and confident-sounding answers.
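The "reward model learns those preferences" step can be sketched with a pairwise loss. The text above does not say which loss is used; this example assumes a Bradley-Terry style objective, a common choice for learning from rankings. The reward scores are hypothetical numbers rather than outputs of a trained model.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin)).

    The loss is small when the answer humans preferred gets the higher score.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# Hypothetical reward-model scores for two answers to the same prompt.
print(preference_loss(2.0, 0.0))  # preferred answer scored higher: low loss
print(preference_loss(0.0, 2.0))  # preferred answer scored lower: high loss
```

Training the reward model means lowering this loss across many human-ranked pairs. The LLM is then tuned to produce outputs the reward model scores highly.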

In-context learning & prompting

The surprising capability that made LLMs so flexible is in-context learning: you can get the model to do a new task just by describing it (or showing a few examples) in the prompt, with no retraining. Show it nothing and ask (zero-shot), or give a couple of worked examples (few-shot), and it adapts on the fly. The model isn't learning in the training sense — its weights don't change — it's recognising the pattern of the task from the context and continuing it.
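Zero-shot and few-shot prompting differ only in the text placed before the question. The sketch below builds both; the sentiment task, the `Review:` and `Sentiment:` labels, and the example reviews are all hypothetical choices for illustration.

```python
def build_prompt(instruction, examples, query):
    """Assemble a prompt: instruction, optional worked examples, then the new case."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    # End on the open slot so the most plausible continuation is the answer.
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

instruction = "Classify the sentiment of each review as positive or negative."
examples = [
    ("Loved every minute of it.", "positive"),
    ("A complete waste of time.", "negative"),
]

zero_shot = build_prompt(instruction, [], "The plot dragged badly.")
few_shot = build_prompt(instruction, examples, "The plot dragged badly.")
print(few_shot)
```

Nothing about the model changes between the two calls. The few-shot version just gives it a clearer pattern to continue.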

This is why prompting became a skill: how you frame the request materially changes the output. A useful trick is chain-of-thought — asking the model to "think step by step" — which often improves reasoning, because generating the intermediate steps gives it more relevant tokens to condition the answer on. It all happens within the context window: the fixed budget of tokens the model can attend to at once. Anything outside it — earlier in a long conversation, or in a document you didn't paste — simply isn't seen.
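The context-window limit is easy to see in code. This sketch keeps only the most recent messages that fit a token budget. It assumes a word count as a stand-in for a real tokenizer, and dropping the oldest messages first is just one possible policy.

```python
def fit_to_window(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages whose total token count fits the window."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # everything older than this is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

conversation = [
    "my name is Ada",
    "what is RAG",
    "and what is my name",
]
print(fit_to_window(conversation, max_tokens=8))
```

With a budget of 8 words, the first message no longer fits, so the model would be asked for a name it never sees.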

Why it confidently makes things up

The most important limitation to internalise: an LLM has no concept of truth. It generates the most plausible continuation, not the most correct one — and when a fluent-sounding falsehood is more probable than an awkward truth (or the model simply doesn't "know"), it produces the falsehood with total confidence. This is hallucination, and it's not a bug to be fully patched out — it's intrinsic to a system that models likelihood rather than facts.

Grounding it: RAG

The leading practical mitigation for hallucination and the knowledge cutoff (the model knows nothing newer than its training data) is retrieval-augmented generation (RAG). Instead of relying on the model's frozen memory, you first retrieve relevant documents (via search or embeddings), paste them into the context, and ask the model to answer from those documents. The LLM becomes a reasoning-and-language layer over a trusted, current, citable knowledge source.

This is why the decades-old information-retrieval machinery suddenly sits at the centre of modern AI — the "R" in RAG is exactly that retrieval step. It dramatically reduces (though doesn't eliminate) fabrication, lets the model cite its sources, and keeps it current without retraining. It's the difference between asking a model what it remembers and asking it to read and summarise what you handed it — the latter is far more trustworthy, and the basis of most serious LLM applications.
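The retrieve-then-answer pattern can be sketched end to end without a model. The example assumes a crude word-overlap score in place of real search or embeddings. The documents in `DOCS`, their ids, and the prompt wording are hypothetical. The final call to an LLM is left out because it depends on whichever model or API is in use.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words, ignoring basic punctuation."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return set(text.lower().split())

def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (stand-in for search or embeddings)."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc["text"])),
        reverse=True,
    )
    return ranked[:k]

def build_grounded_prompt(query, retrieved):
    """Paste the retrieved sources into the context and ask for a cited answer."""
    sources = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in retrieved)
    return (
        "Answer using only the sources below and cite their ids. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )

# Hypothetical document store.
DOCS = [
    {"id": "doc-1", "text": "The leave policy grants 20 days of annual leave."},
    {"id": "doc-2", "text": "Expense claims must be filed within 30 days."},
    {"id": "doc-3", "text": "The office is closed on public holidays."},
]

question = "How many days of annual leave does the policy grant?"
print(build_grounded_prompt(question, retrieve(question, DOCS, k=1)))
```

The resulting prompt is what gets sent to the model. Because the answer is requested from the pasted sources with their ids, a reader can check each claim against the cited document.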

Where it shows up in my work

A powerful tool, used with discipline

LLMs are part of the day-to-day toolkit now — drafting, summarising long documents, transforming and explaining text, writing and debugging code. The single most important thing this understanding buys is the right mental model: it's a fluent next-token predictor, not a knowledge base, so I lean on it for language work (where it excels) and verify every factual claim (where it can't be trusted), because hallucination is intrinsic, not occasional.

In a government setting that discipline is non-negotiable — a confident fabrication in a brief is worse than no answer — which is why RAG (grounding the model in real, citable documents) is the pattern that actually fits accountable work, and why it ties straight to retrieval. Knowing the machinery — the cutoff, the bias, the RLHF-driven over-confidence — is what separates using these tools critically from being misled by them.

Refresh in 60 seconds

An LLM is a next-token predictor at huge scale: $P(\text{token}_t \mid \text{token}_1, \dots, \text{token}_{t-1})$. Capabilities emerge from doing that well.
The engine is the transformer + self-attention (long-range + parallelisable). Pretraining on internet-scale text (self-supervised); scaling laws drive the leaps.
Three stages: pretrain (raw) → instruction-tune (follow requests) → RLHF (align to humans — but risks sycophancy).
In-context learning: describe/show the task in the prompt (zero/few-shot, chain-of-thought) — no retraining; limited by the context window.
Hallucination is intrinsic — it models plausibility, not truth. Plus knowledge cutoff, bias, RLHF over-confidence. Verify every factual claim.
RAG grounds it: retrieve real documents → answer from them. The "R" is information retrieval — the basis of trustworthy LLM apps.

The next-token/transformer framing, the pretrain→SFT→RLHF pipeline, the intrinsic-hallucination point, and RAG-as-grounding reflect current LLM references alongside hands-on use.