Large Language Models

Strip away the mystique and a language model does one thing: predict the next word. Everything else — the fluency, the apparent reasoning, the confident errors — emerges from doing that one thing at almost unimaginable scale.

Studied: Large Language Models (Advanced · the AI moment)
When: AI & current practice
Applied in: Drafting, summarising, RAG
Read / Refreshed: ~17 min read · 2026-06-26

Large language models, the technology behind ChatGPT, Claude, and the current wave of AI, can feel like magic or like a threat, and both reactions get in the way of using them well. The clearest way to understand them is to start from the deflating truth at their core: an LLM is a next-token predictor. Given some text, it assigns a probability to every possible next chunk, picks a likely one, appends it, and repeats. That's it. Every remarkable and every frustrating thing about these models follows from that one mechanism carried out at a scale that's genuinely hard to picture.

This page builds LLMs up from that mechanism — the transformer that powers it, the training that shapes it, the prompting that steers it, and the honest limits (especially hallucination) that you have to respect to use them responsibly. It draws together the deep-learning, NLP, RL, and retrieval threads from across this section.

01

A next-word predictor, at scale

At its heart, an LLM models the probability of the next token (a word or word-piece) given everything before it:

$$P(\text{token}_t \mid \text{token}_1, \text{token}_2, \dots, \text{token}_{t-1})$$

To generate text, it samples a token from that distribution, adds it to the input, and predicts again — one token at a time, autoregressively. The astonishing finding of the last few years is that when you train this simple objective on a large enough model with enough text, abilities you never explicitly programmed — translation, summarisation, arithmetic, apparent reasoning — emerge as a by-product. To predict the next word well across all of human writing, the model is forced to learn a great deal about the world the writing describes. The simplicity of the goal hides the depth of what achieving it requires.
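The sample, append, repeat loop can be sketched in a few lines. This is a toy under loud assumptions: the hand-written `NEXT_TOKEN_PROBS` table stands in for the model, it only looks at the previous token rather than the whole prefix, and the names `sample_next` and `generate` are illustrative, not from any library.

```python
import random

# Hypothetical toy "model": for each previous token, a probability
# distribution over the next token. A real LLM conditions on the whole
# prefix and computes these probabilities with a neural network.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.2, "ran": 0.8},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}


def sample_next(prefix, rng):
    """Sample one token from P(next | prefix); this toy only uses the last token."""
    dist = NEXT_TOKEN_PROBS[prefix[-1]]
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(max_tokens=10, seed=0):
    """Autoregressive loop: sample a token, append it, predict again."""
    rng = random.Random(seed)
    prefix = ["<s>"]
    for _ in range(max_tokens):
        token = sample_next(prefix, rng)
        if token == "</s>":  # end-of-text marker stops generation
            break
        prefix.append(token)
    return prefix[1:]  # drop the start marker


print(" ".join(generate()))
```

Changing `seed` can change the sentence, which is the point: generation draws from a distribution rather than looking up a single stored answer.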

02

The transformer engine

The architecture that made this work is the transformer, from the deep-learning page, and its key piece is self-attention. When processing each token, attention lets the model look at every other token in the context and weigh how relevant each is to the current one — so it can resolve "it" to the right noun, connect a question to its answer, and track meaning across long passages.

Two properties of attention explain why transformers, and not the older sequence models, powered this revolution: it captures long-range relationships directly (any token can attend to any other, however far apart), and it parallelises beautifully across a sequence, which is what made training on internet-scale data computationally feasible. The transformer is the engine; scale is the fuel.
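A minimal sketch of the attention step may make both properties concrete. It assumes a single head, no causal mask, and no learned projections (a real transformer derives queries, keys, and values from the token embeddings with trained weight matrices); the function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, without masking.

    Each argument is a list holding one vector per token.
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Relevance of every token in the context to the current one.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Weighted average of the value vectors.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Three tokens with 2-dimensional vectors, reused as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(self_attention(x, x, x))
```

Every token scores every other token regardless of distance, and each pass through the outer loop is independent of the others, which is what lets real implementations compute all positions at once as matrix multiplications.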

03

Pretraining: reading the internet

Pretraining is where the model learns. It's shown a vast corpus — much of the public internet, books, code — and trained, by gradient descent, to predict the next token at every position. No labels are needed; the text is its own supervision (the next word is the answer), which is why it can consume trillions of words of raw text.
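To see why no labels are needed, here is a small sketch of how raw text turns into training examples and how one prediction is scored. The whitespace tokenisation and the `probs` dictionary are assumptions for illustration; a real model uses a learned tokenizer and outputs a distribution over its whole vocabulary.

```python
import math


def next_token_pairs(tokens):
    """Every position yields one training example: (prefix, next token)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def cross_entropy(predicted_probs, target):
    """Loss at one position: negative log-probability given to the true next token."""
    return -math.log(predicted_probs[target])


tokens = "the cat sat on the mat".split()
for prefix, target in next_token_pairs(tokens):
    print(prefix, "->", target)

# A hypothetical model output for the prefix ["the", "cat"].
probs = {"sat": 0.5, "ran": 0.3, "mat": 0.2}
print(cross_entropy(probs, "sat"))
```

Gradient descent adjusts the weights to lower this loss averaged over every position in the corpus, so a six-word sentence already supplies five supervised examples for free.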

The empirical engine behind the recent leaps is the scaling laws: next-token prediction loss falls predictably as you increase model size, data, and compute together, and capability improves with it. Push all three far enough and new capabilities appear, sometimes abruptly. This is also why these models are so expensive and concentrated among a few labs: pretraining a frontier model costs enormous compute. What you get out of it is a base model: fluent, knowledgeable, but raw and not yet useful as an assistant. That takes a second phase.
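Scaling laws are usually stated for the pretraining loss rather than for capability directly. One parametric form that appears in the scaling-law literature is sketched below; the symbols are assumptions introduced here, not notation from this page: N is the number of model parameters, D the number of training tokens, E an irreducible loss floor, and A, B, α, β positive constants fitted to experiments. No fitted values are given because they depend on the setup.

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Read this way, growing only the model or only the data eventually stalls on the other term, which is why size, data, and compute have to rise together.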

04

From base model to assistant

A raw base model just continues text — ask it a question and it might reply with more questions, because that's a plausible continuation. Turning it into a helpful, safe assistant takes two further stages:

[Figure: pretrain (raw fluency) → instruction-tune (follow requests) → RLHF (align to humans) → assistant (what you use)]
The three training stages. Pretraining yields a fluent but raw next-token predictor; instruction tuning teaches it to follow requests; RLHF aligns it with human preferences for helpfulness and safety. Only after all three is it the assistant you interact with.
  • Instruction tuning (supervised fine-tuning) — further training on examples of instructions paired with good responses, teaching the model to answer rather than just continue.
  • RLHF (reinforcement learning from human feedback) — humans rank model outputs, a reward model learns those preferences, and the LLM is tuned to maximise that reward — nudging it toward helpful, honest, harmless answers.
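The "reward model learns those preferences" step can be illustrated with one common choice of training objective, a pairwise loss on ranked outputs. This is a sketch, assuming the reward model emits a single score per response; `preference_loss` is a hypothetical name and the scores below are made up.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry style) loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) is the same quantity as -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


# The reward model already agrees with the human ranking: small loss.
print(preference_loss(2.0, 0.5))
# The reward model prefers the response humans rejected: larger loss.
print(preference_loss(0.5, 2.0))
```

Minimising this loss pushes the score of the human-preferred response above the rejected one; the LLM is then tuned to produce responses that this learned scorer rates highly.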

05

In-context learning & prompting

The surprising capability that made LLMs so flexible is in-context learning: you can get the model to do a new task just by describing it (or showing a few examples) in the prompt, with no retraining. Give it no examples and simply ask (zero-shot), or give a couple of worked examples (few-shot), and it adapts on the fly. The model isn't learning in the training sense, since its weights don't change; it's recognising the pattern of the task from the context and continuing it.
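The difference between zero-shot and few-shot is nothing more than what text goes into the prompt. A small sketch, assuming a plain "Input:/Output:" layout (one convention among many) and a hypothetical helper `build_prompt`:

```python
def build_prompt(task, examples, query):
    """Assemble a zero-shot (no examples) or few-shot prompt as plain text."""
    lines = [task, ""]
    for source, target in examples:
        lines += [f"Input: {source}", f"Output: {target}", ""]
    # End on an open "Output:" so the likeliest continuation is the answer.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


task = "Translate English to French."
zero_shot = build_prompt(task, [], "cheese")
few_shot = build_prompt(task, [("sea", "mer"), ("bread", "pain")], "cheese")
print(few_shot)
```

The model's weights are identical for both prompts; only the pattern it is asked to continue differs.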

This is why prompting became a skill: how you frame the request materially changes the output. A useful trick is chain-of-thought — asking the model to "think step by step" — which often improves reasoning, because generating the intermediate steps gives it more relevant tokens to condition the answer on. It all happens within the context window: the fixed budget of tokens the model can attend to at once. Anything outside it — earlier in a long conversation, or in a document you didn't paste — simply isn't seen.
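The context-window budget is easy to picture as a truncation step in front of the model. The sketch below keeps the most recent messages that fit; it approximates tokens by whitespace-separated words, which is an assumption for illustration only, and `fit_to_context` is a hypothetical helper.

```python
def fit_to_context(messages, max_tokens):
    """Keep the most recent messages that fit in the token budget.

    Tokens are approximated by whitespace-separated words; real
    tokenizers split text differently.
    """
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = len(message.split())
        if used + cost > max_tokens:
            break  # everything older than this is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = ["my name is Ada", "what is a token", "and what is my name"]
print(fit_to_context(history, max_tokens=9))
```

With a budget of 9 the first message no longer fits, so a model shown only the result has no way to answer the final question: the name was never in what it saw.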

06

Why it confidently makes things up

The most important limitation to internalise: an LLM has no concept of truth. It generates the most plausible continuation, not the most correct one — and when a fluent-sounding falsehood is more probable than an awkward truth (or the model simply doesn't "know"), it produces the falsehood with total confidence. This is hallucination, and it's not a bug to be fully patched out — it's intrinsic to a system that models likelihood rather than facts.
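A deliberately crude illustration of plausibility beating truth: a counting model, not an LLM, trained on made-up text in which a popular misconception appears more often than the correct statement. The corpus, its counts, and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical training text in which a popular misconception outnumbers
# the correct statement. The counts are invented for illustration.
CORPUS = (
    ["the capital of australia is sydney"] * 3
    + ["the capital of australia is canberra"] * 2
)


def most_likely_next(prefix, corpus):
    """Return the most frequent continuation of `prefix` and its estimated probability."""
    prefix_tokens = prefix.split()
    n = len(prefix_tokens)
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if tokens[:n] == prefix_tokens and len(tokens) > n:
            counts[tokens[n]] += 1
    token, count = counts.most_common(1)[0]
    return token, count / sum(counts.values())


print(most_likely_next("the capital of australia is", CORPUS))
```

Nothing in the procedure checks the answer against the world; it only reports what the text made most likely, and it reports it without any signal of doubt.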

07

Grounding it: RAG

The leading practical fix for hallucination and the knowledge cutoff is retrieval-augmented generation (RAG). Instead of relying on the model's frozen memory, you first retrieve relevant documents (via search or embeddings), paste them into the context, and ask the model to answer from those documents. The LLM becomes a reasoning-and-language layer over a trusted, current, citable knowledge source.

This is why the decades-old information-retrieval machinery suddenly sits at the centre of modern AI — the "R" in RAG is exactly that retrieval step. It dramatically reduces (though doesn't eliminate) fabrication, lets the model cite its sources, and keeps it current without retraining. It's the difference between asking a model what it remembers and asking it to read and summarise what you handed it — the latter is far more trustworthy, and the basis of most serious LLM applications.
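The retrieve, paste, and ask flow can be sketched end to end without a model. Assumptions are flagged loudly: retrieval here is bag-of-words cosine similarity (real systems typically use a search index or learned embeddings), the prompt wording is one possible phrasing, and `retrieve` and `build_rag_prompt` are hypothetical names. The resulting string is what you would send to whichever LLM you use.

```python
import math
from collections import Counter


def bag_of_words(text):
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    numerator = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    denominator = norm_a * norm_b
    return numerator / denominator if denominator else 0.0


def retrieve(question, documents, k=2):
    """Rank documents by similarity to the question and keep the top k."""
    q = bag_of_words(question)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question, documents, k=2):
    """Paste the retrieved documents into the context ahead of the question."""
    sources = retrieve(question, documents, k)
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(sources, start=1))
    return (
        "Answer using only the sources below and cite them by number. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )


docs = [
    "refunds are accepted within 30 days of purchase",
    "our office is closed on public holidays",
    "shipping takes five working days",
]
print(build_rag_prompt("when are refunds accepted", docs, k=1))
```

Numbering the sources is what makes citation possible, and the explicit "say so" instruction gives the model a plausible continuation other than inventing an answer.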

08

Where it shows up in my work

09

Refresh in 60 seconds

The next-token/transformer framing, the pretrain→SFT→RLHF pipeline, the intrinsic-hallucination point, and RAG-as-grounding reflect current LLM references alongside hands-on use.