Reinforcement Learning
The third way machines learn: not from labelled examples, not by finding structure, but by trying things and chasing reward. It's how a program learns to play, to control, to decide — and the hardest part is saying what 'good' even means.
- Studied: Reinforcement Learning (Advanced · learning from reward)
- When: ML & AI coursework
- Applied in: Sequential decisions
- Read / Refreshed: ~16 min read · 2026-06-26
Machine learning has three great paradigms. Supervised learning learns from labelled examples (here's a cat, here's a dog). Unsupervised learning finds structure with no labels at all. Reinforcement learning (RL) is the third, and the most different: it learns from reward, by doing. No one tells the agent the right answer; it tries actions, sees what happens, and gradually works out a strategy that earns the most reward over time.
It's how programs learned to beat humans at Go and Atari, how robots learn to walk, and part of how modern language models are tuned. It completes the trilogy of this section's ML pages — and its central difficulty, designing the reward, turns out to be one of the deepest problems in AI. This page builds it from the loop up.
01
The third paradigm: learning from reward
The defining feature of RL is that there's no labelled dataset. Instead there's a goal, and a reward signal that tells the agent how well it's doing. The agent's job is to learn a policy — a way of choosing actions — that maximises the total reward it collects over time. It learns by trial and error, the way you'd learn a game nobody explained: play, notice what scores, do more of that.
Two features make RL genuinely harder than supervised learning. First, the feedback is evaluative, not instructive — the reward tells you how good your action was, not what the right action would have been. Second, rewards can be delayed: the move that wins a chess game might have been made twenty moves earlier. Connecting a late reward to the early action that earned it — the credit assignment problem — is much of what RL is about.
02
The agent-environment loop
Everything in RL is built on one loop. An agent observes the current state of an environment, takes an action, and the environment responds with a reward and a new state. Repeat. The agent's whole existence is this cycle, and its goal is to choose actions that maximise reward over the long run — not just the next step.
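The loop can be written down in a few lines. The sketch below is illustrative only: `CorridorEnv`, its `reset`/`step` methods, and the two policies are hypothetical names invented for this example, not any particular library's API.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run the observe -> act -> reward cycle once; return (total reward, steps)."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(state)                   # the agent chooses an action
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def random_policy(state):
    return random.choice([-1, 1])


def always_right(state):
    return 1
```

Calling `run_episode(CorridorEnv(5), always_right)` walks straight to the goal in four steps, while `random_policy` usually gets there eventually but takes longer. Learning is the process of moving from the second kind of policy toward the first.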
03
The Markov decision process
The formal frame for that loop is the Markov Decision Process (MDP): a set of states $S$, actions $A$, transition probabilities, and rewards. Its defining assumption is the Markov property: the future depends only on the current state, not the full history of how you got there. The present state captures everything relevant.
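One way to make those four ingredients concrete is to write a small MDP as plain data. The two-state example below is hypothetical (the state names, action names, and numbers are invented for illustration). Each entry maps a state and an action to a list of possible outcomes, and nothing but the current state and action is consulted, which is the Markov property in code form.

```python
# Hypothetical two-state MDP; names and numbers are made up for illustration.
# transitions[state][action] is a list of (probability, next_state, reward).
transitions = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
        "fast": [(1.0, "warm", -10.0)],
    },
}


def is_valid_mdp(transitions, tol=1e-9):
    """Check that outcome probabilities sum to 1 and next states are known."""
    for actions in transitions.values():
        for outcomes in actions.values():
            if abs(sum(p for p, _, _ in outcomes) - 1.0) > tol:
                return False
            if any(next_state not in transitions for _, next_state, _ in outcomes):
                return False
    return True


def expected_reward(transitions, state, action):
    """Expected immediate reward for taking `action` in `state`."""
    return sum(p * r for p, _, r in transitions[state][action])
```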
The agent's objective is to maximise the expected return, the cumulative future reward, usually discounted so that sooner rewards count more than distant ones:
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
The discount factor $\gamma$ (between 0 and 1) sets how far-sighted the agent is: near 0 it's myopic (grab reward now), near 1 it plans for the long game. That one knob captures the whole tension between short-term and long-term payoff.
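A short sketch shows how strongly the discount factor changes what a delayed reward is worth. The function name and the reward sequences are assumptions chosen for illustration. The return is accumulated from the last reward backwards, so each earlier step adds its own reward to the discounted total of everything after it.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th reward is weighted by gamma**k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# A single reward of 10 that arrives on the fourth step.
delayed = [0.0, 0.0, 0.0, 10.0]
far_sighted = discounted_return(delayed, 0.9)   # about 7.29: still worth chasing
myopic = discounted_return(delayed, 0.1)        # about 0.01: nearly invisible
```

With a discount factor of 0.9 the delayed reward keeps most of its value, while at 0.1 it all but disappears, which is the myopic versus far-sighted contrast in numbers.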
04
Explore vs exploit: the core dilemma
RL has a tension with no equivalent in supervised learning. At each step the agent can exploit — take the action it currently believes is best — or explore — try something else that might be better, or might be worse. Exploit too much and you lock in a mediocre habit, never discovering the better option. Explore too much and you waste time on known-bad actions. Balancing the two is the exploration-exploitation trade-off.
The simplest workable answer is ε-greedy: exploit the best-known action most of the time, but with small probability ε pick a random action to keep learning. The same dilemma is the whole story of the multi-armed bandit (which slot machine to pull when you only learn by pulling), and it shows up far beyond RL, in A/B testing and recommendation alike.
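As a minimal sketch of ε-greedy on a multi-armed bandit, the code below keeps a running average reward per arm and mostly pulls the arm with the best average. The function names, the Gaussian reward noise, and the default parameters are assumptions made for this example.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Pick a random arm with probability epsilon, otherwise the best-looking arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))    # explore
    return estimates.index(max(estimates))      # exploit (first arm wins ties)


def run_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Simulate a bandit whose arms pay noisy rewards around `true_means`."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[arm], 1.0)   # assumed noise model
        counts[arm] += 1
        # Incremental mean: move the estimate toward the new observation.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts
```

Setting `epsilon=0.0` makes the agent purely greedy, so it can lock onto whichever arm happened to look good first. Setting it to `1.0` makes every pull random, so the agent never uses what it has learned.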
05
Value functions & the Bellman equation
To act well, the agent needs a sense of which situations are good. A value function captures exactly that: the expected long-run reward from a state (or from taking an action in a state). The action-value $Q(s, a)$ is "how much total reward can I expect if I take action $a$ in state $s$, then act well thereafter?"
The cornerstone is the Bellman equation, which gives value a recursive structure: the value of now is the immediate reward plus the (discounted) value of where you land next.
$$Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\right]$$
Here $s'$ is the state you land in, and the expectation is taken over the environment's transitions.
That recursion is the engine of nearly every RL method. It decomposes a daunting long-horizon problem — "what's the best strategy over thousands of steps?" — into a local, solvable relationship between each state and the next. Solve the Bellman equation and you know the value of everything; act greedily on those values and you have an optimal policy.
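When the rules of the environment are known, the recursion can be solved by applying it repeatedly until the values stop changing, a procedure usually called value iteration. The sketch below is written for state values rather than action values to keep it short, and the three-state chain is a hypothetical example.

```python
def backup(outcomes, values, gamma):
    """One-step lookahead: expected reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[ns]) for p, ns, r in outcomes)


def value_iteration(transitions, gamma=0.9, tol=1e-10):
    """transitions[s][a] = [(prob, next_state, reward)]; no actions means terminal."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:
                continue   # terminal states keep value 0
            best = max(backup(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_policy(transitions, values, gamma=0.9):
    """Act greedily on the values: pick the action with the best lookahead."""
    return {
        s: max(actions, key=lambda a: backup(actions[a], values, gamma))
        for s, actions in transitions.items()
        if actions
    }


# Hypothetical chain: A -> B -> goal, with reward 1 for reaching the goal.
chain = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 0.0)], "go": [(1.0, "goal", 1.0)]},
    "goal": {},
}
values = value_iteration(chain)          # A is worth 0.9, B is worth 1.0
policy = greedy_policy(chain, values)    # "go" in both A and B
```

State B is one step from the reward, so it is worth 1.0. State A is one step further away, so its value is B's value discounted once, which is 0.9.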
06
Q-learning: learning the values
You rarely know the environment's rules in advance, so you can't just solve Bellman directly; you have to learn the values from experience. Q-learning does this with a beautifully simple update. After each action, it nudges its estimate toward what it just observed:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]$$
The bracket is the temporal-difference error: the gap between what actually happened (the reward plus the discounted value of where it landed) and what the agent expected. The update shifts the estimate a fraction $\alpha$ (the learning rate) in that direction. Do this over and over while exploring and, in the tabular case, provided every state-action pair keeps being visited and the learning rate decays appropriately, the Q-values converge to the true ones. The agent learns the optimal policy without ever being told the rules of the world, purely from the rewards it collected. That last point is what makes RL feel remarkable.
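A minimal tabular sketch of that update follows. The corridor environment, the function names, and the hyperparameters are assumptions chosen for illustration, and the learning rate is held constant for simplicity rather than decayed. The agent only ever calls `step_fn` and sees the result; it never reads the rules inside it.

```python
import random
from collections import defaultdict


def corridor_step(state, action):
    """Hypothetical environment: cells 0..4, reward 1 for reaching cell 4."""
    next_state = min(max(state + action, 0), 4)
    done = next_state == 4
    return next_state, (1.0 if done else 0.0), done


def q_learning(step_fn, start_state, actions, episodes=500, alpha=0.1,
               gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)   # q[(state, action)], unseen pairs start at 0.0
    for _ in range(episodes):
        state = start_state
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(actions)             # explore
            else:
                best = max(q[(state, a)] for a in actions)
                # Break ties at random so early episodes are not biased.
                action = rng.choice([a for a in actions if q[(state, a)] == best])
            next_state, reward, done = step_fn(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            td_error = reward + gamma * best_next - q[(state, action)]
            q[(state, action)] += alpha * td_error       # nudge toward the target
            state = next_state
            if done:
                break
    return q


q = q_learning(corridor_step, start_state=0, actions=[-1, 1])
```

After training, moving right has the higher Q-value in every non-goal cell, so acting greedily on the table walks straight to the goal.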
07
Deep RL: when the state space explodes
Plain Q-learning stores a value for every state-action pair in a table, which is fine for a grid world but hopeless for chess or raw pixels, where the states are astronomically many. Deep reinforcement learning replaces the table with a neural network that approximates $Q(s, a)$, generalising across similar states it has never seen.
This is the combination behind the headline results: a Deep Q-Network (DQN) learning Atari games straight from the screen, and the systems that mastered Go and beyond. Deep learning supplies the perception and generalisation; RL supplies the goal-directed decision-making — a powerful pairing, and also where RL's instabilities get most acute.
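The idea of swapping the table for a parameterised function can be shown without a neural network. In the sketch below a linear function of hand-made features stands in for the network. This is a deliberate simplification, not how a DQN is built, and the feature choice and function names are assumptions for illustration. The update is the same temporal-difference step as in Q-learning, applied to the weights instead of a table cell.

```python
def features(state, action, n_states=5):
    """Hypothetical features: normalised position, the action, and a bias term."""
    return [state / (n_states - 1), float(action), 1.0]


def q_value(weights, state, action):
    """Approximate Q(s, a) as a weighted sum of features."""
    return sum(w * x for w, x in zip(weights, features(state, action)))


def semi_gradient_update(weights, state, action, reward, next_state, done,
                         actions=(-1, 1), alpha=0.1, gamma=0.9):
    """One Q-learning step on the weights; returns (new_weights, td_error)."""
    best_next = 0.0 if done else max(q_value(weights, next_state, a) for a in actions)
    td_error = reward + gamma * best_next - q_value(weights, state, action)
    phi = features(state, action)
    # For a linear model the gradient of Q with respect to the weights is phi.
    new_weights = [w + alpha * td_error * x for w, x in zip(weights, phi)]
    return new_weights, td_error


# One rewarded transition observed from state 3.
weights, td_error = semi_gradient_update([0.0, 0.0, 0.0], 3, 1, 1.0, 4, True)
unseen = q_value(weights, 2, 1)   # state 2 was never visited, yet its estimate moved
```

A single update from state 3 also raises the estimate for state 2, because the two share features. That sharing is the generalisation a table cannot provide, and it is also one source of the instability mentioned above, since every update shifts estimates for states other than the one just visited.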
08
Reward hacking & honest limits
RL's greatest strength — relentlessly maximising the reward — is also its greatest danger. The agent optimises exactly what you reward, which is rarely exactly what you meant.
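A toy calculation makes the gap between "what you reward" and "what you meant" concrete. The scenario is hypothetical: the designer wants the agent to finish a course and pays a bonus of 10 for that, but also pays 1 for each checkpoint touched, and one checkpoint can be circled indefinitely. All names and numbers below are invented for illustration.

```python
def looping_return(reward_per_step, gamma, steps):
    """Discounted return of collecting the same reward every step (geometric series)."""
    return reward_per_step * (1 - gamma ** steps) / (1 - gamma)


def finishing_return(steps_to_finish, finish_bonus, gamma):
    """Discounted return of earning nothing until a bonus on the final step."""
    return gamma ** (steps_to_finish - 1) * finish_bonus


gamma = 0.99
circle_forever = looping_return(1.0, gamma, steps=1000)   # close to 1 / (1 - gamma) = 100
finish_course = finishing_return(5, 10.0, gamma)          # a little under 10
```

Under this reward, circling the checkpoint is worth roughly ten times as much as finishing, so a reward-maximising agent would be right to never finish. The agent is not misbehaving; the reward specification is.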
The practical limits are real too: RL is extraordinarily sample-hungry (it may need millions of trials, so it's mostly trained in simulation), unstable to train (small changes in hyperparameters or random seed can give wildly different outcomes), and subject to the sim-to-real gap: a policy that is perfect in simulation can fail on real hardware because of effects the simulator didn't capture. RL is spectacular where you can simulate cheaply and define reward cleanly, and treacherous where you can't.
09
Where it shows up in my work
10
Refresh in 60 seconds
The Bellman/Q-learning formulation, exploration-exploitation framing, and reward-hacking caution reflect current reinforcement-learning references alongside ML coursework.