Reinforcement Learning

Machine learning has three great paradigms. Supervised learning learns from labelled examples (here's a cat, here's a dog). Unsupervised learning finds structure with no labels at all. Reinforcement learning (RL) is the third, and the most different: it learns from reward, by doing. No one tells the agent the right answer; it tries actions, sees what happens, and gradually works out a strategy that earns the most reward over time.

It's how programs learned to beat humans at Go and Atari, how robots learn to walk, and part of how modern language models are tuned. It completes the trilogy of this section's ML pages — and its central difficulty, designing the reward, turns out to be one of the deepest problems in AI. This page builds it from the loop up.

The third paradigm: learning from reward

The defining feature of RL is that there's no labelled dataset. Instead there's a goal, and a reward signal that tells the agent how well it's doing. The agent's job is to learn a policy — a way of choosing actions — that maximises the total reward it collects over time. It learns by trial and error, the way you'd learn a game nobody explained: play, notice what scores, do more of that.

Two features make RL genuinely harder than supervised learning. First, the feedback is evaluative, not instructive — the reward tells you how good your action was, not what the right action would have been. Second, rewards can be delayed: the move that wins a chess game might have been made twenty moves earlier. Connecting a late reward to the early action that earned it — the credit assignment problem — is much of what RL is about.

The agent-environment loop

Everything in RL is built on one loop. An agent observes the current state of an environment, takes an action, and the environment responds with a reward and a new state. Repeat. The agent's whole existence is this cycle, and its goal is to choose actions that maximise reward over the long run — not just the next step.

Figure: the reinforcement-learning loop. The agent sees a state and picks an action; the environment returns a reward and the next state. Around and around: the agent learns the policy that maximises long-run reward, not just the immediate one.
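As a rough illustration of the loop, here is a minimal sketch in Python. The `Corridor` environment, its `reset`/`step` methods, and the two policies are hypothetical names made up for this example; they only loosely resemble the interfaces of common RL libraries and are not any real library's API.

```python
import random


class Corridor:
    """Hypothetical toy environment: cells 0..4, reward +1 on reaching cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, anything else moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run the loop once: observe state, act, receive reward and next state."""
    state = env.reset()
    total_reward = 0.0
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward, trajectory


def always_right(state):
    """A fixed policy: ignore the state and always move right."""
    return 1


rng = random.Random(0)


def random_policy(state):
    """A policy that knows nothing: pick left or right at random."""
    return rng.choice([0, 1])


print(run_episode(Corridor(), always_right)[0])       # total reward collected
print(len(run_episode(Corridor(), random_policy)[1]))  # steps the random walk took
```

Nothing here learns yet. The point is only the shape of the interaction: a policy maps states to actions, and the environment answers each action with a reward and a next state.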

The Markov decision process

The formal frame for that loop is the Markov Decision Process (MDP): a set of states $S$, actions $A$, transition probabilities, and rewards. Its defining assumption is the Markov property: the future depends only on the current state, not the full history of how you got there. The present state captures everything relevant.

The agent's objective is to maximise the expected return — the cumulative future reward — usually discounted so that sooner rewards count more than distant ones:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount factor $\gamma \in [0,1)$ sets how far-sighted the agent is: near 0 it's myopic (grab reward now), near 1 it plans for the long game. That one knob captures the whole tension between short-term and long-term payoff.
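To make the effect of the discount factor concrete, the sketch below computes the return for a finite list of rewards. The function name `discounted_return` and the reward sequence are made up for illustration; `rewards[k]` plays the role of $R_{t+k+1}$ in the sum.

```python
def discounted_return(rewards, gamma):
    """Discounted return G_t, where rewards[k] stands for R_{t+k+1}."""
    g = 0.0
    # Work backwards through the rewards, using G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Made-up episode: nothing for three steps, then a single reward of 10.
rewards = [0.0, 0.0, 0.0, 10.0]
for gamma in (0.1, 0.5, 0.9, 0.99):
    print(gamma, discounted_return(rewards, gamma))
```

The delayed reward is worth $10\gamma^3$ from the start of the episode: about 0.01 to a myopic agent with $\gamma = 0.1$, but about 7.29 with $\gamma = 0.9$.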

Explore vs exploit: the core dilemma

RL has a tension with no equivalent in supervised learning. At each step the agent can exploit — take the action it currently believes is best — or explore — try something else that might be better, or might be worse. Exploit too much and you lock in a mediocre habit, never discovering the better option. Explore too much and you waste time on known-bad actions. Balancing the two is the exploration-exploitation trade-off.

The simplest workable answer is ε-greedy: exploit the best-known action most of the time, but with small probability $\varepsilon$ pick a random action to keep learning. The same dilemma is the whole story of the multi-armed bandit — which slot machine to pull when you only learn by pulling — and it shows up far beyond RL, in A/B testing and recommendation alike.
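A minimal sketch of ε-greedy on a multi-armed bandit, assuming three slot machines that pay 1 with fixed probabilities. The function name, the payout probabilities, and the step count are all invented for the example.

```python
import random


def epsilon_greedy_bandit(win_probs, epsilon=0.1, steps=5000, seed=0):
    """Play a Bernoulli bandit with epsilon-greedy; return estimates and pull counts."""
    rng = random.Random(seed)
    n_arms = len(win_probs)
    counts = [0] * n_arms       # how often each arm was pulled
    estimates = [0.0] * n_arms  # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick any arm at random
        else:
            # exploit: pick the arm with the best estimate (ties go to the first)
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(estimates, counts)
```

With `epsilon=0.0` this particular sketch never leaves the first arm, because the other arms' estimates stay at their initial value of zero and are never tested. That is the "mediocre habit" failure in miniature; a small ε is what lets the agent discover the better machine.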

Value functions & the Bellman equation

To act well, the agent needs a sense of which situations are good. A value function captures exactly that: the expected long-run reward from a state (or from taking an action in a state). The action-value $Q(s,a)$ is "how much total reward can I expect if I take action $a$ in state $s$, then act well thereafter?"

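The cornerstone is the Bellman equation, which gives value a recursive structure: the value of now is the immediate reward plus the (discounted) value of where you land next. For the optimal action-value, with $r$ the immediate reward and $s'$ the state you land in (and an expectation taken over both when the environment is random):

$$Q(s,a) = r + \gamma \max_{a'} Q(s', a')$$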

That recursion is the engine of nearly every RL method. It decomposes a daunting long-horizon problem — "what's the best strategy over thousands of steps?" — into a local, solvable relationship between each state and the next. Solve the Bellman equation and you know the value of everything; act greedily on those values and you have an optimal policy.
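When the rules are known, the Bellman equation can be solved by simply applying it repeatedly until the values stop changing (value iteration, here on Q-values). The sketch below assumes a made-up deterministic corridor of five cells with a reward of 1 for reaching the last one; `model`, `q_value_iteration`, and the constants are illustrative names, not part of the original text.

```python
GAMMA = 0.9
N_STATES = 5      # cells 0..4; cell 4 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def model(state, action):
    """Known, deterministic rules of the hypothetical corridor."""
    step = 1 if action == 1 else -1
    next_state = min(max(state + step, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_value_iteration(sweeps=50):
    """Apply Q(s,a) = r + gamma * max_a' Q(s',a') until it settles."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(N_STATES - 1):  # the terminal cell keeps value 0
            for a in ACTIONS:
                s2, r = model(s, a)
                q[(s, a)] = r + GAMMA * max(q[(s2, b)] for b in ACTIONS)
    return q


q = q_value_iteration()
# Acting greedily on the solved values gives the policy.
greedy_policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(q[(0, 1)], greedy_policy)
```

In this toy case the values fall off by a factor of $\gamma$ per step away from the goal (1, 0.9, 0.81, 0.729 for moving right from cells 3, 2, 1, 0), and the greedy policy is "always right".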

Q-learning: learning the values

You rarely know the environment's rules in advance, so you can't just solve Bellman directly — you have to learn the values from experience. Q-learning does this with a beautifully simple update. After each action, it nudges its estimate $Q(s,a)$ toward what it just observed:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

The bracket is the temporal-difference error: the gap between what the agent expected and what actually happened (the reward plus the value of where it landed). The update shifts the estimate a fraction $\alpha$ (the learning rate) in that direction. Do this over and over while exploring, and in the tabular case the Q-values converge to the true ones, provided every state-action pair keeps being visited and the learning rate is decayed appropriately. The agent learns the optimal policy without ever being told the rules of the world, purely from the rewards it collected. That last point is what makes RL feel remarkable.
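Here is a minimal sketch of tabular Q-learning with ε-greedy exploration, assuming a made-up five-cell corridor where reaching the last cell pays 1. The environment function `step`, the hyperparameter values, and the episode count are illustrative assumptions; the learner only ever sees the states and rewards that `step` returns, never its rules.

```python
import random

GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2
N_STATES = 5      # cells 0..4; cell 4 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    """Environment rules, hidden from the learner: it only sees the results."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)  # explore
            else:
                # exploit, breaking ties at random so early episodes still wander
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # The terminal cell's values are never updated, so they stay 0
            # and the target there reduces to the reward alone.
            target = reward + GAMMA * max(q[(next_state, a)] for a in ACTIONS)
            td_error = target - q[(state, action)]
            q[(state, action)] += ALPHA * td_error
            state = next_state
    return q


q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

A fixed learning rate is enough here because the toy environment is deterministic; in a noisy environment the estimates would keep jittering unless $\alpha$ were reduced over time.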

Deep RL: when the state space explodes

Plain Q-learning stores a value for every state-action pair in a table: fine for a grid world, hopeless for chess or raw pixels, where the states are astronomically many. Deep reinforcement learning replaces the table with a neural network that approximates $Q(s,a)$, generalising across similar states it has never seen.

This is the combination behind the headline results: a Deep Q-Network (DQN) learning Atari games straight from the screen, and the systems that mastered Go and beyond. Deep learning supplies the perception and generalisation; RL supplies the goal-directed decision-making — a powerful pairing, and also where RL's instabilities get most acute.

Reward hacking & honest limits

RL's greatest strength — relentlessly maximising the reward — is also its greatest danger. The agent optimises exactly what you reward, which is rarely exactly what you meant.
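A toy calculation shows how this can go wrong. Assume a hypothetical racing game where the designer wants the agent to finish the race but, as a proxy, rewards each checkpoint (+1) and gives a finish bonus (+10). If one checkpoint respawns, circling it forever may score higher than finishing. All numbers and names below are made up for illustration.

```python
def proxy_return(rewards, gamma=0.99):
    """Discounted sum of the proxy reward the agent is actually optimising."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


HORIZON = 100

# Behaviour the designer meant: three checkpoints, then the finish bonus.
finish_rewards = [1.0, 1.0, 1.0, 10.0]

# Behaviour the reward allows: circle a respawning checkpoint,
# collecting +1 every other step and never finishing.
loop_rewards = [1.0 if t % 2 == 0 else 0.0 for t in range(HORIZON)]

behaviours = {"finish": finish_rewards, "loop": loop_rewards}
for name, rewards in behaviours.items():
    print(name, round(proxy_return(rewards), 2))

# A pure reward maximiser prefers whichever behaviour scores higher.
best = max(behaviours, key=lambda name: proxy_return(behaviours[name]))
print("preferred:", best)
```

Under these assumed numbers the looping behaviour wins comfortably, even though it never achieves the intended goal. Nothing is wrong with the optimiser; the reward simply does not say what the designer meant.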

The practical limits are real too: RL is extraordinarily sample-hungry (it may need millions of trials, so it's mostly trained in simulation), unstable to train (small changes to hyperparameters or even the random seed can produce wildly different outcomes), and subject to the sim-to-real gap: a policy that is perfect in simulation can fail on real hardware whose details the simulator didn't capture. RL is spectacular where you can simulate cheaply and define reward cleanly, and treacherous where you can't.

Where it shows up in my work

The right frame for sequential decisions

RL is less an everyday analyst tool than a way of thinking about sequential decisions — problems where today's choice changes tomorrow's situation, and you're after long-run payoff rather than a one-shot prediction. Recognising when a problem has that shape (and when it doesn't) is the useful judgement: a great deal of analysis is better served by causal inference or a supervised model than by reaching for RL.

What carries over most is the cautionary core. The explore-exploit trade-off is the same logic as a multi-armed bandit in A/B testing. And reward hacking is the sharpest version of a lesson that runs through this whole section: optimise a proxy and you get the proxy, not the goal — a discipline that matters anywhere a metric drives behaviour, well beyond RL itself.

Refresh in 60 seconds

RL is the third paradigm: learn from reward, by doing — no labels. Feedback is evaluative and often delayed (credit assignment).
The loop: agent → action → environment → reward + next state → repeat, formalised as an MDP (Markov property; maximise discounted return $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$).
Core dilemma: explore vs exploit (ε-greedy; multi-armed bandits).
Value functions + the Bellman equation $Q(s,a)=r+\gamma\max_{a'}Q(s',a')$ give long-horizon planning a recursive structure.
Q-learning learns those values from experience via the TD error — optimal policy without knowing the rules. Deep RL swaps the table for a neural net (DQN, Go, Atari).
The danger: reward hacking — it optimises what you reward, not what you meant. Plus sample-hungry, unstable, sim-to-real gap.

The Bellman/Q-learning formulation, exploration-exploitation framing, and reward-hacking caution reflect current reinforcement-learning references alongside ML coursework.