Deep Learning & Neural Networks

Deep learning powers the things that feel like magic — image recognition, translation, the large language models behind today's AI. The magic dissolves, in a good way, once you see the machinery: a neural network is a big stack of simple operations — matrix multiplications with a bit of non-linearity between them — and "training" is just gradient descent nudging millions of numbers to make the errors smaller. No single piece is mysterious; the power comes from scale and how the pieces compose.

This page builds that up from the neuron, leaning on the linear algebra and calculus pages you've already got. The pay-off is the one idea that makes deep learning click: the network learns its own features instead of being handed them.

What's actually new: learning the features

Classical machine learning leans on humans to engineer good features: you decide what to measure, and the model learns weights over those hand-picked inputs. That works until the features are too subtle to name — what is the feature that distinguishes a cat from a dog in raw pixels?

Deep learning's defining move is representation learning: instead of being given features, the network learns them, layer by layer, from raw data. Early layers pick up simple patterns (edges, in an image), later layers compose those into complex ones (textures, then shapes, then faces). "Deep" just means many layers stacked, so the representations can build on each other. That's the main reason it tends to beat classical methods on images, audio, and language: it discovers the features we couldn't specify.

The neuron: weighted sum, then a bend

The building block is the artificial neuron. It takes inputs $x_1, \dots, x_n$, multiplies each by a weight $w_i$, adds a bias $b$, and passes the result through a non-linear activation function $\sigma$:

$$a = \sigma\!\left( \sum_{i=1}^{n} w_i x_i + b \right)$$

The weighted sum has the same form as linear regression. The crucial extra is $\sigma$, the non-linearity, and it's not optional. Without it, stacking layers is pointless: a composition of linear maps is still just one linear map, so a deep network would collapse to a single-layer one and could only ever draw straight boundaries. The activation is what lets depth buy you expressive power.
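
To see the collapse concretely, here is a minimal NumPy sketch. The weights, biases and input are made-up values chosen so the arithmetic is easy to follow by hand; nothing here comes from a real model.

```python
import numpy as np

# Hypothetical 2 -> 3 -> 2 network with hand-picked numbers.
W1 = np.array([[1.0, -2.0], [-1.0, 1.0], [0.5, 0.5]])
b1 = np.array([0.5, 0.0, -1.0])
W2 = np.array([[1.0, 1.0, 1.0], [2.0, 0.0, -1.0]])
b2 = np.array([0.0, 1.0])
x = np.array([1.0, 2.0])

# Two layers stacked with no activation in between...
stacked = W2 @ (W1 @ x + b1) + b2

# ...give the same answer as one layer with merged weights and bias.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
single = W_merged @ x + b_merged

# Put a ReLU between the layers and the single merged layer no longer matches.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print(stacked, single)  # [-1.  -4.5] [-1.  -4.5]
print(with_relu)        # [1.5 0.5]
```

The merged pair `W_merged`, `b_merged` is the "one linear map" the paragraph refers to; the ReLU is what stops the two layers from folding into it.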

The modern default is ReLU, $\sigma(z) = \max(0, z)$. It's dead simple, and because its slope is exactly 1 for any positive input, it largely avoids the vanishing-gradient problem that plagued the older S-shaped sigmoid (whose slope never exceeds 0.25) in deep stacks.
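
A rough way to feel the difference is to multiply the activation slopes along a path through the stack. This is a deliberately simplified sketch: it ignores the weights entirely and assumes a depth of 20, so read it as intuition rather than a model of real training.

```python
import math

def sigmoid_grad(z):
    """Slope of the sigmoid at z; it peaks at 0.25 when z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

depth = 20  # assumed number of layers, for illustration only

# Best case for the sigmoid: every layer sits at its steepest point.
sigmoid_product = sigmoid_grad(0.0) ** depth

# ReLU along an active path (positive input at every layer).
relu_product = relu_grad(1.0) ** depth

print(sigmoid_product)  # roughly 9e-13
print(relu_product)     # 1.0
```

Even in the sigmoid's best case the signal reaching the early layers shrinks by a factor of about a trillion over 20 layers, while an active ReLU path passes it through unchanged.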

A layer is a matrix multiply

A layer is just many neurons computed at once. Stack their weights into a matrix $W$ and their biases into a vector $\mathbf{b}$, and the whole layer is one clean expression:

$$\mathbf{a} = \sigma\!\left( W\mathbf{x} + \mathbf{b} \right)$$

This is why linear algebra is the language of deep learning, and why GPUs matter: they're built to do exactly this, enormous matrix multiplies, in parallel. A deep network just chains these: $\mathbf{a}^{(1)} = \sigma(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)})$, then $\mathbf{a}^{(2)} = \sigma(W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)})$, and so on to the output.
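
As a quick check that the matrix form really is "many neurons at once", this sketch computes a three-neuron layer both ways. The weights and input are hypothetical, and `neuron` is just a helper name introduced for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neuron(w, x, b):
    """One neuron: weighted sum plus bias, then the activation."""
    return relu(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)

# Hypothetical layer: 3 neurons, each with 2 inputs (one row of W per neuron).
W = np.array([[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]])
b = np.array([0.0, -1.0, 1.0])
x = np.array([1.0, 2.0])

one_by_one = np.array([neuron(W[i], x, b[i]) for i in range(len(b))])
as_matrix = relu(W @ x + b)

print(one_by_one)  # [0.  0.5 3. ]
print(as_matrix)   # [0.  0.5 3. ]
```

Both routes give the same activations; the matrix version simply hands the whole job to one optimised multiply.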

The forward pass

Running input through the chain to get a prediction is the forward pass — feed in the data, multiply-add-activate layer after layer, read off the answer at the end. With fixed weights that's all a trained network does to make a prediction. The interesting question is how those weights got to be any good, which is the rest of this page.
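
A forward pass is short enough to write out in full. The sketch below assumes ReLU after every layer, matching the formula above; real networks usually swap in a different function at the output (softmax for classification, say). The weights are invented for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """Chain the layers: multiply, add, activate, then pass the result on."""
    a = x
    for W, b in layers:
        a = relu(W @ a + b)
    return a

# Hypothetical fixed weights for a 2 -> 2 -> 1 network.
layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, -1.0])),
    (np.array([[2.0, 4.0]]), np.array([0.5])),
]
x = np.array([1.0, 2.0])

prediction = forward(x, layers)
print(prediction)  # [2.5]
```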

The training loop: the forward pass turns inputs into a prediction; the loss measures how wrong it is; backpropagation pushes that error backward to get each weight's gradient; gradient descent nudges every weight downhill. Repeat millions of times.

Loss & gradient descent

To improve, the network needs a number for how wrong it is: the loss $L$ (mean squared error for regression, cross-entropy for classification). Training is then an optimisation problem: find the weights that make $L$ as small as possible.

With millions of weights there's no formula for the minimum, so we walk toward it. Gradient descent computes the gradient of the loss with respect to every weight — the direction of steepest increase — and steps the opposite way:

$$w \;\leftarrow\; w - \eta\, \frac{\partial L}{\partial w}$$

The learning rate $\eta$ sets the step size, and it's a delicate knob: too small and training crawls or stalls in a poor spot; too large and it overshoots and diverges. In practice we use stochastic (mini-batch) gradient descent: estimate the gradient from a small batch of examples at a time, which is far cheaper per step, and the noise this introduces can help escape bad minima.
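
The learning-rate trade-off shows up even with a single weight. This sketch uses a toy loss, $(w - 3)^2$, picked only because its minimum is obvious; the three learning rates and the 50-step budget are assumptions for illustration.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimise the toy loss (w - 3)^2, whose gradient is 2 * (w - 3)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = w - lr * grad  # step against the gradient
    return w

good = gradient_descent(lr=0.1)      # settles very close to 3
slow = gradient_descent(lr=0.001)    # has barely left the start after 50 steps
diverged = gradient_descent(lr=1.1)  # overshoots further on every step

print(good, slow, diverged)
```

With `lr=0.1` each step shrinks the distance to the minimum by a factor of 0.8; with `lr=1.1` each step multiplies it by -1.2, so the weight swings to the other side and lands further away every time.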

Backpropagation: the chain rule at scale

One question remains: how do you get $\partial L / \partial w$ for a weight buried deep in the stack, when the loss is computed only at the very end? Backpropagation is the answer, and it's nothing more exotic than the chain rule from calculus, applied systematically.

The error at the output is propagated backward through the network. The chain rule says the loss's sensitivity to an early weight is the product of the local sensitivities along the path from that weight to the loss:

$$\frac{\partial L}{\partial w^{(l)}} = \frac{\partial L}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial w^{(l)}}$$
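
The "path" is hiding inside the first factor. Written out for a network with $N$ layers (the symbol $N$ is my own notation here, since $L$ is already taken by the loss), and treating each activation as a single number to keep it readable:

$$\frac{\partial L}{\partial a^{(l)}} = \frac{\partial L}{\partial a^{(N)}} \cdot \frac{\partial a^{(N)}}{\partial a^{(N-1)}} \cdots \frac{\partial a^{(l+1)}}{\partial a^{(l)}}$$

Each factor is one layer's local sensitivity. In a real network these are matrices (Jacobians) rather than single numbers, but the structure is the same.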

By reusing the quantities it already computed for later layers, backprop gets the gradient for every weight in a single backward sweep — efficiently enough to train networks with billions of parameters. Forward pass to get the prediction, backward pass to get all the gradients, one gradient-descent step, repeat. That loop, run at scale, is deep learning.
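
Here is that loop's backward half on the smallest network that still has something to propagate through. It is a sketch under stated assumptions, not a general implementation: one hidden ReLU layer, a linear output, a squared-error loss of $\tfrac{1}{2}(y - t)^2$, and hand-picked weights. The finite-difference check at the end is a common way to test a hand-written gradient.

```python
import numpy as np

def forward(params, x):
    W1, b1, w2, b2 = params
    z1 = W1 @ x + b1            # hidden pre-activation
    a1 = np.maximum(0.0, z1)    # ReLU
    y = w2 @ a1 + b2            # linear output (a single number)
    return z1, a1, y

def loss(params, x, t):
    _, _, y = forward(params, x)
    return 0.5 * (y - t) ** 2

def backward(params, x, t):
    """Chain rule from the loss back to every parameter, reusing each step."""
    W1, b1, w2, b2 = params
    z1, a1, y = forward(params, x)
    dy = y - t                  # dL/dy
    dw2, db2 = dy * a1, dy      # output-layer gradients
    da1 = dy * w2               # error pushed back to the hidden layer
    dz1 = da1 * (z1 > 0)        # through the ReLU: slope 1 if active, else 0
    dW1, db1 = np.outer(dz1, x), dz1
    return dW1, db1, dw2, db2

# Hypothetical parameters and one training example.
params = (
    np.array([[1.0, -1.0], [0.5, 0.5]]),  # W1
    np.array([0.0, 0.5]),                 # b1
    np.array([3.0, -2.0]),                # w2
    1.0,                                  # b2
)
x, t = np.array([1.0, 2.0]), 1.0

dW1, db1, dw2, db2 = backward(params, x, t)

# Sanity check one entry of dW1 against a finite-difference estimate.
eps = 1e-6
W_plus, W_minus = params[0].copy(), params[0].copy()
W_plus[1, 0] += eps
W_minus[1, 0] -= eps
numeric = (loss((W_plus, *params[1:]), x, t)
           - loss((W_minus, *params[1:]), x, t)) / (2 * eps)

print(dW1[1, 0], numeric)  # both close to 8
```

Notice that `dy` is computed once and reused for every gradient below it, and `dz1` is reused for both `dW1` and `db1`. That reuse is what makes a single backward sweep enough.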

CNNs, RNNs & transformers

The general recipe is the same; the architectures differ in how they wire the layers to match the structure of the data:

CNNs (convolutional networks) — for images. Instead of connecting every pixel to every neuron, they slide small filters across the image, sharing weights. This bakes in the idea that a feature (an edge, a texture) means the same thing wherever it appears, and slashes the parameter count.
RNNs (recurrent networks) — for sequences (text, time series). They carry a hidden state forward step by step, giving the network a memory of what came before. Powerful but hard to train over long sequences (vanishing gradients again).
Transformers — the architecture behind modern language models. Their attention mechanism lets every position look directly at every other, capturing long-range relationships without stepping through a sequence — and it parallelises beautifully, which is why it scaled to today's giant models.
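
Of the three, attention is the easiest to show in a few lines. This is a heavily simplified sketch of scaled dot-product self-attention: it uses the input itself as queries, keys and values, whereas real transformers apply learned projections first and run several heads in parallel. The function name `self_attention` and the 4-by-3 random input are assumptions for the example.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum before exponentiating, for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X):
    """Simplified self-attention: queries, keys and values are all X itself."""
    d = X.shape[-1]
    # Row i holds how much position i attends to every position.
    weights = softmax(X @ X.T / np.sqrt(d))
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # hypothetical: 4 positions, 3 features each

output, weights = self_attention(X)
print(weights.shape)  # (4, 4): every position scores every other
print(output.shape)   # (4, 3): one mixed vector per position
```

The whole thing is two matrix multiplies and a softmax, with no loop over positions, which is the parallelism the transformer entry above is pointing at.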

Why now — and the honest limits

The core ideas are decades old. What changed was a coincidence of three things: data (the internet made huge labelled datasets possible), compute (GPUs made the matrix maths cheap), and tricks (ReLU, dropout, better initialisation, attention) that made deep networks actually trainable. Together they tipped deep learning from a curiosity to the dominant approach. The honest limits: deep networks are data-hungry, expensive to train, opaque, and prone to overfitting, so they are not always the right tool.

Where it shows up in my work

Knowing when not to go deep

In a government-analyst setting the most useful thing this understanding buys is judgement about when deep learning is the wrong tool. For the structured, tabular data most analysis runs on — and where every decision needs to be explained and defended — a transparent model usually beats an opaque deep one. Knowing what's inside the black box is what lets me say so with confidence rather than reaching for it because it's fashionable.

Where deep learning does earn its place is unstructured data — text, documents, imagery — and there the foundations here (it's matrix multiplies trained by gradient descent; it's data-hungry and opaque; transformers power the language models increasingly part of the toolkit) are exactly what's needed to use it critically rather than credulously.

Refresh in 60 seconds

A neural net is stacked matrix multiplies + non-linear activations; "deep" = many layers. Its superpower is representation learning — it learns features instead of being handed them.
A neuron: $\sigma(\sum w_i x_i + b)$. The activation $\sigma$ (e.g. ReLU) is essential: without it, depth collapses to one linear layer.
A layer is $\sigma(W\mathbf{x}+\mathbf{b})$ (hence GPUs + linear algebra). The forward pass chains layers to a prediction.
Train by minimising a loss with gradient descent: $w \leftarrow w - \eta\,\partial L/\partial w$. Learning rate $\eta$ is delicate; use stochastic mini-batches.
Backpropagation = the chain rule run backward to get every gradient in one sweep. Watch the vanishing gradient in deep stacks.
Families: CNNs (images), RNNs (sequences), transformers (attention → modern LLMs). Limits: data-hungry, expensive, black-box, overfits — not always the right tool.

The backprop/gradient-descent split, vanishing-gradient and learning-rate cautions reflect current deep-learning references alongside ML coursework.