Calculus & Optimisation

Here's the secret that demystifies machine learning: training a model is an optimisation problem. You define a loss — a single number for how wrong the model is — and then you search for the settings that make it as small as possible. Calculus is the tool that makes that search possible, because the derivative tells you, from any point, which direction reduces the loss.

This page completes the core maths foundation alongside linear algebra (the shape of data) and probability and statistics (its uncertainty). Calculus is the third leg: the maths of change and of finding the best answer.

Two questions, one toolkit

Calculus answers two questions that turn out to be deeply linked:

How fast is something changing? — the realm of the derivative. The slope of a curve, the speed of a process, the sensitivity of an output to an input.
Where is the best (highest or lowest) point? — the realm of optimisation. The peak of a profit curve, the bottom of a loss surface.

They're linked because the best point is exactly where the rate of change hits zero — at the very top of a hill or bottom of a valley, the slope is momentarily flat. So if you can compute slopes, you can find optima. That single bridge is the whole reason calculus runs machine learning.

The derivative

The derivative of a function measures its instantaneous rate of change — the slope of the curve at a point. Formally it's the limit of "rise over run" as the run shrinks to nothing:

f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

The intuition matters more than the limit: zoom in on any smooth curve far enough and it looks like a straight line — the derivative is that line's slope. A positive derivative means the function is rising; negative means falling; zero means flat, which is the signal you're at a peak, a valley, or a plateau. That last fact is the one optimisation hangs everything on.
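To make "rise over run as the run shrinks" concrete, here is a minimal sketch that estimates a slope numerically. It uses a central difference (a step either side of x) rather than the one-sided limit above, purely because it is more accurate for the same step size; the helper name numerical_derivative and the test function square are made up for illustration.

```python
def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) with a central difference (rise over run for a tiny run)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def square(x):
    return x * x


# The exact derivative of x**2 is 2x, so the slope at x = 3 should be close to 6.
slope_at_3 = numerical_derivative(square, 3.0)
# At x = 0 the parabola is flat: this is its lowest point.
slope_at_0 = numerical_derivative(square, 0.0)
print(slope_at_3, slope_at_0)
```

The flat slope at x = 0 is exactly the "zero means flat" signal described above.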

The gradient

Real models don't have one knob; they have thousands or billions. When a function has many inputs, the derivative generalises to the gradient — the vector of partial derivatives, one per input, each measuring how the output changes as you nudge that one variable and hold the rest still:

\nabla f = \left[\, \frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \dots,\ \frac{\partial f}{\partial x_n} \,\right]

The gradient has a beautiful geometric meaning: it points in the direction of steepest ascent — the way you'd walk to climb the surface fastest — and its length says how steep that climb is. To go down as fast as possible, you simply walk in the opposite direction, −∇f. Hold onto that: it is the entire idea behind training.
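A small sketch of that idea, assuming a made-up two-input surface f(x, y) = x² + 3y²: nudge one input at a time to build the gradient, then check that a short step along −∇f really does lower the function. The names numerical_gradient and bowl are illustrative, not from any library.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient: nudge one coordinate at a time, hold the rest still."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def bowl(p):
    # Hypothetical example surface: f(x, y) = x^2 + 3*y^2
    x, y = p
    return x * x + 3 * y * y


point = [1.0, 2.0]
grad = numerical_gradient(bowl, point)   # exact gradient is [2x, 6y] = [2, 12]
downhill = [-g for g in grad]            # direction of steepest descent

# A small step along -grad should lower the function value.
step = 0.01
moved = [p + step * d for p, d in zip(point, downhill)]
print(grad, bowl(point), bowl(moved))
```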

Finding the best answer

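Because the slope is flat at a peak or trough, optimisation begins by looking for points where the gradient is zero, called stationary points. Setting ∇f = 0 and solving gives the candidates. To tell which kind each one is, you check the second derivative (the curvature); with many inputs, that means checking the curvature along every direction, which is what the matrix of second derivatives (the Hessian) records: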

Curving up (positive) → a minimum — a valley.
Curving down (negative) → a maximum — a peak.
A mix across dimensions → a saddle point — up one way, down another, like a mountain pass.
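Two tiny worked examples of that test, using functions chosen here only because they are easy to differentiate by hand:

```latex
% One input: f(x) = x^3 - 3x
f'(x) = 3x^2 - 3 = 0 \;\Rightarrow\; x = \pm 1
f''(x) = 6x, \qquad f''(1) = 6 > 0 \ \text{(minimum)}, \qquad f''(-1) = -6 < 0 \ \text{(maximum)}

% Two inputs: g(x, y) = x^2 - y^2
\nabla g = [\, 2x,\ -2y \,] = 0 \;\Rightarrow\; (x, y) = (0, 0)
\frac{\partial^2 g}{\partial x^2} = 2 > 0, \qquad \frac{\partial^2 g}{\partial y^2} = -2 < 0 \ \text{(saddle point)}
```

In the second example the surface curves up along x and down along y, which is the mountain-pass shape.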

For simple functions you can solve ∇f = 0 by hand. For the tangled loss surfaces of real models you can't — there's no closed-form answer — so you need an algorithm that walks to the minimum instead. That algorithm is gradient descent.

Convexity

The single property that decides whether optimisation is easy or hard is convexity. A convex function is bowl-shaped: any local minimum is automatically the global one, and if the bowl is strictly curved (strictly convex) there is only one bottom. A non-convex function is a mountain range of bumps, with many valleys and only one of them deepest, and an algorithm can get stuck in a shallow one, mistaking a local minimum for the best answer.

Convex (left): one bowl, so the local minimum is the global minimum, and gradient descent with a suitably small learning rate converges to it. Non-convex (right): several valleys, so descent can settle in a local minimum that isn't the deepest.

Classic methods like linear and logistic regression have convex losses, so a sensibly tuned optimiser can reliably reach the best fit. Neural networks are wildly non-convex, which is why training them is part art, and why tricks like good initialisation, momentum, and randomness matter so much. In practice, the local minima that training reaches in very high dimensions are often nearly as good as the global one, which is a big part of why deep learning works as well as it does.
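Getting stuck is easy to see in one dimension. The sketch below assumes a made-up non-convex curve, f(x) = x⁴ − 3x² + x, which has two valleys of different depth, and runs the same plain descent loop from two starting points. The function names and the learning rate are illustrative choices.

```python
def f(x):
    # Hypothetical non-convex curve with two valleys of different depth.
    return x**4 - 3 * x**2 + x


def df(x):
    # Its derivative, worked out by hand.
    return 4 * x**3 - 6 * x + 1


def descend(x, learning_rate=0.01, steps=2000):
    """Plain gradient descent in one dimension."""
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x


from_left = descend(-2.0)    # rolls into the deeper valley, near x = -1.30
from_right = descend(2.0)    # rolls into the shallower valley, near x = 1.13
print(from_left, f(from_left))
print(from_right, f(from_right))
```

Both runs end where the slope is zero, but only the run that started on the left finds the deeper valley. Nothing in the algorithm tells the right-hand run that a better answer exists.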

Gradient descent

Gradient descent is the workhorse algorithm of modern machine learning, and it's almost embarrassingly simple: from wherever you are, compute the downhill direction and take a small step that way. Repeat until you stop moving. As an update rule for the parameters θ:

\theta \leftarrow \theta - \eta\,\nabla L(\theta)

The loss L is how wrong the model is, ∇L is the uphill direction (so we subtract it to go down), and η (eta) is the learning rate — the step size, and the single most important knob to tune:

Too small and training crawls, taking forever to converge.
Too large and you overshoot the valley, bouncing across it or diverging entirely.
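Here is the update rule as a minimal sketch, run on a made-up loss L(θ) = θ² (whose gradient is 2θ) with three learning rates so the crawl, converge, and diverge behaviours show up side by side. The function names and the specific values of η are assumptions for illustration.

```python
def gradient_descent(grad, theta, learning_rate, steps):
    """Repeat theta <- theta - learning_rate * grad(theta) and record the path."""
    path = [theta]
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
        path.append(theta)
    return path


def grad_loss(theta):
    # Gradient of the hypothetical loss L(theta) = theta**2
    return 2 * theta


for eta in (0.001, 0.1, 1.1):
    final = gradient_descent(grad_loss, 5.0, eta, 50)[-1]
    print(f"eta={eta}: theta after 50 steps = {final:.4g}")
```

With η = 0.001 the parameter has barely moved from 5 after 50 steps, with η = 0.1 it is essentially at the minimum, and with η = 1.1 every step overshoots by more than the last, so the value grows instead of shrinking.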

In practice you rarely use the whole dataset for each step. Instead you estimate the gradient from a small random batch, which is faster and adds helpful noise that can bounce you out of bad local minima. That's stochastic gradient descent, and variants of it (such as Adam and RMSProp) train most neural networks in use today.
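A minimal sketch of the mini-batch idea, assuming toy data generated from y = 2x + 1 with a little noise and a one-feature linear model. The batch size, learning rate, and step count are arbitrary illustrative choices, and this is plain SGD rather than Adam or RMSProp.

```python
import random

random.seed(0)

# Hypothetical toy data: y = 2x + 1 plus a little noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.1)) for x in xs]


def batch_gradient(w, b, batch):
    """Gradient of mean squared error over one batch, for the model w*x + b."""
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = (w * x + b) - y
        grad_w += 2.0 * error * x
        grad_b += 2.0 * error
    n = len(batch)
    return grad_w / n, grad_b / n


w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    batch = random.sample(data, 16)   # small random batch, not the full dataset
    grad_w, grad_b = batch_gradient(w, b, batch)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)   # should land near the true values 2 and 1
```

Each step sees only 16 of the 200 points, so individual gradients are noisy, yet the parameters still settle close to the values that generated the data.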

Gradient descent: each step moves opposite the gradient (downhill) by an amount set by the learning rate. The steps shrink as the slope flattens near the minimum.

The chain rule and backprop

To run gradient descent on a deep model you need the gradient of the loss with respect to every parameter, even those buried many layers deep. The tool that delivers it is the chain rule — calculus's rule for differentiating nested functions:

\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx}

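It says the sensitivity of an output to a distant input is the product of the sensitivities along the chain between them. A neural network is exactly such a chain, with each layer a function feeding the next, so the chain rule lets you compute how a weight in layer one affects the final loss, by multiplying the local derivatives along the path. Backpropagation (backprop) is this rule applied in reverse: start at the loss and work back layer by layer, so that every parameter's gradient comes out of a single backward sweep.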
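To see the multiplication along the path, here is a hand-written sketch for a made-up two-step chain: h = tanh(w1·x), y = w2·h, loss = (y − target)². The network, the weights, and the input are all illustrative assumptions, and a finite-difference estimate is included as a sanity check on the hand-derived gradient.

```python
import math


def forward(w1, w2, x, target):
    """Two nested 'layers': h = tanh(w1 * x), y = w2 * h, loss = (y - target)^2."""
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = (y - target) ** 2
    return h, y, loss


def backward(w1, w2, x, target):
    """Multiply local derivatives along the chain, starting from the loss."""
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2.0 * (y - target)
    dloss_dw2 = dloss_dy * h          # y = w2 * h, so dy/dw2 = h
    dloss_dh = dloss_dy * w2          # dy/dh = w2
    dh_dz = 1.0 - h * h               # derivative of tanh(z), with z = w1 * x
    dloss_dw1 = dloss_dh * dh_dz * x  # dz/dw1 = x
    return dloss_dw1, dloss_dw2


w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)

# Sanity check against a finite-difference estimate for the layer-one weight.
eps = 1e-6
loss_up = forward(w1 + eps, w2, x, target)[2]
loss_down = forward(w1 - eps, w2, x, target)[2]
numeric_w1 = (loss_up - loss_down) / (2 * eps)
print(grad_w1, numeric_w1)
```

Notice that dloss_dy is computed once and reused for both weights. That reuse of upstream results is what makes a single backward sweep cheaper than differentiating each parameter separately.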

Constrained optimisation

Often you can't optimise freely — there are constraints. Maximise a portfolio's return subject to a risk budget; minimise cost subject to meeting demand. The classic tool is the method of Lagrange multipliers, which folds each constraint into the objective with a new variable that prices how much the constraint "costs" at the optimum.

This is the bridge to operations research — linear programming, resource allocation, scheduling — where the whole problem is "find the best decision within hard limits". The same gradient thinking applies, now walking the boundary of what's allowed rather than the open surface.
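A small worked example of the Lagrange multiplier method, using a problem made up for illustration: maximise the product xy when x and y must add up to 10. The symbol \mathcal{L} for the combined objective and λ for the multiplier are the usual textbook names, not notation from earlier on this page.

```latex
% Maximise f(x, y) = xy subject to x + y = 10
\mathcal{L}(x, y, \lambda) = xy - \lambda\,(x + y - 10)
\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \qquad \frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, \qquad x + y = 10
x = y = 5, \qquad \lambda = 5, \qquad f(5, 5) = 25
```

The multiplier is the "price" of the constraint: loosen the budget from 10 to 11 and the best product becomes 5.5 × 5.5 = 30.25, a gain of about λ = 5.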

Where it shows up in my work

Refresh in 60 seconds

Calculus answers "how fast is it changing?" (derivative) and "where's the best point?" (optimisation) — linked because the best point has zero slope.
The gradient ∇f is the vector of partials; it points uphill (steepest ascent), so −∇f points downhill.
Optima sit where ∇f = 0; curvature (second derivative) says min, max, or saddle.
Convex = one bowl, so descent with a sensible learning rate reaches the global minimum. Non-convex (neural nets) = many valleys, can get stuck.
Gradient descent: θ ← θ − η∇L. The learning rate η is the key knob — too small crawls, too big diverges. SGD uses random batches.
Backprop = the chain rule in reverse, computing every parameter's gradient in one backward sweep. Training = chain rule → descend → repeat.
Constrained optimisation (Lagrange) handles hard limits — the bridge to operations research.