Linear Algebra

The grammar underneath every model. Data is vectors, transformations are matrices — and once you see it that way, machine learning stops being magic and starts being geometry.

Studied: Linear Algebra, Bachelor of Science · Data Science core
When: UniMelb, 2019–2022
Applied in: PCA · embeddings · regression
Read / Refreshed: ~15 min read · 2026-06-24

Almost everything in data science is, underneath, linear algebra. A dataset is a matrix. A row is a vector. Training a linear model solves a system of equations. The word embeddings from the NLP page are vectors whose angles encode meaning. PCA, recommendation engines, the attention mechanism in a Transformer — all of it is built from a small set of operations on vectors and matrices.

This is the foundation page I'd hand my past self before any machine learning. The goal isn't to push symbols around; it's to build the geometric intuition that makes the rest click — vectors as arrows, matrices as transformations, and the few decompositions that quietly run modern data science.

01

Why it's the language of data

Organise any dataset into a table — rows are examples, columns are features — and you have a matrix. One row (one customer, one document, one image flattened out) is a vector: an ordered list of numbers, equivalently a point or an arrow in space. A dataset of 1,000 examples with 20 features is a 1000×20 matrix; each example lives as a point in 20-dimensional space.

That reframing is the whole payoff. "Find similar customers" becomes "find nearby points". "Reduce 20 features to 2" becomes "project onto a plane". "Fit a linear model" becomes "solve a system". Linear algebra is just the toolkit for measuring, moving, and simplifying points in space — and data is points in space.
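A minimal sketch of that reframing, assuming NumPy and a made-up random dataset (the names X, query, and nearest are illustrative, not from any real project). "Find similar customers" turns into a distance calculation over rows:

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded so the example is repeatable
X = rng.normal(size=(1000, 20))           # made-up data: 1,000 examples x 20 features

query = X[0]                              # one row is one point in 20-dimensional space
distances = np.linalg.norm(X - query, axis=1)   # Euclidean distance from every row to the query
distances[0] = np.inf                     # exclude the query itself
nearest = int(np.argmin(distances))       # index of the most similar example

print(X.shape, query.shape, nearest)
```

Euclidean distance is only one reasonable choice of "nearby"; cosine similarity is another common one.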

02

Vectors and vector spaces

A vector is an ordered list of numbers, written as a column. Geometrically it's an arrow from the origin to a point. You can do two things to vectors, and everything else is built from them:

  • Add them — tip to tail ([1,2] + [3,1] = [4,3]).
  • Scale them by a number (a scalar) — stretch or flip (2·[1,2] = [2,4]).

Combine those — scale several vectors and add the results — and you get a linear combination. The set of all linear combinations of some vectors is their span. A basis is a minimal set of vectors whose span is the whole space; the number of them is the dimension. The familiar 3D space has the basis x, y, z — three independent directions, and every point is a unique combination of them.
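The two operations and a linear combination can be sketched in a few lines, assuming NumPy. The vectors a and b are the ones from the bullets above; target and coeffs are illustrative names of my own:

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 1])

total = a + b            # tip to tail: [4, 3]
scaled = 2 * a           # stretch by a scalar: [2, 4]
combo = 2 * a - 1 * b    # a linear combination: [-1, 3]

# a and b are independent, so they form a basis of the plane:
# any target point has exactly one pair of coefficients.
target = np.array([5.0, 5.0])
B = np.column_stack([a, b])            # basis vectors as columns
coeffs = np.linalg.solve(B, target)    # solves c1*a + c2*b = target, giving c1 = 2, c2 = 1
```

The unique pair of coefficients is what "every point is a unique combination of the basis" means in practice.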

03

Dot product, norms, projection

The dot product multiplies two vectors element-wise and sums the result — turning two vectors into a single number that measures how much they point the same way.

a · b = Σᵢ aᵢbᵢ = ‖a‖ ‖b‖ cos θ

From it you get two essentials. The norm (length) of a vector is ‖a‖ = √(a · a), the Pythagorean length. And rearranging the formula gives the cosine of the angle between two vectors, which is exactly cosine similarity:

cos θ = (a · b) / (‖a‖ ‖b‖)

This is the same cosine similarity that compares word embeddings: meaning becomes geometry, and "related" becomes "small angle". When the dot product is zero the vectors are orthogonal — at right angles, sharing nothing.
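As a small worked example of those formulas, assuming NumPy (the helper cosine_similarity and the three vectors are hypothetical, chosen so the arithmetic is easy to check by hand):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| ||b||); assumes neither vector is all zeros."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
c = np.array([-4.0, 3.0])

print(np.dot(a, b))              # 24.0  (3*4 + 4*3)
print(np.linalg.norm(a))         # 5.0   (sqrt(3^2 + 4^2))
print(cosine_similarity(a, b))   # 0.96  (24 / (5 * 5))
print(np.dot(a, c))              # 0.0   -> a and c are orthogonal
```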

Figure: projection of a onto b. The dot product measures how much of a points along b. Orthogonal vectors (right angle) have a dot product of zero.
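The projection itself is one line once you have the dot product. A sketch assuming NumPy and the standard formula proj = ((a · b) / (b · b)) b; the function name project is my own:

```python
import numpy as np

def project(a, b):
    """Vector projection of a onto b; assumes b is not the zero vector."""
    return (np.dot(a, b) / np.dot(b, b)) * b

a = np.array([2.0, 3.0])
b = np.array([4.0, 0.0])

p = project(a, b)        # the part of a that lies along b: [2, 0]
residual = a - p         # what is left over: [0, 3]

print(p, residual, np.dot(residual, b))   # the leftover is orthogonal to b
```

Splitting a vector into "the part along b" and "the part orthogonal to b" is the same move least-squares regression makes, just in more dimensions.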

04

Matrices as linear maps

Here's the idea that unlocks everything: a matrix is a function that transforms space. Multiplying a vector by a matrix moves it (rotating, stretching, shearing, or projecting it) while keeping grid lines straight and evenly spaced and the origin fixed.

The trick to reading a matrix: its columns are where the basis vectors land. A 2×2 matrix's first column says where [1,0] goes and its second column says where [0,1] goes. Because any vector is a combination of the basis, knowing where the basis lands tells you where everything lands:

[ 2 0 ; 0 3 ] · [ x ; y ] = [ 2x ; 3y ]

That matrix stretches the x-direction by 2 and the y-direction by 3. Swap in different numbers and you get rotation, reflection, or a shear — the same single operation, "apply the linear map", every time.
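To check the "columns are where the basis vectors land" reading, here is a quick sketch assuming NumPy, using the same stretch matrix as above (the test vector v is arbitrary):

```python
import numpy as np

A = np.array([[2, 0],
              [0, 3]])

e1 = np.array([1, 0])
e2 = np.array([0, 1])

print(A @ e1)    # [2 0]  -> first column of A
print(A @ e2)    # [0 3]  -> second column of A

v = np.array([4, 5])                   # v = 4*e1 + 5*e2
print(A @ v)                           # [ 8 15]
print(4 * (A @ e1) + 5 * (A @ e2))     # [ 8 15], same answer by linearity
```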

05

Matrix multiplication

Matrix multiplication looks like an arbitrary rule when you first meet it — rows times columns, sum the products. It isn't arbitrary at all: multiplying two matrices is composing their transformations. AB means "do B, then do A" — the same as nesting functions f(g(x)).

That one insight explains the rest of the rules:

  • Dimensions must line up (the inner sizes match) because the output of one transformation has to be a valid input to the next.
  • Order matters: AB ≠ BA in general, because rotating then stretching is not the same as stretching then rotating.
  • The identity matrix I (ones on the diagonal) is the "do nothing" map; AI = A.
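A small sketch of composition and order, assuming NumPy. R and S are example matrices I picked (a 90-degree rotation and a stretch along x), not anything special:

```python
import numpy as np

R = np.array([[0, -1],
              [1,  0]])    # rotate 90 degrees counter-clockwise
S = np.array([[2, 0],
              [0, 1]])     # stretch x by 2
I = np.eye(2, dtype=int)   # the "do nothing" map

v = np.array([1, 0])

rotate_then_stretch = S @ (R @ v)   # R first, then S: [0, 1]
stretch_then_rotate = R @ (S @ v)   # S first, then R: [0, 2]

print((S @ R) @ v)                     # same as rotate_then_stretch: product = composition
print(np.array_equal(S @ R, R @ S))    # False: order matters
print(np.array_equal(S @ I, S))        # True: AI = A
```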

06

Systems, rank, invertibility

A system of linear equations is just Ax = b: given a transformation A and a target b, find the input x that lands on it. Solving the system is running the transformation in reverse.

Whether you can reverse it depends on the rank — the number of genuinely independent directions in the matrix (the dimension of its column span). If a matrix squashes space into a lower dimension — say a 3D map that flattens everything onto a plane — it has lost information and can't be undone. Two key cases:

  • Full rank (a square matrix with independent columns): the map is reversible, an inverse A⁻¹ exists, and Ax = b has exactly one solution, x = A⁻¹b.
  • Rank-deficient (some columns are redundant): the map collapses dimensions, no inverse exists, and the system has either no solution or infinitely many. In data terms, redundant columns mean collinear features — a real and common headache in regression.
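Both cases can be sketched with NumPy's rank and solver routines. The matrices A and C and the vector b below are made-up examples; C is built so that its second column is exactly twice its first:

```python
import numpy as np

# Full rank: two independent columns, so A x = b has exactly one solution.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

rank_A = np.linalg.matrix_rank(A)     # 2
x = np.linalg.solve(A, b)             # approximately [0.8, 1.4]

# Rank-deficient: the second column is redundant (collinear with the first).
C = np.array([[1.0, 2.0],
              [2.0, 4.0]])
rank_C = np.linalg.matrix_rank(C)     # 1
is_singular = rank_C < C.shape[1]     # True: no inverse exists

print(rank_A, x, rank_C, is_singular)
```

In practice np.linalg.solve is usually preferred over forming A⁻¹ explicitly; it gives the same x with less work and less rounding error.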

07

Eigenvalues and eigenvectors

Most vectors get knocked off their line when you apply a matrix: they change direction as well as length. But for most transformations you meet in practice, a few special vectors stay on their own line and are merely scaled (a pure rotation in the plane is the standard exception, with no real eigenvectors). Those are the eigenvectors, and the scaling factor is the eigenvalue:

A v = λ v

Read it as: applying the transformation A to v does the same thing as simply stretching v by the number λ. Eigenvectors are the transformation's "natural axes" — the directions it acts on most simply. An eigenvalue of 2 means that direction is doubled; 1 means it's unchanged; a negative one means it's flipped.

This matters for data because the covariance matrix of a dataset has eigenvectors that point along the directions of greatest variance — the axes the data actually spreads along. That's the engine of PCA, and it's a short step from there to the SVD.
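A sketch of both ideas, assuming NumPy: first checking A v = λ v on a small symmetric matrix, then finding the direction of greatest variance in a made-up 2-D dataset. All names and numbers here are illustrative:

```python
import numpy as np

# Part 1: verify A v = lambda v for every eigenpair.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)   # eigenvectors are the COLUMNS
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)         # transformed vector = scaled vector

# Part 2: eigenvectors of a covariance matrix point along the data's spread.
rng = np.random.default_rng(0)
# Made-up data: wide spread along x (std 3), narrow along y (std 0.5).
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
cov = np.cov(X, rowvar=False)                  # 2x2 covariance of the two features
variances, axes = np.linalg.eigh(cov)          # eigh: symmetric input, eigenvalues ascending
top_axis = axes[:, -1]                         # direction of greatest variance, close to +/-[1, 0]

print(sorted(eigenvalues), top_axis)
```

The eigenvalues of the covariance matrix are the variances along those axes, which is why PCA ranks components by eigenvalue.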

08

The SVD — the crown jewel

The Singular Value Decomposition is the result everything else has been building toward. It says any matrix at all — square or not — can be broken into three simple pieces:

A = U Σ Vᵀ

Every linear map, however tangled it looks, is really just a rotation or reflection (Vᵀ), a stretch along the axes (Σ), and another rotation or reflection (U). The diagonal of Σ holds the singular values: how much the map stretches along each direction, conventionally sorted largest first.
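To see the three pieces concretely, here is a sketch assuming NumPy's np.linalg.svd, which returns the singular values as a 1-D array rather than a full Σ matrix. The 2×3 matrix A is an arbitrary non-square example:

```python
import numpy as np

A = np.array([[ 3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])     # 2x3, deliberately not square

# full_matrices=False returns the compact form: U is 2x2, s has 2 entries, Vt is 2x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_rebuilt = U @ np.diag(s) @ Vt      # multiply the three pieces back together

print(s)                                   # singular values, largest first
print(np.allclose(A, A_rebuilt))           # True: the factors reproduce A
print(np.allclose(U.T @ U, np.eye(2)))     # True: U has orthonormal columns
print(np.allclose(Vt @ Vt.T, np.eye(2)))   # True: Vt has orthonormal rows
```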

Figure: A = U Σ Vᵀ. The SVD factors any m×n matrix A into an orthogonal matrix U (a rotation, possibly with a reflection), a diagonal stretch Σ (singular values, largest first), and another orthogonal matrix Vᵀ. Keeping only the largest few singular values gives the best low-rank approximation of A.

The reason the SVD is everywhere: keep only the k largest singular values and you get the best possible rank-k approximation of the matrix, in the sense of smallest squared error (the Eckart-Young theorem). It packs the most information into the fewest numbers. That single idea powers:

  • PCA and dimensionality reduction — compress 100 correlated features into the 5 directions that carry the signal.
  • Image and data compression — store a big matrix as a few small ones with almost no visible loss.
  • Recommendation systems — factor a sparse user-by-item ratings matrix into latent taste vectors.
  • Noise reduction and latent semantics — the small singular values are usually noise; drop them and the structure remains.
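The truncation step behind all four uses can be sketched like this, assuming NumPy. The helper low_rank_approx and the synthetic "rank-2 signal plus small noise" matrix are hypothetical, built only to make the effect visible:

```python
import numpy as np

def low_rank_approx(A, k):
    """Keep the k largest singular values of A and rebuild a rank-k matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
# Made-up data: a rank-2 signal (50x2 times 2x30) plus a little noise.
signal = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 30))
noisy = signal + 0.01 * rng.normal(size=(50, 30))

approx = low_rank_approx(noisy, k=2)
relative_error = np.linalg.norm(noisy - approx) / np.linalg.norm(noisy)

print(np.linalg.matrix_rank(approx))   # 2
print(relative_error)                  # small: two directions carry almost everything
```

Storing the truncated factors takes 50·2 + 2 + 2·30 numbers instead of 50·30, which is the compression argument in miniature.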

09

Where it shows up in my work

10

Refresh in 60 seconds