Linear Algebra

Almost everything in data science is, underneath, linear algebra. A dataset is a matrix. A row is a vector. Training a linear model solves a system of equations. The word embeddings from the NLP page are vectors whose angles encode meaning. PCA, recommendation engines, the attention mechanism in a Transformer — all of it is built from a small set of operations on vectors and matrices.

This is the foundation page I'd hand my past self before any machine learning. The goal isn't to push symbols around; it's to build the geometric intuition that makes the rest click — vectors as arrows, matrices as transformations, and the few decompositions that quietly run modern data science.

Why it's the language of data

Organise any dataset into a table — rows are examples, columns are features — and you have a matrix. One row (one customer, one document, one image flattened out) is a vector: an ordered list of numbers, equivalently a point or an arrow in space. A dataset of 1,000 examples with 20 features is a 1000×20 matrix; each example lives as a point in 20-dimensional space.

That reframing is the whole payoff. "Find similar customers" becomes "find nearby points". "Reduce 20 features to 2" becomes "project onto a plane". "Fit a linear model" becomes "solve a system". Linear algebra is just the toolkit for measuring, moving, and simplifying points in space — and data is points in space.
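To make "find similar customers becomes find nearby points" concrete, here is a minimal sketch with NumPy. The tiny dataset and the helper name nearest_rows are made up for illustration, and Euclidean distance is just one reasonable choice of "nearby".

```python
import numpy as np

# Hypothetical toy dataset: 5 customers (rows) x 3 features (columns).
X = np.array([
    [1.0, 2.0, 0.0],
    [1.1, 1.9, 0.1],
    [5.0, 0.0, 3.0],
    [0.9, 2.2, 0.0],
    [4.8, 0.3, 2.9],
])

def nearest_rows(X, query_index, k=2):
    """Return the indices of the k rows closest (Euclidean) to row query_index."""
    diffs = X - X[query_index]             # subtract the query row from every row
    dists = np.linalg.norm(diffs, axis=1)  # one distance per row
    order = np.argsort(dists)              # closest first
    return [int(i) for i in order if i != query_index][:k]

print(X.shape)             # (5, 3): 5 points in 3-dimensional space
print(nearest_rows(X, 0))  # [1, 3]
```

Customers 1 and 3 sit right next to customer 0 in feature space, which is all "similar" means here.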

Vectors and vector spaces

A vector is an ordered list of numbers, written as a column. Geometrically it's an arrow from the origin to a point. You can do two things to vectors, and everything else is built from them:

Add them — tip to tail ([1,2] + [3,1] = [4,3]).
Scale them by a number (a scalar) — stretch or flip (2·[1,2] = [2,4]).

Combine those — scale several vectors and add the results — and you get a linear combination. The set of all linear combinations of some vectors is their span. A basis is a minimal set of vectors whose span is the whole space; the number of them is the dimension. The familiar 3D space has the basis x, y, z — three independent directions, and every point is a unique combination of them.
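A small sketch of those ideas in NumPy, using the same two vectors as the examples above. Treating a and b as a basis and the target [5, 5] are my own choices for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 1.0])

total = a + b        # tip to tail: [4, 3]
scaled = 2 * a       # stretch by 2: [2, 4]
combo = 2 * a - b    # a linear combination: [-1, 3]

# a and b are independent, so they form a basis of the plane.
# Finding the coordinates of a target in that basis means solving
#   c1 * a + c2 * b = target
# which is a small linear system with a and b as columns.
B = np.column_stack([a, b])
target = np.array([5.0, 5.0])
coeffs = np.linalg.solve(B, target)   # [2, 1], i.e. target = 2a + 1b

print(coeffs)
print(np.allclose(coeffs[0] * a + coeffs[1] * b, target))  # True
```

The coefficients are unique because the two vectors are independent; with a redundant vector in the set there would be many ways to reach the same target.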

Dot product, norms, projection

The dot product multiplies two vectors element-wise and sums the result — turning two vectors into a single number that measures how much they point the same way.

\mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i = \|\mathbf{a}\|\,\|\mathbf{b}\|\cos\theta

From it you get two essentials. The norm (length) of a vector is ‖a‖ = √(a · a), the Pythagorean distance from the origin to its tip. And rearranging the formula gives the cosine of the angle between two vectors, which is exactly cosine similarity:

\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}

This is the same cosine similarity that compares word embeddings: meaning becomes geometry, and "related" becomes "small angle". When the dot product is zero the vectors are orthogonal — at right angles, sharing nothing.

Projection of a onto b: the dot product, divided by ‖b‖, measures how much of a points along b. Orthogonal vectors (right angle) have a dot product of zero.
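Here is the whole section as a few lines of NumPy. The vectors are picked so the numbers come out round; nothing about them is special.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 0.0])

dot = a @ b                           # 3*4 + 4*0 = 12
norm_a = np.sqrt(a @ a)               # 5, the Pythagorean length
norm_b = np.linalg.norm(b)            # 4, same thing via the library call
cos_theta = dot / (norm_a * norm_b)   # 0.6, the cosine similarity

# Projection of a onto b.
scalar_proj = dot / norm_b            # 3: the length of a's shadow along b
vector_proj = (dot / (b @ b)) * b     # [3, 0]: that shadow as a vector

# A vector at right angles to a has zero dot product with it.
perp = np.array([-4.0, 3.0])
print(a @ perp)                       # 0.0
```

Note that the raw dot product (12) depends on both lengths; dividing by ‖b‖ gives the projection, and dividing by both norms gives the angle alone.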

Matrices as linear maps

Here's the idea that unlocks everything: a matrix is a function that transforms space. Multiplying a vector by a matrix moves it (rotating, stretching, shearing, or projecting it) while keeping grid lines straight and evenly spaced and the origin fixed.

The trick to reading a matrix: its columns are where the basis vectors land. A 2×2 matrix's first column says where [1,0] goes and its second column says where [0,1] goes. Because any vector is a combination of the basis, knowing where the basis lands tells you where everything lands:

\begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 2x \\ 3y \end{bmatrix}

That matrix stretches the x-direction by 2 and the y-direction by 3. Swap in different numbers and you get rotation, reflection, or a shear — the same single operation, "apply the linear map", every time.
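A quick way to convince yourself that the columns are where the basis vectors land is to check it numerically. This sketch uses the stretch matrix from the equation above plus a 90-degree rotation that I added as a second example.

```python
import numpy as np

S = np.array([[2.0, 0.0],
              [0.0, 3.0]])    # the stretch matrix from the text
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

# Each basis vector lands on the matching column of the matrix.
print(np.array_equal(S @ e1, S[:, 0]))   # True
print(np.array_equal(S @ e2, S[:, 1]))   # True

# Any other vector is x*e1 + y*e2, so it lands on x*col0 + y*col1.
v = np.array([1.0, 1.0])
landed = S @ v                           # [2, 3]
same = v[0] * S[:, 0] + v[1] * S[:, 1]   # [2, 3] again

# Swap in different numbers and the same operation becomes a rotation.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = R @ e1                         # approximately [0, 1]
```

The rotation result is only approximately [0, 1] because cos(π/2) is a tiny non-zero number in floating point.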

Matrix multiplication

Matrix multiplication looks like an arbitrary rule when you first meet it — rows times columns, sum the products. It isn't arbitrary at all: multiplying two matrices is composing their transformations. AB means "do B, then do A" — the same as nesting functions f(g(x)).

That one insight explains the rest of the rules:

Dimensions must line up (the inner sizes match) because the output of one transformation has to be a valid input to the next.
Order matters — AB ≠ BA in general — because rotating then stretching is not the same as stretching then rotating.
The identity matrix I (ones on the diagonal, zeros elsewhere) is the "do nothing" map; AI = IA = A.
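All three rules can be checked in a few lines. The matrices R (rotate) and S (stretch) and the shapes of P and Q below are my own picks for illustration.

```python
import numpy as np

R = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # rotate 90 degrees counter-clockwise
S = np.array([[2.0, 0.0],
              [0.0, 1.0]])    # stretch the x-direction by 2
x = np.array([1.0, 0.0])

# S @ R means "do R, then do S", exactly like nesting functions.
rotate_then_stretch = (S @ R) @ x   # [0, 1]
stretch_then_rotate = (R @ S) @ x   # [0, 2]
print(np.allclose((S @ R) @ x, S @ (R @ x)))   # True: composition
print(np.array_equal(S @ R, R @ S))            # False: order matters

# Inner sizes must match: (2x3) times (3x4) gives (2x4).
P = np.ones((2, 3))
Q = np.ones((3, 4))
print((P @ Q).shape)                           # (2, 4)
# Q @ P would raise ValueError, because 4 does not match 2.

# The identity is the "do nothing" map.
I = np.eye(2)
print(np.array_equal(S @ I, S))                # True
```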

Systems, rank, invertibility

A system of linear equations is just Ax = b: given a transformation A and a target b, find the input x that lands on it. Solving the system is running the transformation in reverse.

Whether you can reverse it depends on the rank — the number of genuinely independent directions in the matrix (the dimension of its column span). If a matrix squashes space into a lower dimension — say a 3D map that flattens everything onto a plane — it has lost information and can't be undone. Two key cases:

Full rank (a square matrix with independent columns): the map is reversible, an inverse A⁻¹ exists, and Ax = b has exactly one solution, x = A⁻¹b.
Rank-deficient (some columns are redundant): the map collapses dimensions, no inverse exists, and the system has either no solution or infinitely many. In data terms, redundant columns mean collinear features — a real and common headache in regression.
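The two cases look like this in NumPy. The matrices are toy examples I made up, and the last one imitates a feature table whose third column is just the sum of the first two.

```python
import numpy as np

# Full rank: one solution.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
rank_A = np.linalg.matrix_rank(A)   # 2
x = np.linalg.solve(A, b)           # [0.8, 1.4]

# Rank-deficient: the second column is twice the first.
D = np.array([[1.0, 2.0],
              [2.0, 4.0]])
rank_D = np.linalg.matrix_rank(D)   # 1
try:
    np.linalg.solve(D, b)
    singular = False
except np.linalg.LinAlgError:
    # NumPy refuses to solve with an exactly singular matrix.
    singular = True

# Collinear features: column 2 = column 0 + column 1.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [3.0, 1.0, 4.0]])
rank_X = np.linalg.matrix_rank(X)   # 2, not 3

print(rank_A, x, rank_D, singular, rank_X)
```

Checking the rank of the feature matrix before fitting is a cheap way to spot redundant columns. Real data is rarely exactly collinear, so in practice near-collinearity is the more common version of the same problem.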

Eigenvalues and eigenvectors

Most vectors get knocked off their line when you apply a matrix: they change both length and direction. But many transformations have a few special vectors that stay on their own line and are merely scaled (a pure 2D rotation is the classic exception, with none in real numbers). Those are the eigenvectors, and the scaling factor is the eigenvalue:

A\mathbf{v} = \lambda\mathbf{v}

Read it as: applying the transformation A to v does the same thing as simply stretching v by the number λ. Eigenvectors are the transformation's "natural axes" — the directions it acts on most simply. An eigenvalue of 2 means that direction is doubled; 1 means it's unchanged; a negative one means it's flipped.
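The defining equation is easy to verify numerically. This sketch uses a small symmetric matrix of my own choosing, plus a rotation to show what happens when no real direction survives.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose COLUMNS
# are the matching eigenvectors.
eigvals, eigvecs = np.linalg.eig(A)

checks = []
for lam, v in zip(eigvals, eigvecs.T):
    # Applying A should do the same thing as scaling by lam.
    checks.append(bool(np.allclose(A @ v, lam * v)))
print(sorted(eigvals), checks)   # eigenvalues 1 and 3, both checks True

# An ordinary vector gets knocked off its line.
u = np.array([1.0, 0.0])
Au = A @ u                       # [2, 1], not a multiple of u

# A 90-degree rotation moves every real direction, so its
# eigenvalues come out complex.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
rot_vals = np.linalg.eigvals(R)  # +1j and -1j
```

For this matrix the eigenvectors lie along [1, 1] (tripled) and [1, -1] (left unchanged), which are the natural axes the text describes.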

This matters for data because the eigenvectors of a dataset's covariance matrix point along the axes the data actually spreads along, and the one with the largest eigenvalue marks the direction of greatest variance. That's the engine of PCA, and it's a short step from there to the SVD.
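Here is a minimal sketch of that claim on synthetic data. I generate points stretched along the direction [1, 1] (an assumption made purely so the right answer is known in advance) and check that the top eigenvector of the covariance matrix recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: lots of spread along [1, 1], a little along [1, -1].
t = rng.normal(size=500)
noise = 0.1 * rng.normal(size=500)
X = np.column_stack([t + noise, t - noise])

Xc = X - X.mean(axis=0)               # centre each feature
C = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix

# eigh is for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
top = eigvecs[:, -1]                  # direction of greatest variance

expected = np.array([1.0, 1.0]) / np.sqrt(2)
alignment = abs(top @ expected)       # close to 1 (the sign of top is arbitrary)

# Projecting onto that direction gives the first principal component,
# and its variance is the top eigenvalue.
scores = Xc @ top
print(alignment, np.var(scores, ddof=1), eigvals[-1])
```

The sign of an eigenvector is arbitrary, which is why the check uses the absolute value of the dot product.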

The SVD — the crown jewel

The Singular Value Decomposition is the result everything else has been building toward. It says any matrix at all — square or not — can be broken into three simple pieces:

A = U\,\Sigma\,V^{\top}

Every linear map, however tangled it looks, is really just a rotation (Vᵀ), a stretch along the axes (Σ), and another rotation (U), where each "rotation" is an orthogonal matrix and may include a reflection. The diagonal of Σ holds the singular values: non-negative numbers that say how much the map stretches along each direction, sorted largest first.

The SVD factors any m×n matrix A into a rotation U, a diagonal stretch Σ (singular values, largest first), and a rotation Vᵀ. Keeping only the largest few singular values gives the best low-rank approximation of A.
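The factorisation can be checked directly with np.linalg.svd. The 3×2 matrix is an arbitrary non-square example, and full_matrices=False asks for the compact form so the shapes multiply back without padding.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])    # 3x2, deliberately not square

# NumPy returns the singular values as a 1-D array s, not as the matrix Sigma.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)   # (3, 2) (2,) (2, 2)

# Rotate, stretch, rotate: the three pieces multiply back to A.
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A_rebuilt, A))    # True

# The outer factors have orthonormal columns / rows (no stretching of their own).
print(np.allclose(U.T @ U, np.eye(2)))     # True
print(np.allclose(Vt @ Vt.T, np.eye(2)))   # True

# Singular values are non-negative and sorted largest first.
print(s)
```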

The reason the SVD is everywhere: keep only the largest few singular values and you get the best possible low-rank approximation of the matrix — the most information in the fewest numbers. That single idea powers:

PCA and dimensionality reduction — compress 100 correlated features into the 5 directions that carry the signal.
Image and data compression — store a big matrix as a few small ones with almost no visible loss.
Recommendation systems — factor a sparse user-by-item ratings matrix into latent taste vectors.
Noise reduction and latent semantics — the small singular values are usually noise; drop them and the structure remains.
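The low-rank idea behind all four uses can be sketched in a few lines. The data here is synthetic (a rank-2 signal plus a little noise, my assumption for the demo) and low_rank is a hypothetical helper name, not a library function.

```python
import numpy as np

def low_rank(A, k):
    """Rank-k approximation of A built from its k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)

# Hypothetical 50x20 matrix: rank-2 structure buried under small noise.
signal = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 20))
A = signal + 0.01 * rng.normal(size=(50, 20))

A2 = low_rank(A, 2)
s = np.linalg.svd(A, compute_uv=False)

# The approximation error (Frobenius norm) is exactly the energy in the
# singular values that were dropped (the Eckart-Young result).
err = np.linalg.norm(A - A2)
dropped = np.sqrt(np.sum(s[2:] ** 2))
rel_err = err / np.linalg.norm(A)

# Storage: the full matrix versus the rank-2 pieces (two columns of U,
# two singular values, two rows of Vt).
full_numbers = A.size              # 50 * 20 = 1000
compact_numbers = 2 * (50 + 20 + 1)  # 142

print(np.isclose(err, dropped), rel_err, full_numbers, compact_numbers)
```

On this data the relative error is tiny because the discarded singular values are almost entirely noise; on real data, a plot of the singular values is the usual way to decide how many to keep.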

Where it shows up in my work

The base under everything

Linear algebra never shows up labelled "linear algebra"; it's the layer below the tools. When I ran PCA on multivariate data to cut dimensions before clustering, that was eigenvectors of the covariance matrix. When I fit a linear or logistic regression, the solver is solving Ax = b in disguise, and a fit that fails to converge on collinear features is a rank-deficiency problem. When the Climate Fact-Checker ranked evidence by cosine similarity, that was the dot-product geometry from the "Dot product, norms, projection" section.

Knowing the algebra underneath is what lets me debug a model instead of just rerunning it — recognising that "the regression blew up" usually means "two of my columns are telling the same story", and that "compress these features" and "find the main directions of variation" are the same SVD question.
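Both of those debugging instincts can be checked numerically. This is a sketch on random data I generated for the purpose: the first half shows that the covariance eigenvalues and the singular values of the centred data are the same information, and the second shows how a duplicated "story" in the columns shows up as a huge condition number.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset: 200 examples, 4 correlated features.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)
n = len(Xc)

# Route 1: eigenvalues of the covariance matrix (sorted largest first).
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Route 2: singular values of the centred data.
s = np.linalg.svd(Xc, compute_uv=False)

# Same question, same answer: eigenvalue = singular value squared / (n - 1).
same_answer = bool(np.allclose(eigvals, s ** 2 / (n - 1)))

# "Two of my columns are telling the same story": add a column that is
# the sum of two others and the condition number explodes.
X_collinear = np.column_stack([Xc, Xc[:, 0] + Xc[:, 1]])
cond_before = np.linalg.cond(Xc)
cond_after = np.linalg.cond(X_collinear)

print(same_answer, cond_before, cond_after)
```

A condition number that large means tiny changes in the data can swing the fitted coefficients wildly, which is what "the regression blew up" usually looks like from the inside.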

Refresh in 60 seconds

Data is vectors (points in space); a dataset is a matrix. Linear algebra is how you measure and move points.
Dot product a·b = ‖a‖‖b‖cos θ gives length, angle, and cosine similarity. Zero = orthogonal.
A matrix is a transformation; its columns show where the basis vectors land. Multiplication = composition (AB = do B then A), so order matters.
Rank = independent directions. Full rank (square matrix) → invertible, one solution to Ax = b. Rank-deficient → collinear features, no clean inverse.
Eigenvectors (Av = λv) keep their direction and only scale — the natural axes of a transformation, and the basis of PCA.
The SVD (A = UΣVᵀ) breaks any matrix into rotate–stretch–rotate. Keep the top singular values → best low-rank approximation → PCA, compression, recommenders, denoising.