PCA & Dimensionality Reduction

Real datasets are wide — hundreds or thousands of columns — but most of those columns are correlated, redundant, or noise. Dimensionality reduction compresses that width down to a handful of directions that capture what actually varies, and Principal Component Analysis (PCA) is the classic, linear way to do it. It's where the linear algebra of eigenvectors and the statistics of variance meet and become genuinely useful.

The payoff is everywhere: faster models, plots you can actually see, de-noised data, and a cure for the correlated-feature problems that break regression. This page builds PCA from the ground up — why high dimensions hurt, what "principal component" really means, and exactly how the maths finds them.

The curse of dimensionality

High-dimensional space is deeply unintuitive, and it works against you in ways that have a name: the curse of dimensionality. As you add features, the volume of the space grows exponentially, so your data points become hopelessly sparse — everything is far from everything else, and the notion of "nearby" that powers clustering and nearest-neighbours quietly breaks down.

Concretely, more dimensions mean:

Sparsity — you'd need exponentially more data to densely cover the space, so models have little to learn from.
Distance concentration — in high dimensions, the nearest and farthest points end up almost equidistant, so similarity becomes meaningless.
Overfitting and cost — more features give a model more ways to fit noise, and everything runs slower.
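Distance concentration is easy to see numerically. The sketch below is an illustration under assumed conditions (points drawn uniformly from the unit hypercube, and a hypothetical helper named `relative_contrast`), not a general proof: it measures how much farther the farthest point is than the nearest one, relative to the nearest distance, as the dimension grows.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from one query point to a random cloud."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))   # uniform in the unit hypercube (assumption)
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, contrast in contrasts.items():
    print(f"d={dim:5d}  relative contrast = {contrast:.3f}")
```

In two dimensions the farthest point is many times farther away than the nearest; in a thousand dimensions the two distances differ by only a small fraction, which is the "almost equidistant" effect described above.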

The saving grace is that real data rarely fills its space. Pixels in a face photo, answers on a survey, sensor readings — they're heavily correlated, so the data actually clusters on a much lower-dimensional surface inside the high-dimensional box. PCA's job is to find that surface.

The core idea: variance is signal

PCA rests on one assumption: the directions in which the data varies most are the most informative. A feature that's the same for every point tells you nothing; a feature that spreads points far apart carries information that distinguishes them. So PCA looks for new axes — ordered by how much the data varies along them — and keeps only the top few.

These new axes, the principal components, have two defining properties: each one points along the direction of maximum remaining variance, and they're all orthogonal (mutually perpendicular), which makes the data's coordinates along them uncorrelated. The first captures the most spread, the second the most of what's left, and so on. When the data is strongly correlated, keeping the first two or three preserves the bulk of the structure in a form you can plot and compute with cheaply.

PCA on a 2D cloud. The data is correlated, so it stretches along a diagonal. PC1 is the direction of maximum variance; PC2 is orthogonal to it. Projecting onto PC1 alone keeps most of the spread — a 2D → 1D reduction with little loss.
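To make "direction of maximum variance" concrete, here is a brute-force sketch on a synthetic 2D cloud (the data and all variable names are assumptions for illustration). It tries a unit vector at every degree from 0 to 180 and measures the variance of the data projected onto it. Real PCA does not scan angles like this, but in 2D the scan shows the idea directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2D cloud (an assumption): y follows x plus some noise.
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)
X = np.column_stack([x, y])
X = X - X.mean(axis=0)                       # centre the cloud at the origin

# One unit vector per degree, from 0 to 180.
angles = np.linspace(0.0, np.pi, 181)
directions = np.column_stack([np.cos(angles), np.sin(angles)])   # shape (181, 2)

# Variance of the data projected onto each direction.
projected_var = (X @ directions.T).var(axis=0, ddof=1)

best = int(np.argmax(projected_var))         # direction of most spread (PC1)
worst = int(np.argmin(projected_var))        # direction of least spread (PC2)
pc1_angle = np.degrees(angles[best])
pc2_angle = np.degrees(angles[worst])
share = projected_var[best] / (projected_var[best] + projected_var[worst])

print(f"PC1 at about {pc1_angle:.0f} degrees, PC2 at about {pc2_angle:.0f} degrees")
print(f"share of variance along PC1: {share:.3f}")
```

The direction of least variance comes out perpendicular to the direction of most variance, and the first direction alone accounts for most of the spread.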

Variance and the covariance matrix

To find directions of maximum variance you first need to measure how the features vary together. Start by centring the data: subtract each feature's mean so the cloud sits at the origin. Then the covariance matrix summarises all the pairwise relationships. For a centred data matrix $X$ with $n$ rows (samples) and $d$ columns (features),

$$C = \frac{1}{n-1}\, X^{\top} X$$

$C$ is symmetric and $d \times d$, with one row and one column per feature. Each diagonal entry $C_{ii}$ is the variance of feature $i$; each off-diagonal entry $C_{ij}$ is the covariance between features $i$ and $j$, positive if they rise together and negative if one rises as the other falls. This one matrix encodes the entire shape of the data cloud, and the principal directions are hiding inside it.
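As a minimal sketch of the centring and covariance steps, assuming NumPy and a made-up dataset in which the third feature is nearly a copy of the first:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption): 200 rows, 3 features, the third nearly a copy of the first.
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
X_raw = np.column_stack([a, b, a + 0.1 * rng.normal(size=n)])

X = X_raw - X_raw.mean(axis=0)     # centre: subtract each feature's mean
C = (X.T @ X) / (n - 1)            # covariance matrix, shape (3, 3)

print(np.round(C, 3))
```

The diagonal of `C` holds the three variances, and the large entry linking features 0 and 2 reflects how closely they move together. NumPy's own `np.cov(X_raw, rowvar=False)` should give the same matrix and is the more usual call in practice.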

Principal components

Here's the elegant result that makes PCA work: the principal components are exactly the eigenvectors of the covariance matrix, and each one's eigenvalue is the variance captured along it.

$$C\,\mathbf{v}_i = \lambda_i\,\mathbf{v}_i$$

Sort the eigenvectors by their eigenvalues, largest first, and you have your new axes in order of importance: $\mathbf{v}_1$ (the first principal component) is the direction of greatest variance, $\mathbf{v}_2$ the next, and so on. They are mutually orthogonal because $C$ is symmetric: eigenvectors with distinct eigenvalues are automatically perpendicular, and those sharing an eigenvalue can always be chosen to be. The variance of the data along a unit vector $\mathbf{v}$ is $\mathbf{v}^{\top} C\,\mathbf{v}$, and maximising it subject to $\|\mathbf{v}\| = 1$ leads to exactly this eigenvalue problem; that's the whole theorem in one line.
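The eigenvector route can be sketched in a few lines. This assumes NumPy and synthetic data built from two hidden factors (both are illustrative choices, not part of the method). `np.linalg.eigh` is the routine for symmetric matrices and returns eigenvalues in ascending order, so the sketch reverses them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated data (an assumption): 300 rows, 4 features mixed from 2 latent factors.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 4))
X = X - X.mean(axis=0)
C = (X.T @ X) / (X.shape[0] - 1)

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]       # largest variance first
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]             # column i is principal component i

pc1 = eigvecs[:, 0]
var_along_pc1 = (X @ pc1).var(ddof=1)   # variance of the data projected onto PC1

print("eigenvalues:", np.round(eigvals, 4))
print("variance along PC1:", round(float(var_along_pc1), 4))
```

The variance measured along the first eigenvector matches the largest eigenvalue, and `eigvecs.T @ eigvecs` is the identity matrix, which is the orthogonality property in numerical form.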

The SVD route

In practice you often skip forming the covariance matrix and run the Singular Value Decomposition (SVD) on the centred data directly. It's more numerically stable, because computing $X^{\top} X$ squares the condition number of $X$:

$$X = U\,\Sigma\,V^{\top}$$

The columns of $V$ are precisely the principal components (up to sign), and the singular values $\sigma_i$ on the diagonal of $\Sigma$ give the eigenvalues directly: $\lambda_i = \sigma_i^2 / (n-1)$. So the SVD hands you the components and their variances in one stable step. This is the same "rotate–stretch–rotate" decomposition from the linear algebra page; PCA is one of its most important applications.
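A quick way to convince yourself that the two routes agree is to run both on the same data. The sketch below assumes NumPy and synthetic data; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated data (an assumption): 4 features driven by 2 latent factors plus noise.
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(300, 4))
X = X - X.mean(axis=0)
n = X.shape[0]

# Thin SVD: S holds singular values in descending order, Vt is V transposed.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
svd_variances = S**2 / (n - 1)

# Eigen route on the covariance matrix, reordered to descending.
eigvals, eigvecs = np.linalg.eigh((X.T @ X) / (n - 1))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Matching components agree up to sign, so |v_svd . v_eig| should be 1.
alignment = np.abs(np.sum(Vt.T * eigvecs, axis=0))

print("variances from SVD: ", np.round(svd_variances, 4))
print("eigenvalues of C:   ", np.round(eigvals, 4))
print("alignment per PC:   ", np.round(alignment, 6))
```

Note that the components sit in the rows of `Vt`, not its columns, because NumPy returns $V^{\top}$. The sign of each component is arbitrary, which is why the comparison uses the absolute value of the dot product.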

How many components to keep

Reduction means choosing where to cut. The standard tool is the proportion of variance explained: each component's eigenvalue as a share of the total tells you what fraction of the data's variance that component accounts for.

$$\text{explained}_i = \frac{\lambda_i}{\sum_{j} \lambda_j}$$

Plot the eigenvalues in descending order and you get a scree plot: it usually drops steeply then flattens, and the "elbow" marks where extra components stop earning their keep. A common rule is to keep enough components to retain 90–95% of the total variance — often a startlingly small number, because real data is so correlated.

A scree plot. Variance explained per component falls off fast; the 'elbow' (here after ~3 components) is where you stop — the later components are mostly noise.
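The variance-threshold rule takes only a few lines to apply. The eigenvalues below are made-up numbers chosen for illustration, already sorted largest first, and the 90% threshold is one of the common choices mentioned above rather than a fixed standard.

```python
import numpy as np

# Hypothetical eigenvalues, sorted largest first (illustrative numbers only).
eigvals = np.array([4.0, 2.5, 1.5, 0.8, 0.6, 0.3, 0.2, 0.1])

explained = eigvals / eigvals.sum()      # share of variance per component
cumulative = np.cumsum(explained)        # running total of variance retained

threshold = 0.90
# Index of the first running total that reaches the threshold, converted to a count.
k = int(np.searchsorted(cumulative, threshold) + 1)

print("explained: ", np.round(explained, 2))
print("cumulative:", np.round(cumulative, 2))
print("components to keep:", k)          # 5 for these numbers
```

With these numbers the first four components retain 88% of the variance and the fifth brings the total to 94%, so the rule keeps five. A scree plot of `explained` against the component index would show the same information visually.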

Projecting and reconstructing

Once you've chosen the top $k$ components, stack them as columns of a $d \times k$ matrix $W$ and project your centred data onto them. It's a single matrix multiply that turns each $d$-dimensional row into $k$ numbers:

$$Z = X\,W \qquad (n \times k,\ \text{with } k \ll d)$$

$Z$ is your compressed dataset: same rows, far fewer columns, each new column an uncorrelated principal-component score. You can also run it backwards, $\hat{X} = Z\,W^{\top}$, to reconstruct an approximation of the original (centred) data from the few components you kept; add the feature means back to return to the original coordinates. The gap between $X$ and $\hat{X}$ is exactly the variance you discarded: the total squared reconstruction error, divided by $n-1$, equals the sum of the eigenvalues you dropped. That is why, for image or data compression, keeping the top components can store almost the whole picture in a fraction of the numbers.
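Projection and reconstruction can be sketched as follows, again assuming NumPy and synthetic data with two strong latent directions (so $k = 2$ is a natural choice here, not a general recommendation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption): 5 features driven by 2 latent factors plus a little noise.
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(300, 5))
X = X - X.mean(axis=0)
n, d = X.shape
k = 2

U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k].T                 # (d, k): top-k components as columns
Z = X @ W                    # (n, k): compressed scores
X_hat = Z @ W.T              # (n, d): reconstruction from k components

variances = S**2 / (n - 1)                           # eigenvalues of the covariance matrix
discarded = variances[k:].sum()                      # variance in the dropped components
recon_error = np.sum((X - X_hat) ** 2) / (n - 1)     # scaled squared reconstruction error
score_cov = (Z.T @ Z) / (n - 1)                      # covariance of the scores

print("Z shape:", Z.shape)
print("discarded variance:  ", round(float(discarded), 6))
print("reconstruction error:", round(float(recon_error), 6))
```

The two printed quantities coincide, which is the "gap equals discarded variance" statement in numbers, and `score_cov` comes out diagonal, confirming that the new columns are uncorrelated.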

What PCA can't do

PCA is powerful but it has real blind spots, and knowing them is what stops you misusing it:

It's linear. PCA only finds flat (linear) structure. Data curled onto a curved manifold (a spiral, an S-curve) defeats it — that's when you reach for non-linear methods like t-SNE, UMAP, or kernel PCA.
Components aren't interpretable. A principal component is a blend of all original features, so "PC1" rarely maps to a meaningful real-world quantity. You trade interpretability for compactness.
Variance isn't always relevance. PCA assumes the high-variance directions matter most, but for a classification task the signal separating classes can live in a low-variance direction PCA throws away. (That's what LDA is for.)
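It's scale-sensitive. Because PCA chases variance, a feature measured in large units dominates the components simply because its numbers are bigger. When features are on different scales, standardise them (zero mean, unit variance) before running PCA.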
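The scale-sensitivity point is the easiest of these to demonstrate. The sketch below uses two hypothetical, independent features (height and weight, with made-up means and spreads) and a small helper named `first_component`, all of which are assumptions for illustration. It shows the first component switching features when the units of one column change, and standardisation removing that dependence.

```python
import numpy as np

def first_component(X):
    """Unit vector of greatest variance for X (rows are samples), via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(0)

# Hypothetical features: height in metres and weight in kilograms, independent here.
height_m = rng.normal(1.70, 0.10, size=500)
weight_kg = rng.normal(70.0, 10.0, size=500)

X_m = np.column_stack([height_m, weight_kg])
X_mm = np.column_stack([height_m * 1000.0, weight_kg])   # same data, height in millimetres

pc1_m = first_component(X_m)      # weight dominates: its spread is ~10 versus ~0.1
pc1_mm = first_component(X_mm)    # height dominates: its spread is ~100 versus ~10

# Standardising (zero mean, unit variance per feature) removes the dependence on units.
X_std = (X_m - X_m.mean(axis=0)) / X_m.std(axis=0, ddof=1)
X_mm_std = (X_mm - X_mm.mean(axis=0)) / X_mm.std(axis=0, ddof=1)

print("PC1 with height in metres:     ", np.round(pc1_m, 3))
print("PC1 with height in millimetres:", np.round(pc1_mm, 3))
print("standardised inputs identical: ", np.allclose(X_std, X_mm_std))
```

Nothing about the underlying data changed between the two runs, only the unit of one column, yet the leading component points along a different feature. After standardisation both versions are the same matrix, so PCA gives the same answer for either.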