PCA & Dimensionality Reduction
Most high-dimensional data sits close to a lower-dimensional surface. PCA finds the best flat approximation to that surface, the few directions that carry the signal, and lets you throw the rest away, often with very little loss.
- Studied: Multivariate Statistics (PCA) · Master of Data Science (83/H1)
- When: UniMelb, 2023–2024
- Applied in: Pre-clustering · compression · EDA
- Read / Refreshed: ~16 min read · 2026-06-25
Real datasets are wide — hundreds or thousands of columns — but most of those columns are correlated, redundant, or noise. Dimensionality reduction compresses that width down to a handful of directions that capture what actually varies, and Principal Component Analysis (PCA) is the classic, linear way to do it. It's where the linear algebra of eigenvectors and the statistics of variance meet and become genuinely useful.
The payoff is everywhere: faster models, plots you can actually see, de-noised data, and a cure for the correlated-feature problems that break regression. This page builds PCA from the ground up — why high dimensions hurt, what "principal component" really means, and exactly how the maths finds them.
01
The curse of dimensionality
High-dimensional space is deeply unintuitive, and it works against you in ways that have a name: the curse of dimensionality. As you add features, the volume of the space grows exponentially, so your data points become hopelessly sparse — everything is far from everything else, and the notion of "nearby" that powers clustering and nearest-neighbours quietly breaks down.
Concretely, more dimensions mean:
- Sparsity — you'd need exponentially more data to densely cover the space, so models have little to learn from.
- Distance concentration — in high dimensions, the nearest and farthest points end up almost equidistant, so similarity becomes meaningless.
- Overfitting and cost — more features give a model more ways to fit noise, and everything runs slower.
The saving grace is that real data rarely fills its space. Pixels in a face photo, answers on a survey, sensor readings — they're heavily correlated, so the data actually clusters on a much lower-dimensional surface inside the high-dimensional box. PCA's job is to find that surface.
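Distance concentration is easy to see numerically. The sketch below is an illustration only: it assumes points drawn uniformly from the unit hypercube, and the helper name `relative_contrast`, the seed, and the sample size are made up for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability


def relative_contrast(d, n_points=500):
    """(farthest - nearest) / nearest distance from one query point to a uniform cloud."""
    points = rng.random((n_points, d))   # uniform in the d-dimensional unit hypercube
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"d={d:>4}  relative contrast={c:.2f}")
```

In low dimensions the contrast is large (some points really are much closer than others); as the dimension grows it shrinks towards zero, which is the "almost equidistant" effect described above.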
02
The core idea: variance is signal
PCA rests on one assumption: the directions in which the data varies most are the most informative. A feature that's the same for every point tells you nothing; a feature that spreads points far apart carries information that distinguishes them. So PCA looks for new axes — ordered by how much the data varies along them — and keeps only the top few.
These new axes, the principal components, have two defining properties: each one points along the direction of maximum remaining variance, and they're all orthogonal (mutually perpendicular), so the data's coordinates along them are uncorrelated. The first captures the most spread, the second the most of what's left, and so on. Keep the first two or three and you've often kept the bulk of the structure in a form you can plot and compute with cheaply.
03
Variance and the covariance matrix
To find directions of maximum variance you first need to measure how the features vary together. Start by centring the data: subtract each feature's mean so the cloud sits at the origin. Then the covariance matrix summarises all the pairwise relationships. For a centred data matrix $X$ with $n$ rows (one per observation) and $d$ columns (one per feature),
$$C = \frac{1}{n-1} X^\top X$$
C is symmetric and d×d (one row/column per feature). Its diagonal holds each feature's variance; off-diagonals hold covariances.
Each diagonal entry $C_{jj}$ is the variance of feature $j$; each off-diagonal $C_{jk}$ is the covariance between features $j$ and $k$, positive if they rise together, negative if one rises as the other falls. This one matrix encodes the entire shape of the data cloud, and the principal directions are hiding inside it.
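As a minimal sketch of that step, the snippet below centres a small synthetic dataset and builds the covariance matrix by hand. The data, the seed, and the variable names (`X_raw`, `X`, `C`) are assumptions made for illustration, and the sample covariance uses the $n-1$ divisor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 rows, 3 features; the second feature is built to track the first.
x1 = rng.normal(size=200)
X_raw = np.column_stack([
    x1,
    2.0 * x1 + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

X = X_raw - X_raw.mean(axis=0)   # centre each feature (column)
n = X.shape[0]
C = X.T @ X / (n - 1)            # d x d sample covariance matrix

print(np.round(C, 2))
```

The result should agree with `np.cov(X_raw, rowvar=False)`, which performs the centring internally.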
04
Principal components
Here's the elegant result that makes PCA work: the principal components are exactly the eigenvectors of the covariance matrix $C$, and each one's eigenvalue is the variance captured along it.
$$C\,v_i = \lambda_i v_i$$
Sort the eigenvectors by their eigenvalues, largest first, and you have your new axes in order of importance: $v_1$ (the first principal component) is the direction of greatest variance, $v_2$ the next, and so on. They come out mutually orthogonal because a symmetric matrix always has an orthonormal set of eigenvectors. Maximising the projected variance $v^\top C v$ over unit vectors $v$ is solving this eigenvalue problem; that's the whole theorem in one line.
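A small numerical check of this result, under assumed toy data (a 2-D Gaussian cloud stretched along the diagonal; the seed and names are illustrative). It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so they are re-sorted largest first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D cloud: std 2.0 along one axis, 0.5 along the other, then rotated by 45 degrees.
X = rng.normal(size=(500, 2)) @ np.diag([2.0, 0.5])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = X @ R.T
X = X - X.mean(axis=0)                  # centre

C = X.T @ X / (len(X) - 1)              # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # ascending eigenvalues, orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]       # indices for largest-first ordering
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

v1 = eigvecs[:, 0]                      # first principal component
print("eigenvalues:", np.round(eigvals, 3))
print("PC1 direction:", np.round(v1, 3))
print("variance along PC1:", np.round(np.var(X @ v1, ddof=1), 3))
```

The variance of the data projected onto `v1` matches the largest eigenvalue, and `v1` points (up to sign) along the diagonal the cloud was stretched on.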
05
The SVD route
In practice you rarely form the covariance matrix at all. You run the Singular Value Decomposition on the centred data matrix $X$ (with $n$ rows) directly, because it's more numerically stable:
$$X = U \Sigma V^\top$$
The columns of $V$ are precisely the principal components, and the squared singular values in $\Sigma$ are proportional to the eigenvalues, $\lambda_i = \sigma_i^2 / (n-1)$, so the SVD hands you the components and their variances in one stable step. This is the same "rotate–stretch–rotate" decomposition from the linear algebra page; PCA is one of its most important applications.
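The equivalence of the two routes can be checked directly. This is a sketch on assumed synthetic data (names and seed are illustrative), using `np.linalg.svd`, which returns singular values already sorted largest first.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))   # toy correlated data
X = X - X.mean(axis=0)                                    # centre
n = X.shape[0]

U, s, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt.T               # columns are the principal components
variances = s**2 / (n - 1)      # variance captured along each component

# Cross-check against the covariance eigenvalues (eigvalsh returns ascending order).
C = X.T @ X / (n - 1)
eigvals = np.linalg.eigvalsh(C)[::-1]
print("max difference:", np.abs(variances - eigvals).max())
```

The printed difference should be at the level of floating-point round-off. Individual components can differ in sign between the two routes, since an eigenvector is only defined up to a factor of ±1.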
06
How many components to keep
Reduction means choosing where to cut. The standard tool is the proportion of variance explained: each component's eigenvalue $\lambda_i$ as a share of the total over all $d$ components tells you how much information it carries.
$$\text{PVE}_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}$$
Plot the eigenvalues in descending order and you get a scree plot: it usually drops steeply then flattens, and the "elbow" marks where extra components stop earning their keep. A common rule is to keep enough components to retain 90–95% of the total variance — often a startlingly small number, because real data is so correlated.
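A hedged sketch of the variance-retained rule: the data below are synthetic (ten features driven by two latent factors plus a little noise), and the 95% threshold is just the upper end of the range mentioned above, not a universal setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 10 observed features generated from 2 latent factors plus small noise.
latent = rng.normal(size=(400, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(400, 10))
X = X - X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)    # singular values only, largest first
eigvals = s**2 / (len(X) - 1)
ratio = eigvals / eigvals.sum()           # proportion of variance explained
cumulative = np.cumsum(ratio)

# Smallest k whose cumulative share reaches the threshold.
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)

print("explained variance ratio:", np.round(ratio, 3))
print("components needed for 95%:", k)
```

Plotting `eigvals` (or `ratio`) against the component index gives the scree plot; here the drop after the second component is the elbow.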
07
Projecting and reconstructing
Once you've chosen the top $k$ components, stack them as columns of a $d \times k$ matrix $W_k$ and project the centred data matrix $X$ onto them, a simple matrix multiply that turns each $d$-dimensional row into $k$ numbers:
$$Z = X W_k$$
$Z$ is your compressed dataset: same rows, far fewer columns, each new column an uncorrelated principal-component score. You can also run it backwards, $\hat{X} = Z W_k^\top$, to reconstruct an approximation of the original (centred) data from the few components you kept. The gap between $X$ and $\hat{X}$ is exactly the variance you discarded: the total squared reconstruction error, divided by $n-1$, equals the sum of the eigenvalues you dropped. That is why, for image or data compression, keeping the top $k$ components can store almost the whole picture in a fraction of the numbers.
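Projection and reconstruction can be sketched in a few lines. The data, the seed, the choice `k = 2`, and the variable names (`W`, `Z`, `X_hat`) are assumptions for illustration. The final two prints compare the reconstruction error with the variance carried by the discarded components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 6 features generated from 2 latent factors plus small noise.
latent = rng.normal(size=(400, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(400, 6))
X = X - X.mean(axis=0)
n, d = X.shape

U, s, Vt = np.linalg.svd(X, full_matrices=False)
eigvals = s**2 / (n - 1)

k = 2
W = Vt[:k].T          # d x k: top-k principal components as columns
Z = X @ W             # n x k: compressed scores
X_hat = Z @ W.T       # n x d: reconstruction from k components

lost = np.sum((X - X_hat) ** 2) / (n - 1)
print("reconstruction error / (n-1):", round(float(lost), 4))
print("sum of discarded eigenvalues:", round(float(eigvals[k:].sum()), 4))
```

To get back to the original units, add the feature means that were subtracted during centring to `X_hat`.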
08
What PCA can't do
PCA is powerful but it has real blind spots, and knowing them is what stops you misusing it:
- It's linear. PCA only finds flat (linear) structure. Data curled onto a curved manifold (a spiral, an S-curve) defeats it — that's when you reach for non-linear methods like t-SNE, UMAP, or kernel PCA.
- Components aren't interpretable. A principal component is a blend of all original features, so "PC1" rarely maps to a meaningful real-world quantity. You trade interpretability for compactness.
- Variance isn't always relevance. PCA assumes the high-variance directions matter most, but for a classification task the signal separating classes can live in a low-variance direction PCA throws away. (That's what LDA is for.)
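- It's scale-sensitive. PCA works on variances, so a feature measured in large units dominates the components unless you standardise the features (zero mean, unit variance) first.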
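The scale-sensitivity point in the list above is the easiest of these to demonstrate. The sketch below uses two made-up, independent features on very different scales (the names `height_m` and `income` and their distributions are purely illustrative) and compares the first component before and after standardising.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent toy features on very different scales (values are made up).
height_m = rng.normal(1.7, 0.1, size=300)
income = rng.normal(50_000, 10_000, size=300)
X = np.column_stack([height_m, income])


def first_component(data):
    """Return the first principal component (unit vector) of the given data."""
    centred = data - data.mean(axis=0)
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    return Vt[0]


pc1_raw = first_component(X)

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # zero mean, unit variance
pc1_std = first_component(X_std)

print("PC1, raw units:   ", np.round(pc1_raw, 3))
print("PC1, standardised:", np.round(pc1_std, 3))
```

On the raw data PC1 is essentially the income axis, simply because its numbers are larger. After standardising, neither feature dominates by virtue of its units.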
09
Where it shows up in my work
10