Feature Engineering & Data Preparation

There's a well-worn saying in data science: you spend 80% of your time preparing the data and 20% complaining about it. It's a joke, but the proportion is real. The model — the bit that gets the attention — is often a few lines and an afternoon. The data preparation and feature engineering — turning messy raw records into clean, informative inputs — is where most of the effort goes, and where most of the final accuracy is won or lost.

The principle underneath is blunt: garbage in, garbage out. The most sophisticated model can't rescue bad inputs, and a simple model on well-engineered features routinely beats a fancy one on raw data. This page covers the craft of that preparation, end to end, along with the one mistake that quietly invalidates more analyses than any other.

The unglamorous 80%

A feature is just an input variable the model sees. Feature engineering is the work of deciding what those inputs should be and getting the raw data into that shape: fixing what's broken, transforming what's awkward, and creating what isn't there yet. It sits right after the data wrangling on the data-processing page and right before the modelling, and it's the highest-leverage stage in the whole pipeline.

Three families of work make it up, and the rest of this page takes each in turn:

Clean — handle missing values, outliers, and wrong types so the data is trustworthy.
Transform — scale, normalise, and encode so each feature is in a form the model can use.
Create — combine and derive new features that expose the signal more directly.

Cleaning & missing data

Real data has missing values, and how you handle the gaps matters more than people expect, because why a value is missing changes what's safe to do. The standard taxonomy:

MCAR (missing completely at random) — the gap is unrelated to anything; the least harmful case.
MAR (missing at random) — the missingness depends on other observed variables (older people skip a question); recoverable if you account for those variables.
MNAR (missing not at random) — the missingness depends on the missing value itself (high earners don't disclose income). The dangerous case: the gap carries information, and naive filling biases the result.

Options run from dropping rows (fine if few and MCAR, biased otherwise) to imputation — filling with the mean/median, the most frequent category, or a model-based guess (KNN, regression). A useful trick: add a "was missing" indicator column, so the model can learn from the fact of absence itself — which matters most precisely in the MNAR case.
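
As a rough illustration of median imputation combined with a "was missing" indicator, here is a minimal sketch using only Python's standard library. The function name `impute_with_indicator` and the income figures are hypothetical, chosen for this example rather than taken from any particular library or dataset.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill None gaps with the median of the observed values.

    Returns (filled, was_missing), where was_missing is a parallel list
    of 0/1 flags that can be handed to the model as an extra feature.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing

# Hypothetical income column (in thousands) with two undisclosed values.
income = [52, None, 61, 48, None, 75]
income_filled, income_was_missing = impute_with_indicator(income)
```

The indicator column keeps the fact of absence visible even after the gap has been filled. In a real pipeline the fill value would be computed from the training rows only and then reused on the test rows, in line with the leakage rule later on this page.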

Scaling & transformations

Features arrive on wildly different scales — age in tens, income in tens of thousands. Many methods are sensitive to that, so we put features on a common footing. The most common is standardisation (the z-score): subtract the mean, divide by the standard deviation, so each feature has mean 0 and standard deviation 1:

z = \frac{x - \mu}{\sigma}

This matters enormously for any method that relies on distances or magnitudes, such as clustering, PCA, and k-NN, and for models trained by gradient descent, which tends to converge more slowly when features sit on very different scales. Without it, the largest-scaled feature dominates by sheer numerical size, regardless of its actual relevance. (Min-max scaling to a fixed [0, 1] range is the common alternative.)
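
The z-score formula and the min-max alternative can be sketched in a few lines of standard-library Python. The helper names (`fit_standardiser`, `standardise`, `min_max`) and the age and income values are assumptions made for illustration. The sketch uses the population standard deviation; using the sample version instead would change the numbers slightly.

```python
from statistics import mean, pstdev

def fit_standardiser(values):
    """Learn the mean and (population) standard deviation of one feature."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("constant feature: standard deviation is zero")
    return mu, sigma

def standardise(values, mu, sigma):
    """Apply z = (x - mu) / sigma using previously fitted statistics."""
    return [(x - mu) / sigma for x in values]

def min_max(values):
    """Rescale a feature linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant feature: range is zero")
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical features on very different raw scales.
ages = [20, 30, 40, 50, 60]
incomes = [30_000, 45_000, 60_000, 75_000, 90_000]

ages_z = standardise(ages, *fit_standardiser(ages))
incomes_z = standardise(incomes, *fit_standardiser(incomes))
ages_01 = min_max(ages)
```

After standardising, `ages_z` and `incomes_z` occupy the same range even though the raw incomes are over a thousand times larger than the ages. Keeping the fit step separate from the apply step is a deliberate choice: it lets the statistics be learned once and reused on new data.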

Separately, skewed variables — income, populations, counts — often benefit from a log transform (or Box-Cox), which pulls in a long right tail toward a more symmetric, model-friendly shape. The goal throughout is the same: present each feature in the form where its signal is easiest to use.
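
To make the effect of a log transform concrete, the sketch below compares the mean-to-median ratio of a right-skewed sample before and after the transform. The income figures are invented for the example, and `log1p` (the log of 1 + x) is one reasonable choice among several; it has the convenience of being defined at zero, which suits counts.

```python
import math
from statistics import mean, median

# Hypothetical right-skewed incomes: most are modest, one is very large.
incomes = [28_000, 35_000, 41_000, 52_000, 67_000, 1_200_000]

# log1p(x) = log(1 + x), so a zero input maps to zero rather than failing.
log_incomes = [math.log1p(x) for x in incomes]

# In a right-skewed sample the mean sits well above the median.
# A ratio close to 1 suggests a more symmetric distribution.
raw_ratio = mean(incomes) / median(incomes)
log_ratio = mean(log_incomes) / median(log_incomes)
```

On the raw scale the single large income drags the mean to several times the median; on the log scale the two nearly coincide, which is the "pulled-in tail" described above.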

Encoding categories

Models eat numbers, but much real data is categorical — a suburb, a status, a type. Encoding turns categories into numbers, and the method has to respect the data:

One-hot encoding — one binary column per category ("NSW" → [1, 0, 0]). The safe default for unordered categories, but it explodes the column count for high-cardinality fields.
Ordinal encoding — map ordered categories to ordered integers (low/med/high → 0/1/2). Correct only when the order is real; misuse invents a ranking that isn't there.
Target encoding — replace each category with the average target value for it. Powerful for high-cardinality fields (thousands of postcodes), but it peeks at the target, so it's a prime source of the leakage problem below if done carelessly.
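
The three encodings above can be sketched in plain Python as follows. All function names and data here are hypothetical. The target encoder is deliberately fitted on training rows only, and the fallback to the training-set global mean for unseen categories is one common convention, assumed here rather than prescribed by the text.

```python
from collections import defaultdict

def one_hot(values, categories):
    """One binary column per category, in the order given by `categories`."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def ordinal(values, order):
    """Map ordered categories to their position in `order`."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]

def fit_target_encoding(categories, targets):
    """Mean target per category, learned from training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

states = ["NSW", "VIC", "QLD", "NSW"]
state_one_hot = one_hot(states, categories=["NSW", "VIC", "QLD"])

risk = ["low", "high", "med", "low"]
risk_ordinal = ordinal(risk, order=["low", "med", "high"])

# Hypothetical training rows: a postcode and a 0/1 target.
train_postcodes = ["2000", "2000", "3000", "3000", "3000"]
train_targets = [1, 0, 1, 1, 0]
encoding = fit_target_encoding(train_postcodes, train_targets)
global_mean = sum(train_targets) / len(train_targets)

# Test rows reuse the fitted mapping; unseen categories get the global mean.
test_postcodes = ["2000", "4000"]
test_encoded = [encoding.get(p, global_mean) for p in test_postcodes]
```

Note that the test postcodes never contribute to `encoding`: the mapping is learned once from the training rows and then only looked up.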

Creating features: where domain knowledge pays

The most valuable step is often inventing features that expose the signal more directly than the raw data does. This is where human understanding of the problem beats any algorithm:

Date parts — a raw timestamp is nearly useless; day-of-week, month, is-weekend, or "days since last event" can be enormously predictive.
Interactions & ratios — price-per-square-metre, debt-to-income, events per day. A ratio can capture in one feature what two raw columns hide.
Binning — grouping a continuous variable into bands when the relationship isn't smooth (age brackets).
Domain features — anything your understanding of the field says should matter, made explicit so the model doesn't have to rediscover it from scratch.

Good feature creation is the closest thing to a free lunch in modelling: it's where a person who understands the problem hands the model a head start.
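
A few of the derived features listed above can be sketched with the standard library alone. The helper names, the band edges, and the sample values are all assumptions made for illustration; the right features and cut points depend on the problem at hand.

```python
from bisect import bisect_right
from datetime import datetime

def date_features(ts, last_event):
    """Derive calendar features from a raw timestamp."""
    return {
        "day_of_week": ts.weekday(),            # Monday = 0 ... Sunday = 6
        "month": ts.month,
        "is_weekend": int(ts.weekday() >= 5),
        "days_since_last_event": (ts - last_event).days,
    }

def safe_ratio(numerator, denominator):
    """Ratio feature that returns None instead of dividing by zero."""
    return numerator / denominator if denominator else None

def age_band(age, edges=(18, 35, 50, 65)):
    """Index of the age bracket: 0 is under 18, 4 is 65 and over."""
    return bisect_right(edges, age)

ts = datetime(2024, 3, 16, 14, 30)       # a Saturday
last = datetime(2024, 3, 1, 9, 0)
features = date_features(ts, last)

price_per_sqm = safe_ratio(850_000, 100)     # hypothetical sale price and area
debt_to_income = safe_ratio(24_000, 80_000)  # hypothetical debt and income
band = age_band(40)
```

Returning `None` from `safe_ratio` hands the zero-denominator case to whatever missing-value handling the pipeline already has, rather than crashing or inventing a number.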

The cardinal sin: data leakage

Here's the mistake that quietly ruins more analyses than any other, and it hides inside the very steps above. Data leakage is when information that wouldn't really be available at prediction time sneaks into the features during training. The model looks brilliant in testing and then fails in the real world — because it was secretly peeking at answers it won't have.

Leakage vs the correct order. WRONG: scale/encode using the whole dataset, then split — the test set's statistics have bled into training. RIGHT: split first, fit every transform on training data only, then apply those fitted transforms to the test set.

The classic version: you standardise or target-encode using statistics from the whole dataset, then split into train and test. Now the mean and standard deviation carry information from the test set — the model has seen a whisper of its own exam. The fix is an iron rule: split first, then fit every transform on the training data only, and apply those fitted transforms to the test set. (This is exactly why honest evaluation and held-out testing are so insistent about order.)
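
The split-first rule can be sketched as three explicit steps. This is a minimal standard-library illustration, not a full pipeline: `split_rows` is a hypothetical helper, the data is a made-up single feature, and the fixed seed is there only to make the example repeatable.

```python
import random
from statistics import mean, pstdev

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

values = [float(v) for v in range(1, 21)]    # hypothetical single feature

# 1. Split first.
train, test = split_rows(values)

# 2. Fit the transform on the training data only.
mu, sigma = mean(train), pstdev(train)

# 3. Apply those fitted statistics to both sets.
train_z = [(x - mu) / sigma for x in train]
test_z = [(x - mu) / sigma for x in test]

# The leaky version would compute the statistics from `values`, so the
# test rows would influence how the training rows are scaled.
leaky_mu, leaky_sigma = mean(values), pstdev(values)
```

The test set is scaled with `mu` and `sigma` from training, so `test_z` will generally not have mean 0 and standard deviation 1. That is expected: the test rows are being treated exactly as unseen data would be at prediction time.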

Selecting features: less can be more

More features aren't always better. Irrelevant or redundant ones add noise, invite overfitting, and worsen the curse of dimensionality. Feature selection trims the set down to the inputs that earn their place, in broadly three ways:

Filter — rank features by a simple statistic (correlation with the target, mutual information) before modelling. Fast and model-agnostic.
Wrapper — try subsets and keep what improves the model (forward/backward selection). Thorough but expensive.
Embedded — let the model select as it trains (Lasso's L1 penalty drives weak coefficients to zero; tree importances). Often the sweet spot.
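
As an example of the filter approach, the sketch below ranks features by the absolute Pearson correlation with the target. The feature names and values are invented, and correlation is only one possible scoring statistic: it captures linear relationships and will miss others.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0   # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def rank_features(features, target):
    """Sort feature names by absolute correlation with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical housing data: one useful column, one noisy, one constant.
features = {
    "floor_area": [50, 70, 90, 110, 130],
    "street_number": [12, 3, 44, 7, 21],
    "constant": [1, 1, 1, 1, 1],
}
price = [300, 410, 520, 610, 730]

ranking = rank_features(features, price)
top_feature = ranking[0][0]
```

Because the scores are computed against the target, this ranking is itself a fitted step: to stay consistent with the leakage rule above, it would be computed on the training rows only.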

Where it shows up in my work

Where the real time goes

As an analyst, this is most of the job. The data arrives messy — missing fields, inconsistent categories, timestamps that need turning into something useful — and the quality of the final answer is set here, long before any model runs. Knowing the missing-data taxonomy (is this gap MNAR and therefore informative?), when to standardise, and how to encode a high-cardinality field without leaking is the difference between a result that holds up and one that silently misleads.

And the leakage rule is the one I'm most disciplined about, because it's the failure that looks like success: a model that dazzles in testing and collapses in production has almost always been fed information it won't have at decision time. Split first, fit on train only — every time. It's unglamorous, and it's where the trustworthiness of the whole analysis is decided.

Refresh in 60 seconds

The missing-data taxonomy, encoding choices, and the split-before-fit leakage rule reflect current data-preparation references alongside coursework.