Model Evaluation & Validation

Build a model and it will happily give you a number for how good it is. The trouble is that the obvious number (how well it fits the data it was trained on) is almost meaningless, and trusting it is one of the most common ways an analysis fools itself. Model evaluation is the discipline of measuring whether a model actually works: how well it will perform on data it has never seen, which is the only performance that matters.

It's the connective tissue of the whole machine-learning section: the thing that decides whether a model, an ensemble, or a network is worth trusting. This page covers how to do it properly: how to test honestly, and how to choose the metric that actually reflects what you care about, because the wrong metric can make a useless model look brilliant.

Why training error lies

A model's training error — how well it fits the data it learned from — is a flattering liar. A sufficiently flexible model can memorise the training set perfectly, scoring 100%, while having learned nothing that generalises. That's overfitting, straight from the bias-variance page, and it's why training accuracy is no guide to real performance.

What you actually care about is generalisation — performance on new, unseen data, the data the model will face in the real world. The entire apparatus of evaluation exists to estimate that honestly, and it all rests on one iron principle: test on data the model has never seen during training.
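To make the gap between training and unseen performance concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: the data are coin-flip labels attached to meaningless ids, and the "model" is just a lookup table that memorises the training rows. It is an illustration of the principle, not a real learning algorithm.

```python
import random


def accuracy(predict, rows):
    """Fraction of (key, label) rows that the predict function gets right."""
    return sum(predict(key) == label for key, label in rows) / len(rows)


random.seed(0)
# Hypothetical data: each row is (unique id, coin-flip label), so there is
# nothing genuine to learn from the "feature".
rows = [(i, random.randint(0, 1)) for i in range(2000)]
train_rows, test_rows = rows[:1000], rows[1000:]

# A "model" that simply memorises every training row.
memory = dict(train_rows)
# For unseen ids, fall back to the majority label seen in training.
fallback = int(sum(label for _, label in train_rows) * 2 >= len(train_rows))


def memoriser(key):
    return memory.get(key, fallback)


train_acc = accuracy(memoriser, train_rows)
test_acc = accuracy(memoriser, test_rows)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The memoriser is perfect on the rows it has seen and no better than a coin flip on the rows it has not, which is exactly the pattern training accuracy hides.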

The sacred test set

The foundational move is to split your data into parts that never mix:

Training set — the model learns from this.
Validation set — used to tune choices (which model, which hyperparameters) and compare options.
Test set — touched once, at the very end, for a final honest estimate of real-world performance. Tune against it even once and it is no longer unseen data, so its score turns optimistic.
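The three-way split can be sketched as follows. The function name, the 60/20/20 proportions, and the fixed seed are illustrative assumptions rather than a recommendation, and the shuffle makes this unsuitable for time-ordered data.

```python
import random


def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical usage: 100 row ids split 60 / 20 / 20.
train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Because the three lists are slices of one shuffled copy, no row can land in two of them, which is the property the whole scheme depends on.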

Cross-validation: every row gets a turn

A single train/validation split wastes data and is at the mercy of which rows happened to land where. k-fold cross-validation fixes both: split the data into k equal folds, then train k times, each time holding out a different fold for validation and training on the rest. Average the k scores for a far more stable, trustworthy estimate — and every row gets used for both training and validation, just never at the same time.

5-fold cross-validation. The data is split into 5 folds; each round holds out one fold (red) for validation and trains on the other four. Average the five scores. Every row is validated exactly once — a stable estimate that wastes no data.
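A minimal sketch of the k-fold loop, written from scratch so the mechanics are visible. The names `k_fold_indices`, `cross_val_score`, `fit`, and `score` are hypothetical, and the rows are assumed to be already shuffled.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, val_idx) pairs; each row is validated exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size


def cross_val_score(fit, score, rows, k=5):
    """Average a validation score over k folds.

    `fit(train_rows)` returns a model and `score(model, val_rows)` returns a
    number; both are caller-supplied placeholders.
    """
    scores = []
    for train_idx, val_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx])
        scores.append(score(model, [rows[i] for i in val_idx]))
    return sum(scores) / len(scores)


# Toy usage: the "model" is just the training mean, scored by absolute error.
def fit_mean(train_rows):
    return sum(train_rows) / len(train_rows)


def mean_abs_error(model, val_rows):
    return sum(abs(v - model) for v in val_rows) / len(val_rows)


rows = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
cv_mae = cross_val_score(fit_mean, mean_abs_error, rows, k=5)
print(f"cross-validated MAE: {cv_mae:.2f}")
```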

For imbalanced classes, use stratified k-fold, which keeps each fold's class ratio the same as the whole — otherwise a rare class might be absent from some folds entirely. And for time series, never shuffle: use forward-chaining (train on the past, validate on the future) so you don't leak tomorrow into today.
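Both variants can be sketched in a few lines. This is a simplified illustration under stated assumptions: the stratified version deals rows out round-robin within each class (a real implementation would usually shuffle within classes first), and the forward-chaining version assumes rows are sorted oldest-first and drops any leftover rows at the end. All function names are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign each row index to one of k folds, class by class, so every
    fold keeps roughly the overall class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal this class's rows out round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def forward_chaining_splits(n_rows, n_splits=3):
    """Yield (train_idx, val_idx) where validation always comes after
    training in time. Rows are assumed to be sorted oldest-first."""
    block = n_rows // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_idx = list(range(0, i * block))
        val_idx = list(range(i * block, (i + 1) * block))
        yield train_idx, val_idx


# Hypothetical imbalanced labels: 5 positives in 50 rows (10%).
labels = [1] * 5 + [0] * 45
folds = stratified_folds(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)

for train_idx, val_idx in forward_chaining_splits(12, n_splits=3):
    print("train up to row", train_idx[-1], "validate on rows", val_idx)
```

Every fold ends up with exactly one positive, and every forward-chaining validation block starts after its training block ends.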

Scoring regression

For predicting a number, the common metrics measure how far predictions sit from the truth:

MAE (mean absolute error) — the average size of the error, in the original units. Easy to interpret, and less sensitive to outliers than RMSE.
RMSE (root mean squared error) — squares the errors before averaging, so it punishes large errors harder. Use it when big misses are especially bad.
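R² — the fraction of variance explained, usually between 0 and 1 (it goes negative on held-out data when the model does worse than predicting the mean); a scale-free sense of how much better the model is than just predicting the mean.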

MAE vs RMSE isn't a detail — it encodes how you feel about big errors, and the model you pick can differ depending on which you optimise.
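The three regression metrics above can be written out directly from their definitions. The numbers below are made up for illustration: one set of predictions with two small misses, and the same set with a single error of 10 added.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large misses dominate."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r_squared(y_true, y_pred):
    """1 minus (model's squared error / squared error of predicting the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3, 5, 7, 9]
close = [2, 5, 8, 9]          # two small misses
one_big_miss = [2, 5, 8, 19]  # same, plus one error of 10

for name, y_pred in [("close", close), ("one big miss", one_big_miss)]:
    print(name, mae(y_true, y_pred), round(rmse(y_true, y_pred), 2),
          round(r_squared(y_true, y_pred), 2))
```

With the small misses, MAE is 0.5 and RMSE about 0.71. Add the single large miss and MAE rises to 3.0 while RMSE jumps to about 5.05, because the squared term lets one error dominate. R² also falls from 0.9 to below zero, which is what happens whenever predictions do worse than simply guessing the mean.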

The confusion matrix: why accuracy lies

For classification, the temptation is to report accuracy — the fraction correct. On imbalanced data, accuracy is dangerously misleading: if 99% of cases are negative, a model that always says "negative" scores 99% accuracy and catches nothing. That same base-rate trap haunts fraud, disease, and anomaly detection alike.

The honest starting point is the confusion matrix, which splits predictions into four cells: true positives, true negatives, false positives (false alarms) and false negatives (misses). Almost every useful metric is built from these four, and the key realisation is that a false positive and a false negative usually have very different costs — so you need metrics that tell them apart.
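A short sketch of the four cells, applied to the always-"negative" baseline described above. The labels are invented (10 positives in 1,000 cases, coded 1 for positive and 0 for negative) and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded 1 = positive, 0 = negative."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1      # correctly flagged
        elif pred == 1 and truth == 0:
            fp += 1      # false alarm
        elif pred == 0 and truth == 1:
            fn += 1      # miss
        else:
            tn += 1      # correctly left alone
    return tp, fp, fn, tn


# 1% positives, and a "model" that always predicts negative.
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, always_negative)
accuracy = (tp + tn) / len(y_true)
print(f"accuracy={accuracy:.2f}  tp={tp}  fp={fp}  fn={fn}  tn={tn}")
```

Accuracy comes out at 0.99, yet the matrix shows zero true positives and ten misses: the model has caught nothing.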

Precision, recall & the ROC curve

The two metrics that matter most pull in different directions:

Precision — of everything flagged positive, how much really was? (Punishes false alarms.) $\text{TP} / (\text{TP} + \text{FP})$.
Recall — of everything that truly was positive, how much did you catch? (Punishes misses.) $\text{TP} / (\text{TP} + \text{FN})$.

There's a tug-of-war between them: flag more aggressively and recall rises but precision falls, and vice versa. The F1 score — their harmonic mean — summarises the balance in one number:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
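The precision, recall, and F1 definitions translate almost line for line into code. The counts in the example are invented, and returning 0.0 when a denominator is zero is one common convention that is assumed here rather than taken from the text.

```python
def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 here when nothing was flagged."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 here when there are no positives."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts: 8 correct flags, 2 false alarms, 8 misses.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
score = f1(tp=8, fp=2, fn=8)
print(f"precision={p:.2f} recall={r:.2f} F1={score:.3f}")
```

Precision is 0.8 and recall 0.5, giving an F1 of about 0.615. That sits below the arithmetic mean of 0.65, because the harmonic mean is pulled towards the weaker of the two.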

Most classifiers output a probability, and where you set the threshold decides the precision/recall balance. The ROC curve plots the true-positive rate against the false-positive rate across all thresholds, and the AUC (area under it) summarises the model's ranking ability in a single threshold-free number — 0.5 is random, 1.0 is perfect. For heavily imbalanced problems the precision-recall curve is often more informative than ROC. The lesson throughout: choose the metric that matches the real cost of being wrong, not whatever looks highest.
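Both ideas, moving the threshold and summarising ranking with AUC, can be sketched on a handful of made-up scores. The AUC here is computed from its ranking interpretation (the chance that a randomly chosen positive scores higher than a randomly chosen negative, ties counting half) rather than by integrating the curve; the pairwise loop is fine for illustration but slow on large data.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged positive."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    tp = sum(flagged)
    total_pos = sum(labels)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / total_pos if total_pos else 0.0
    return precision, recall


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical model scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

for threshold in (0.5, 0.75):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
print(f"AUC={auc(scores, labels):.2f}")
```

Raising the threshold from 0.5 to 0.75 lifts precision from 0.60 to 1.00 and drops recall from 0.75 to 0.50, while the AUC of 0.75 stays the same because it does not depend on any threshold.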

Honest probabilities: calibration

One dimension that's easy to forget: a model can rank cases perfectly (great AUC) while its probabilities are dishonest. Calibration asks a different question — when the model says "70% likely", does it actually happen about 70% of the time?

This matters enormously whenever the probability itself drives a decision — a risk score, an expected cost, a threshold for action. A confidently miscalibrated model (saying 95% when it's really 60%) leads to bad calls even if its ranking is fine. It's checked with a reliability diagram and fixed with methods like Platt scaling or isotonic regression — and it's the part of evaluation people most often skip.
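The numbers behind a reliability diagram are just a binned comparison of predicted probability against observed frequency. The sketch below builds that table by hand; the function name, the equal-width binning, and the toy predictions are all assumptions made for illustration, and it covers only the check, not a fix such as Platt scaling or isotonic regression.

```python
def reliability_table(probs, outcomes, n_bins=10):
    """Group predictions into equal-width probability bins and return
    (mean predicted probability, observed positive rate, count)
    for each non-empty bin, lowest bin first."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[index].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            table.append((mean_pred, observed, len(members)))
    return table


# Hypothetical predictions: ten cases scored 0.2 (2 turn out positive)
# and ten cases scored 0.9 (only 6 turn out positive).
probs = [0.2] * 10 + [0.9] * 10
outcomes = [1, 1] + [0] * 8 + [1] * 6 + [0] * 4

table = reliability_table(probs, outcomes)
for mean_pred, observed, count in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

The low bin is well calibrated (predicted 0.2, observed 0.2), while the high bin is overconfident (predicted 0.9, observed 0.6). Plotting observed against predicted for each bin gives the reliability diagram; a calibrated model sits on the diagonal.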

Where it shows up in my work

Trusting — and defending — a model's score

When a model's performance has to be reported or acted on, this is where I make sure the number is real. The discipline that earns its keep daily: never trust training accuracy, keep the test set sacred (a tuned-on-test score is the failure that looks like success), and above all pick the metric that matches the cost — accuracy is meaningless on the imbalanced problems that dominate intelligence and integrity work, where a missed case and a false alarm carry very different prices.

It's also a critical-reading tool: when someone reports a model is "95% accurate", the right questions are "accurate on which split?" and "is the data imbalanced?" Knowing the difference between precision, recall, AUC, and calibration is what lets me tell a genuinely good model from a flattering one, and defend the distinction. It ties straight to honest evaluation and inference across this section.

Refresh in 60 seconds

The sacred-test-set principle, stratified cross-validation, the accuracy-on-imbalance trap, and the often-skipped calibration step reflect current model-evaluation references alongside coursework.