MLOps & Model Monitoring

Training a model that scores well is the part everyone learns. Getting it into the real world and keeping it useful is the part that decides whether any of that effort matters — and it's far harder. A model isn't a finished artifact like a report; it's a living thing whose accuracy decays over time as the world it was trained on drifts away. MLOps (machine-learning operations) is the discipline of deploying, monitoring, and maintaining models in production so they keep doing their job.

This is the deployed-model companion to the reproducibility page: that one is about making the analysis re-runnable; this is about keeping a live model trustworthy after it ships. It matters anywhere a model informs ongoing decisions rather than a one-off answer — and the central, easily-missed truth is that deployment is the start of the work, not the end.

The last-mile gap

A widely repeated observation in the field is sobering: a large share of models that get built never make it into production at all. The gap between "it works in my notebook" and "it runs reliably, serves real users, and stays accurate" is enormous, and it's mostly engineering and operations rather than modelling. MLOps is the set of practices, borrowed from software's DevOps, that closes that gap.

The mindset shift is the important part: a deployed model is a system to be operated, not a result to be filed. It needs versioning, testing, monitoring, and a plan for the day its performance slips — because that day is coming.

The model lifecycle is a loop

The defining idea of MLOps is that a model's life isn't a line ending at deployment — it's a loop: train, deploy, monitor, and (when it decays) retrain, around and around. Deployment isn't the finish; it's one station on a cycle that keeps turning for as long as the model is in use.

The MLOps loop. Train → deploy → monitor; when monitoring detects drift or decay, retrain and redeploy. Unlike a one-off analysis, a live model runs this cycle continuously; the monitor is what triggers the next turn.

Ways to deploy

Getting a model to where it can make predictions takes a few common shapes, and the choice depends on how the predictions are used:

Batch — run the model on a schedule over a pile of data (score every case overnight). Simple and robust; fine when predictions aren't needed instantly.
Real-time / API — wrap the model in a service that answers one request at a time, on demand. Needed when a decision happens live, but more moving parts.
Shadow deployment — run a new model alongside the old one, comparing its predictions without acting on them, to build confidence before the switch.
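
As a rough illustration of the shadow pattern, the sketch below wraps two models so that only the live model's answer is returned while the candidate's answer is logged for comparison. The names (ShadowRouter, live_model, candidate_model) and the threshold rules are hypothetical stand-ins, not a real serving framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: a feature vector in, a class label out.
Model = Callable[[Sequence[float]], int]


@dataclass
class ShadowRouter:
    """Return the live model's answer; record the candidate's for comparison."""
    live: Model
    candidate: Model
    log: List[Tuple[int, int]] = field(default_factory=list)

    def predict(self, features: Sequence[float]) -> int:
        live_pred = self.live(features)
        try:
            # The candidate sees the same input, but its output is never returned.
            shadow_pred = self.candidate(features)
            self.log.append((live_pred, shadow_pred))
        except Exception:
            # A failing candidate must not affect the caller.
            pass
        return live_pred

    def agreement_rate(self) -> float:
        """Share of logged requests where both models gave the same answer."""
        if not self.log:
            return float("nan")
        return sum(a == b for a, b in self.log) / len(self.log)


# Two toy threshold models standing in for real ones (assumed for illustration).
def live_model(x: Sequence[float]) -> int:
    return int(x[0] > 0.5)


def candidate_model(x: Sequence[float]) -> int:
    return int(x[0] > 0.6)


router = ShadowRouter(live=live_model, candidate=candidate_model)
served = [router.predict([v]) for v in (0.1, 0.55, 0.7, 0.9)]
print(served, router.agreement_rate())
```

Callers only ever see the live model's output; the agreement rate (and, once labels arrive, each model's accuracy on the logged requests) is the evidence used to decide on the switch.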

Why models rot: drift

Here's the fact that makes monitoring non-negotiable: a model's accuracy decays over time, even though the model itself never changes. It was trained on a snapshot of the world, and the world moves on. This is drift, and it comes in two flavours worth telling apart:

Data drift — the input distribution shifts. New kinds of customers, a changed process, a different season — the data flowing in no longer looks like the training data, even if the underlying relationships hold.
Concept drift — the relationship between inputs and the target changes. What predicted fraud last year doesn't this year because the fraudsters adapted. The rules of the game itself have moved, which is the more dangerous kind.

Both erode performance silently, and neither shows up unless you're watching for it. A model that was excellent at launch can be worthless a year later. This connects directly to the model-staleness warning from the time-series page and the evolving-target problem from anomaly detection.
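
The two kinds of drift can be told apart in a toy example. The sketch below freezes a one-feature model and scores it on three hand-built datasets; all data and the threshold rule are invented for illustration. It is deliberately simplified: the frozen model matches the true rule everywhere, so data drift alone costs nothing here, whereas in practice shifted inputs often land where the model learned poorly.

```python
from statistics import mean


def model(x: float) -> int:
    """A frozen model: the rule it learned at training time (hypothetical)."""
    return int(x > 0.5)


def accuracy(xs, ys) -> float:
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


# Training-time world: inputs spread over 0.0 .. 0.9, true rule is x > 0.5.
train_x = [i / 10 for i in range(10)]
train_y = [int(x > 0.5) for x in train_x]

# Data drift: the inputs move upward (0.5 .. 0.95), the true rule is unchanged.
data_drift_x = [0.5 + i / 20 for i in range(10)]
data_drift_y = [int(x > 0.5) for x in data_drift_x]

# Concept drift: the inputs look the same, but the true rule is now x > 0.7.
concept_x = train_x
concept_y = [int(x > 0.7) for x in concept_x]

print(mean(train_x), accuracy(train_x, train_y))                 # baseline
print(mean(data_drift_x), accuracy(data_drift_x, data_drift_y))  # inputs moved
print(mean(concept_x), accuracy(concept_x, concept_y))           # rule moved
```

Under data drift the input mean moves while accuracy holds; under concept drift the inputs are identical to training and only the outcome comparison reveals the drop to 0.8. That asymmetry is why input monitoring alone cannot catch concept drift.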

What to monitor

Monitoring an ML system means watching more than whether the server is up. The layers, from easiest to most valuable:

Operational health — latency, errors, uptime. Standard software monitoring; necessary but not sufficient.
Input distributions — watch the incoming features for data drift. This is the earliest warning, available immediately, before you even know if predictions went wrong.
Predictions — track the distribution of what the model outputs; a sudden shift is a red flag.
Outcomes — the gold standard: compare predictions to what actually happened. The catch is label lag — the truth often arrives weeks or months later (did the flagged case really turn out to be fraud?), so accuracy can only be confirmed in arrears.
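
One common way to put a number on the input-distribution layer is the Population Stability Index (PSI), which compares binned shares of a live sample against a reference sample. PSI is not named above; it is swapped in here as one reasonable choice among several. The sketch is a minimal stdlib version with equal-width bins taken from the reference range, and the sample data is synthetic.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference feature

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Synthetic feature values: a reference sample and a live sample shifted upward.
reference = [i / 100 for i in range(100)]
shifted = [0.3 + v for v in reference]

print(psi(reference, reference))  # identical samples give 0.0
print(psi(reference, shifted))    # a clear shift gives a large value
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift; those cut-offs are a convention, not something derived here, and they are worth tuning per feature.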

Training-serving skew

A subtle, common production bug: training-serving skew — the data the model sees in production is processed differently from the data it trained on. A feature computed one way in the training notebook and another way in the live service means the model is, in effect, being fed inputs it never learned from, and it underperforms for reasons that have nothing to do with the model itself.

The standard defence is a feature store — a single, shared definition of each feature used identically for both training and serving, so the two can't drift apart. It's the production cousin of the leakage and reproducibility disciplines: the same transformation, applied the same way, every time.
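
The core idea can be shown without any feature-store product: define each feature once and call that one definition from both paths. The sketch below assumes hypothetical raw fields (amount, items) and function names; a real feature store adds storage, versioning, and point-in-time lookups on top of this principle.

```python
import math
from typing import Dict, List


def build_features(raw: Dict[str, float]) -> Dict[str, float]:
    """Single source of truth for feature logic (field names are hypothetical)."""
    return {
        "log_amount": math.log1p(raw["amount"]),
        "amount_per_item": raw["amount"] / max(raw["items"], 1),
    }


def training_rows(history: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Offline path: build the training matrix from historical records."""
    return [build_features(r) for r in history]


def serve_features(request: Dict[str, float]) -> Dict[str, float]:
    """Online path: build features for one live request."""
    return build_features(request)


record = {"amount": 120.0, "items": 3}
# The same record yields identical features on both paths, by construction.
print(training_rows([record])[0] == serve_features(record))
```

Because neither path contains feature logic of its own, a change to a feature definition reaches training and serving together. A cheap extra guard is a test that replays logged production requests through the training path and checks the results match.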

When to retrain

The answer to drift is retraining on fresh data. The question is when. Two strategies, often combined:

Scheduled — retrain on a fixed cadence (monthly, quarterly). Simple and predictable, but may retrain needlessly or too late.
Triggered — retrain when monitoring detects drift or a performance drop crossing a threshold. More responsive, and the direction modern MLOps favours — the monitor itself decides when the next turn of the loop begins.
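
The two strategies can be combined in one small decision function: monitoring signals are checked first, and the schedule acts as a backstop. The sketch below is one possible shape under assumed thresholds (90 days, a drift score of 0.25, accuracy of 0.80); the class name, the parameters, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RetrainPolicy:
    max_age: timedelta = timedelta(days=90)  # scheduled backstop (assumed)
    drift_threshold: float = 0.25            # input drift score limit (assumed)
    min_accuracy: float = 0.80               # floor on confirmed accuracy (assumed)

    def reason_to_retrain(
        self,
        last_trained: date,
        today: date,
        drift_score: float,
        recent_accuracy: Optional[float],
    ) -> Optional[str]:
        """Return why a retrain is due, or None if the model can stay as is."""
        if drift_score > self.drift_threshold:
            return "input drift"
        # Accuracy may be unknown because of label lag; skip the check if so.
        if recent_accuracy is not None and recent_accuracy < self.min_accuracy:
            return "accuracy drop"
        if today - last_trained >= self.max_age:
            return "scheduled"
        return None


policy = RetrainPolicy()
trained = date(2024, 1, 1)
print(policy.reason_to_retrain(trained, date(2024, 2, 1), 0.05, 0.90))
print(policy.reason_to_retrain(trained, date(2024, 2, 1), 0.40, None))
print(policy.reason_to_retrain(trained, date(2024, 6, 1), 0.05, None))
```

Returning a reason rather than a bare boolean keeps the decision auditable, which matters when a person has to approve the retrain before it ships.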

Where it shows up in my work

Keeping a deployed model honest

Any analytical model that informs ongoing decisions — rather than answering a question once — lives or dies on this. In a government setting that makes monitoring a matter of trustworthiness, not just engineering hygiene: a model quietly drifting out of accuracy is making worse and worse calls while still looking authoritative, and the only defence is watching the inputs and outcomes deliberately. The data-vs-concept drift distinction tells me whether the inputs have shifted or the world's rules have, which points to different fixes.

It's the operational bookend to the rest of this section: the evaluation that proved the model good at launch has to be re-run as it ages, the reproducible pipeline is what makes a clean retrain possible, and a human in the loop with a rollback path keeps the automation accountable. A model you deploy and forget is a liability waiting to surface.

The data-vs-concept-drift distinction, input-distribution monitoring, training-serving skew, and triggered-retraining practice reflect current MLOps references alongside hands-on work.