Ensemble Methods & Gradient Boosting

There's a striking result at the heart of practical machine learning: you can take a pile of mediocre models — each barely better than guessing — combine them cleverly, and end up with one of the most accurate predictors available. This is ensemble learning, and it's not a niche trick. For the structured, tabular data that most real-world analysis runs on, ensemble methods like random forests and gradient boosting are the reigning champions — they win the competitions and quietly power a great deal of production modelling.

This page builds the idea from the ground up: why a crowd of models beats an individual, the two great strategies for building that crowd (bagging and boosting), and how gradient boosting — XGBoost and its kin — became the default first thing to try on tabular data. It builds directly on the bias-variance ideas from the machine-learning page.

Why many weak models beat one strong one

The intuition is the wisdom of crowds. Ask one person to guess the number of jellybeans in a jar and they'll be off; average a thousand guesses and the answer is uncannily close — the individual errors, being partly random and independent, cancel out. Ensemble learning does exactly this with models: combine many predictors whose errors are decorrelated, and the mistakes average away while the shared signal reinforces.

The crucial word is decorrelated. Averaging a thousand identical models gains you nothing — they all make the same mistake. The whole art of ensembling is building models that are individually decent but differ from each other, so their errors don't line up. The two families below are two different answers to "how do we make them differ?"
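A small simulation makes the point concrete. It is a sketch under stated assumptions rather than anything from the original page: each model's error is modelled as a shared component plus an independent one, mixed so that every pair of models has correlation `rho`. The names `rho`, `n_models` and `variance_of_average` are illustrative. For unit-variance errors the variance of the average is $\rho + (1 - \rho)/n$, so only the independent part shrinks as models are added.

```python
import math
import random
from statistics import fmean, pvariance

def variance_of_average(rho, n_models=20, trials=5000, seed=0):
    """Estimate the variance of the mean error of n_models unit-variance
    models whose errors have pairwise correlation rho."""
    rng = random.Random(seed)
    shared_w, own_w = math.sqrt(rho), math.sqrt(1.0 - rho)
    averages = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # the mistake every model makes together
        errors = [shared_w * shared + own_w * rng.gauss(0.0, 1.0)
                  for _ in range(n_models)]
        averages.append(fmean(errors))
    return pvariance(averages)

var_independent = variance_of_average(rho=0.0)  # theory: 1/20 = 0.05
var_half = variance_of_average(rho=0.5)         # theory: 0.5 + 0.5/20 = 0.525
var_identical = variance_of_average(rho=1.0)    # theory: 1.0, no gain at all
print(var_independent, var_half, var_identical)
```

With `rho=1.0` the twenty models behave like one; with `rho=0.0` the variance falls by roughly a factor of twenty.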

The tree, and its useful flaw

Nearly all the famous ensembles are built from decision trees: flowcharts of yes/no splits ("is age > 40? then is income > 50k?...") that carve the data into regions and predict within each. A single tree is wonderfully interpretable and handles mixed data types without fuss.
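Written out as code, a tree of this kind is just nested conditionals. The sketch below hard-codes the two splits from the example; the function name and the region labels are made up for illustration, and a real tree would learn its thresholds from data.

```python
def predict_region(age, income):
    """A hand-written depth-2 decision tree; thresholds follow the example
    in the text and the returned labels are hypothetical."""
    if age > 40:
        if income > 50_000:
            return "region A"   # older, higher income
        return "region B"       # older, lower income
    return "region C"           # everyone aged 40 or under

print(predict_region(45, 60_000))
```

Each return statement corresponds to one region of the feature space, and every input lands in exactly one of them.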

But a single deep tree is a textbook high-variance model: it overfits badly, memorising the training data's noise, and a tiny change in the data produces a completely different tree. That instability looks like a weakness — and it's exactly what makes trees the perfect ensemble ingredient. A model that varies a lot from sample to sample is one you can average to great effect. The ensemble turns the tree's flaw into its strength.

Bagging & random forests

Bagging (bootstrap aggregating) is the first strategy: train many trees in parallel, each on a different random bootstrap sample of the data (rows drawn with replacement), then average their predictions (or take a majority vote for classification). Because each tree sees slightly different data, each overfits differently, and averaging those varied overfittings cancels much of the noise, sharply cutting variance while adding little or no bias.
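The effect can be seen in a minimal sketch using only the Python standard library. Everything here is an assumption made for illustration: the toy data (a noisy sine curve), the fully grown one-dimensional regression tree, and the choice of 50 trees. It is not a production implementation.

```python
import math
import random
from statistics import fmean

def fit_tree(pts):
    """Grow a full-depth 1-D regression tree on (x, y) pairs sorted by x."""
    if pts[0][0] == pts[-1][0]:              # one distinct x left: make a leaf
        return fmean([y for _, y in pts])
    total, left_sum = sum(y for _, y in pts), 0.0
    best_score, best_i = -1.0, None
    for i in range(1, len(pts)):
        left_sum += pts[i - 1][1]
        if pts[i - 1][0] == pts[i][0]:       # cannot split between equal x
            continue
        # Maximising this score is equivalent to minimising squared error.
        score = left_sum ** 2 / i + (total - left_sum) ** 2 / (len(pts) - i)
        if score > best_score:
            best_score, best_i = score, i
    threshold = (pts[best_i - 1][0] + pts[best_i][0]) / 2
    return (threshold, fit_tree(pts[:best_i]), fit_tree(pts[best_i:]))

def predict(tree, x):
    while isinstance(tree, tuple):           # internal node: follow the split
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

rng = random.Random(0)
train = sorted((x, math.sin(x) + rng.gauss(0.0, 0.5))
               for x in (rng.uniform(0.0, 6.0) for _ in range(200)))
grid = [6.0 * i / 199 for i in range(200)]

def mse_vs_truth(predict_fn):
    """Mean squared error against the noise-free curve on an even grid."""
    return fmean([(predict_fn(x) - math.sin(x)) ** 2 for x in grid])

single = fit_tree(train)
# Bagging: each tree is fit to a bootstrap sample (drawn with replacement).
forest = [fit_tree(sorted(rng.choices(train, k=len(train)))) for _ in range(50)]

single_mse = mse_vs_truth(lambda x: predict(single, x))
bagged_mse = mse_vs_truth(lambda x: fmean([predict(t, x) for t in forest]))
print(f"single tree: {single_mse:.3f}   bagged: {bagged_mse:.3f}")
```

The single tree memorises the noise in every training point; the bagged average smooths over it, which is where the drop in error comes from.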

The random forest adds one brilliant twist: at each split, each tree may only consider a random subset of the features. This stops every tree from leaning on the same one or two dominant predictors, forcing them to be genuinely different — more decorrelation, better averaging. Random forests are robust, need little tuning, give a free accuracy estimate (the out-of-bag error from data each tree didn't see), and report useful feature importance. They're the reliable, low-drama default.
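Both ingredients are simple to sketch. The snippet below draws one bootstrap sample, measures how many rows were left out of it (the out-of-bag rows), and picks a random feature subset for a single split. The row and feature counts are arbitrary, and using the square root of the feature count as the subset size is a common heuristic assumed here, not something the text specifies.

```python
import math
import random

rng = random.Random(0)
n_rows, n_features = 1000, 16

# Bootstrap sample: draw n_rows row indices with replacement.
bootstrap = [rng.randrange(n_rows) for _ in range(n_rows)]

# Rows never drawn are "out of bag" for this tree and can act as its test set.
out_of_bag = set(range(n_rows)) - set(bootstrap)
oob_fraction = len(out_of_bag) / n_rows    # close to 1/e, roughly 0.37

# At one split, only a random subset of features may be considered.
k = round(math.sqrt(n_features))           # assumed heuristic: sqrt(n_features)
candidate_features = rng.sample(range(n_features), k)

print(f"out-of-bag fraction: {oob_fraction:.3f}")
print(f"features considered at this split: {sorted(candidate_features)}")
```

Because each tree leaves out roughly a third of the rows, every row has many trees that never saw it, which is what makes the out-of-bag error estimate possible without a separate validation set.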

Boosting: learning from mistakes, in sequence

Boosting takes the opposite approach. Instead of independent parallel trees, it builds them sequentially, each one focused on the mistakes of the ones before. Train a weak tree; see where it errs; train the next tree to fix those errors; repeat. The ensemble grows by relentlessly attacking its own remaining weaknesses.

The two strategies. Bagging trains many trees in parallel on different samples and averages them — cutting variance. Boosting trains trees in sequence, each correcting the last's errors — cutting bias. Parallel independence vs sequential correction.

The original AdaBoost did this by re-weighting: misclassified points get more weight, so the next tree pays them more attention. Where bagging attacks variance, boosting attacks bias — it turns a sequence of weak learners into a single strong one by systematic error-correction.
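One round of that re-weighting can be sketched as follows. The update shown is the standard discrete AdaBoost rule for labels in {-1, +1}; the formula is not given in the text above, and the function name and the five-point toy data are made up for illustration.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of discrete AdaBoost re-weighting for labels in {-1, +1}.
    Assumes the weighted error is strictly between 0 and 1."""
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p) / sum(weights)
    alpha = 0.5 * math.log((1.0 - err) / err)   # how much say this learner gets
    # y * p is +1 when correct and -1 when wrong, so wrong points are scaled up.
    new = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]

y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, -1, -1, +1]    # the weak learner misses the second point
alpha, weights = adaboost_reweight([0.2] * 5, y_true, y_pred)
print(alpha, weights)
```

After the update the single misclassified point carries half of the total weight, so the next tree cannot do well without getting it right.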

Gradient boosting & XGBoost

Gradient boosting is the powerful, general form of the idea. Rather than re-weighting points, each new tree $h_m$ is trained to predict the residuals (the errors) of the ensemble so far, $F_{m-1}$. Add that tree's correction, shrunk by a learning rate $\eta$, and the predictions improve a step:

$$F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$$

The name comes from the insight that fitting the residuals is really doing gradient descent — each tree is a step down the gradient of the loss, in function space. It's the optimisation idea from the calculus page, applied to building an ensemble.
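For squared-error loss $L = \tfrac{1}{2}(y - F)^2$, the negative gradient $-\partial L / \partial F = y - F$ is exactly the residual, which is why "fit the residuals" and "step down the gradient" coincide in that case. The sketch below implements this squared-error case with one-split trees (stumps) in plain Python. The toy data, the stump learner, and the settings `n_rounds=100` and `eta=0.1` are assumptions chosen for illustration; libraries such as XGBoost do considerably more.

```python
import math
import random
from statistics import fmean

def fit_stump(xs, targets):
    """Least-squares stump on 1-D inputs: one threshold, two leaf values.
    Assumes xs contains at least two distinct values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    n, total, left_sum = len(order), sum(targets), 0.0
    best = (-1.0, None, 0.0, 0.0)   # (score, threshold, left value, right value)
    for rank in range(1, n):
        left_sum += targets[order[rank - 1]]
        lo, hi = xs[order[rank - 1]], xs[order[rank]]
        if lo == hi:                # cannot split between equal inputs
            continue
        right_sum = total - left_sum
        score = left_sum ** 2 / rank + right_sum ** 2 / (n - rank)
        if score > best[0]:
            best = (score, (lo + hi) / 2, left_sum / rank, right_sum / (n - rank))
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value

def gradient_boost(xs, ys, n_rounds=100, eta=0.1):
    """Squared-error gradient boosting: F_m = F_{m-1} + eta * h_m."""
    base = fmean(ys)                # F_0: a constant prediction
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # negative gradient
        history.append(fmean([r * r for r in residuals]))  # MSE before this round
        h = fit_stump(xs, residuals)
        stumps.append(h)
        preds = [p + eta * h(x) for p, x in zip(preds, xs)]
    model = lambda x: base + eta * sum(h(x) for h in stumps)
    return model, history

rng = random.Random(0)
xs = [rng.uniform(0.0, 6.0) for _ in range(200)]
ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
model, history = gradient_boost(xs, ys)
final_mse = fmean([(y - model(x)) ** 2 for x, y in zip(xs, ys)])
print(f"training MSE: {history[0]:.3f} -> {final_mse:.3f}")
```

The training error never rises from one round to the next, which is the flip side of boosting's strength: left unchecked, it keeps fitting until it is fitting noise.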

XGBoost and LightGBM are the engineered, industrial-strength implementations that made gradient boosting dominate. They add regularisation to curb overfitting, clever handling of missing values, and serious speed optimisations. On structured/tabular data they remain, year after year, the model to beat — often the first thing a practitioner reaches for and frequently the last, because little else outperforms them there.

Bagging vs boosting: which when

The two strategies have complementary characters, and the choice follows from what's wrong:

Bagging / random forests — parallel, reduces variance. Robust, hard to overfit, minimal tuning, parallelisable. The safe, strong baseline.
Boosting / XGBoost — sequential, reduces bias. Usually higher accuracy when tuned well, but more sensitive — it can overfit, needs careful tuning (learning rate, tree depth, early stopping), and can't be parallelised the same way.

The honest costs

Ensembles aren't free wins. The trade-offs you accept:

Interpretability: a single tree is a readable flowchart; an ensemble of 500 trees, bagged or boosted, is a black box. You buy accuracy with opacity, which matters anywhere a decision must be explained.
Boosting can overfit: its relentless error-chasing will eventually fit noise. Cross-validation and early stopping are not optional.
Cost: training and serving hundreds of trees is heavier than one model.
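Early stopping, mentioned in the list above, amounts to a simple rule over validation losses: keep adding trees while the held-out loss improves, and stop once it has failed to improve for a set number of rounds. The sketch below is a generic illustration; the function name, the `patience` parameter and the loss values are all made up, and real libraries expose their own versions of this rule.

```python
def best_round_with_patience(val_losses, patience=3):
    """Return (best_round, best_loss), scanning rounds until the validation
    loss has gone `patience` rounds without beating its best value."""
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            break                   # stop adding trees; keep the best round
    return best_round, best_loss

# Made-up validation losses: improvement stalls after round 2.
val_losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.74, 0.90]
best_round, best_loss = best_round_with_patience(val_losses)
print(best_round, best_loss)   # 2 0.7
```

The ensemble is then truncated to the trees built up to the best round, so the rounds that only chased noise are discarded.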

The partial answer to opacity is explainability tooling — SHAP values and the like — which attribute each prediction back to its features. Useful, but a reconstruction after the fact, not the genuine transparency of a simple model. When the explanation matters as much as the answer, that trade-off has to be weighed honestly.

Where it shows up in my work

The default for structured prediction

For the tabular, structured data that most analytical work runs on, ensembles are simply the best tool — so when a prediction problem lands on my desk, a random forest is the strong baseline and gradient boosting the accuracy ceiling. Knowing why they work (decorrelated errors; variance vs bias) is what lets me pick the right one and tune it sensibly rather than turning knobs at random.

But the interpretability cost is exactly the consideration that matters most in a government setting, where a decision often has to be explained and defended, not just made accurately. That's the live tension — a boosted model might be more accurate while a simpler one is more defensible — and naming it honestly (with SHAP to narrow the gap, and proper validation to trust the accuracy) is the real skill. It ties straight to the "when not to go deep" judgement: pick the model the problem actually needs.

Refresh in 60 seconds

Combine many decorrelated weak models and their errors cancel — wisdom of crowds. Decorrelation is everything.
Decision trees are the base learner — a single one overfits (high variance), which is exactly what makes it a great ensemble ingredient.
Bagging → random forests: parallel trees on bootstrap samples + random feature subsets, averaged. Cuts variance; robust, low-tuning, out-of-bag error + feature importance.
Boosting → XGBoost/LightGBM: sequential trees each fixing the last's errors; gradient boosting fits the residuals ($F_m = F_{m-1} + \eta h_m$). Cuts bias; the tabular champion.
Forest = strong with little fuss; boosting = max accuracy with tuning (and it can overfit — cross-validate, early-stop).
The cost is interpretability (a black box; SHAP helps) and compute — weigh it where a decision must be defended.

The bagging-vs-boosting framing, gradient-boosting-as-residual-fitting, and XGBoost's regularisation/early-stopping practice reflect current ensemble-learning references alongside ML coursework.