Bagging
Definition
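Bagging is short for bootstrap aggregating.
It is an ensemble method in which many models are trained on different bootstrap samples of the training data and their predictions are then combined: averaged for regression, or decided by majority vote for classification.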
Bootstrap Samples
A bootstrap sample is created by sampling from the training data with replacement.
If the original dataset has $n$ observations, each bootstrap sample also usually has $n$ observations, but some rows appear multiple times and some are left out.
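As a minimal sketch of the sampling step, the snippet below draws one bootstrap sample of row indices using only the Python standard library. The function name `bootstrap_indices`, the seed, and the dataset size `n = 1000` are illustrative assumptions, not part of these notes.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices uniformly at random, with replacement."""
    return rng.choices(range(n), k=n)


rng = random.Random(0)  # fixed seed so the sketch is repeatable (assumption)
n = 1000                # hypothetical number of training rows

idx = bootstrap_indices(n, rng)

in_bag = set(idx)                    # rows drawn at least once
out_of_bag = set(range(n)) - in_bag  # rows never drawn
repeats = n - len(in_bag)            # draws that repeated an earlier row

print(len(idx), len(in_bag), len(out_of_bag), repeats)
```

For large $n$, the expected share of distinct rows in a bootstrap sample is $1 - (1 - 1/n)^n \approx 1 - 1/e$, or roughly 63%, so about a third of the rows are typically left out of any one sample.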
Regression Prediction
For regression, if we train $B$ models, bagging predicts:
\[\hat f_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\]
where $\hat f_b(x)$ is the prediction from model $b$.
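The averaging formula can be sketched from scratch as follows. To keep the example short, the base model here is a one-split regression tree (a stump) on a single feature, which is a simplification chosen for illustration. The helper names, the synthetic step-shaped data, and the choice `B = 50` are all assumptions.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D inputs.

    Returns (threshold, left_mean, right_mean) minimising squared error.
    """
    best = None
    # Skipping the smallest value guarantees both sides are non-empty.
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: fall back to the overall mean
        m = sum(ys) / len(ys)
        return (xs[0], m, m)
    return best[1:]


def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x < t else rm


def fit_bagged(xs, ys, B, rng):
    """Train B stumps, each on its own bootstrap sample of the rows."""
    n = len(xs)
    models = []
    for _ in range(B):
        idx = rng.choices(range(n), k=n)  # sample rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagged(models, x):
    """Average the B individual predictions, as in the formula above."""
    preds = [predict_stump(m, x) for m in models]
    return sum(preds) / len(preds)


# Synthetic data (assumption): a noisy step from about 2 to about 5 at x = 2.
rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [(2.0 if x < 2.0 else 5.0) + rng.gauss(0, 0.3) for x in xs]

B = 50
models = fit_bagged(xs, ys, B, rng)
print(predict_bagged(models, 0.5), predict_bagged(models, 3.5))
```

Each stump sees a slightly different sample, so the individual thresholds and leaf means differ. `predict_bagged` smooths over those differences by taking the plain mean.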
Main Idea
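Bagging is especially useful for high-variance models, meaning models whose fitted predictions change substantially when the training data change slightly.
Decision trees, particularly deep unpruned ones, have high variance, so bagging is a natural way to improve them.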
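A standard calculation makes the variance argument concrete. As a simplifying assumption (the symbols $\sigma^2$ and $\rho$ are introduced here and do not appear elsewhere in these notes), suppose the $B$ predictions at a point $x$ each have variance $\sigma^2$ and every pair has correlation $\rho$. Then:
\[\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\]
With uncorrelated models ($\rho = 0$) the variance falls to $\sigma^2/B$. As $B$ grows, the second term vanishes and $\rho\sigma^2$ remains, so the gain from averaging is limited by how correlated the models are. Under the same assumption, the expected value of the average equals that of a single model, which is why averaging leaves bias essentially unchanged.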
Relation to Random Forests
Random forests are based on bagging, but add one more idea.
At each tree split, a random forest considers only a random subset of features.
This makes the trees less correlated with one another, so averaging them removes more variance than plain bagging does.
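The difference at a single split can be sketched as below. Only the feature-selection step is shown, not a full tree learner. The function name `candidate_features` and the values `n_features = 9` and `max_features = 3` are hypothetical choices for illustration.

```python
import random


def candidate_features(n_features, max_features, rng):
    """Pick the feature indices that one split is allowed to consider."""
    return sorted(rng.sample(range(n_features), max_features))


rng = random.Random(0)  # fixed seed so the sketch is repeatable (assumption)
n_features = 9          # hypothetical number of input features
max_features = 3        # hypothetical subset size per split

# Plain bagging: every split may search over all features.
bagging_split = list(range(n_features))

# Random-forest style: each split draws a fresh random subset.
forest_splits = [
    candidate_features(n_features, max_features, rng) for _ in range(4)
]

print(bagging_split)
print(forest_splits)
```

Because a strong feature is unavailable at many splits, different trees are pushed toward different structures. The subset size is a tuning parameter; a value near the square root of the number of features is a commonly cited starting point.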
Retail Example
For basket-size prediction, bagging could train many regression trees on different bootstrap samples of invoices.
The final prediction is the average of the tree predictions.
Strengths
- Reduces variance.
- Improves unstable models.
- Simple idea.
- Works especially well with trees.
Weaknesses
- Less interpretable than a single tree.
- Does little to reduce bias: the averaged model has roughly the same bias as a single base model.
- Requires training many models.
Exercises
- What does sampling with replacement mean?
- Why is bagging useful for decision trees?
- How is random forest different from ordinary bagging?