Bias-Variance Tradeoff
Definition
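The bias-variance tradeoff describes two sources of prediction error that tend to move in opposite directions as model flexibility changes.
Bias is the systematic error that comes from a model being too simple to capture the true relationship.
Variance is the error that comes from a model being so sensitive to the particular training sample that its predictions change noticeably from one sample to the next.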
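For squared-error loss the two sources can be written down explicitly. The notation here is an assumption added for illustration: data are generated as \(y = f(x) + \varepsilon\) with zero-mean noise of variance \(\sigma^2\), \(\hat f\) is the fitted model, and expectations are taken over training sets and noise. Under those assumptions the expected error at a point \(x\) splits into squared bias, variance, and a noise term that no model can remove:
\[\mathbb{E}\big[(y - \hat f(x))^2\big] = \big(\mathbb{E}[\hat f(x)] - f(x)\big)^2 + \mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big] + \sigma^2\]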
Bias
A high-bias model underfits.
It misses real structure in the data.
Example: the straight-line model
\[\hat y = \beta_0 + \beta_1 x\]
may be too simple if the true relationship is curved.
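The effect can be seen on a tiny synthetic dataset. The sketch below assumes a noise-free quadratic relationship \(y = x^2\) (chosen only for illustration) and fits the straight line by ordinary least squares using the closed-form slope and intercept.

```python
# Synthetic, noise-free data with a curved relationship (illustrative assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [x ** 2 for x in xs]

# Ordinary least squares for y_hat = beta0 + beta1 * x.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
beta1 = s_xy / s_xx
beta0 = mean_y - beta1 * mean_x

# Residuals: observed value minus fitted value.
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]

print(beta0, beta1)  # -5.0 6.0
print(residuals)     # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals are positive at both ends and negative in the middle. That systematic pattern, rather than random scatter, is the signature of bias: no choice of \(\beta_0\) and \(\beta_1\) can remove it.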
Variance
A high-variance model overfits.
It follows noise in the training data too closely and performs badly on new data.
Very deep decision trees often have high variance.
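One way to see this is to refit the same model on many independent training sets and watch how much its prediction at a fixed point moves. The sketch below is a simulation under stated assumptions, not a reference implementation: the true function (`math.sin`), the noise level, the query point `x0`, and the small hand-written 1-D tree (`fit_tree`, `predict`) are all hypothetical choices made for illustration.

```python
import math
import random


def fit_tree(xs, ys, depth):
    """Greedy 1-D regression tree: a leaf is a float, a split is (threshold, left, right)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    if depth == 0 or pairs[0][0] == pairs[-1][0]:
        return total / n  # leaf predicts the mean of its targets
    best_score, best_i, left_sum = None, None, 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        # Maximising this score is equivalent to minimising squared error.
        score = left_sum ** 2 / i + (total - left_sum) ** 2 / (n - i)
        if best_score is None or score > best_score:
            best_score, best_i = score, i
    threshold = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2
    left, right = pairs[:best_i], pairs[best_i:]
    return (
        threshold,
        fit_tree([x for x, _ in left], [y for _, y in left], depth - 1),
        fit_tree([x for x, _ in right], [y for _, y in right], depth - 1),
    )


def predict(tree, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def sample_training_set(rng, n=30, noise_sd=0.5):
    """Draw a fresh synthetic training set: y = sin(x) + Gaussian noise."""
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(0)
x0 = 1.5  # fixed query point
shallow_preds, deep_preds = [], []
for _ in range(200):  # 200 independent training sets
    xs, ys = sample_training_set(rng)
    shallow_preds.append(predict(fit_tree(xs, ys, depth=1), x0))
    deep_preds.append(predict(fit_tree(xs, ys, depth=len(xs)), x0))

truth = math.sin(x0)
bias_shallow = mean(shallow_preds) - truth
bias_deep = mean(deep_preds) - truth
var_shallow = variance(shallow_preds)
var_deep = variance(deep_preds)
print(f"depth 1   : bias {bias_shallow:+.3f}, variance {var_shallow:.3f}")
print(f"full depth: bias {bias_deep:+.3f}, variance {var_deep:.3f}")
```

In this setup the fully grown tree memorises each training set, so its prediction at `x0` inherits the noise of whichever point lands nearest. The one-split tree averages many points and is far more stable, at the price of a larger bias.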
The Tradeoff
More flexible models usually reduce bias but increase variance.
Less flexible models usually reduce variance but increase bias.
The goal is not maximum flexibility.
The goal is good prediction on unseen data.
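In practice "good prediction on unseen data" is usually estimated by holding out data the model never trained on. The sketch below sweeps tree depth as the flexibility knob and compares training error with held-out error. The synthetic data, the depth grid, and the small 1-D tree (`fit_tree`, `predict`, `mse`) are assumptions made for illustration.

```python
import math
import random


def fit_tree(xs, ys, depth):
    """Greedy 1-D regression tree: a leaf is a float, a split is (threshold, left, right)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    if depth == 0 or pairs[0][0] == pairs[-1][0]:
        return total / n  # leaf predicts the mean of its targets
    best_score, best_i, left_sum = None, None, 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        # Maximising this score is equivalent to minimising squared error.
        score = left_sum ** 2 / i + (total - left_sum) ** 2 / (n - i)
        if best_score is None or score > best_score:
            best_score, best_i = score, i
    threshold = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2
    left, right = pairs[:best_i], pairs[best_i:]
    return (
        threshold,
        fit_tree([x for x, _ in left], [y for _, y in left], depth - 1),
        fit_tree([x for x, _ in right], [y for _, y in right], depth - 1),
    )


def predict(tree, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def mse(tree, xs, ys):
    """Mean squared error of the tree on the given points."""
    return sum((y - predict(tree, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sample(rng, n, noise_sd=0.5):
    """Synthetic data: y = sin(x) + Gaussian noise."""
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


rng = random.Random(1)
train_x, train_y = sample(rng, 40)
val_x, val_y = sample(rng, 200)  # held-out data, never used for fitting

depths = [0, 1, 2, 3, 4, 6, 40]  # 40 is enough to grow the tree fully
train_errors, val_errors = [], []
for d in depths:
    tree = fit_tree(train_x, train_y, d)
    train_errors.append(mse(tree, train_x, train_y))
    val_errors.append(mse(tree, val_x, val_y))
    print(f"depth {d:2d}: train MSE {train_errors[-1]:.3f}, validation MSE {val_errors[-1]:.3f}")

best_depth = depths[val_errors.index(min(val_errors))]
print("depth with lowest validation error:", best_depth)
```

Training error can only fall as depth grows, and reaches zero once the tree is fully grown, so it cannot be used to choose flexibility. The validation column is the one that reflects performance on unseen data.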
Regression Example
For basket-size prediction:
- a straight line may underfit large baskets
- a very flexible tree may overfit rare, unusual baskets
- a random forest or spline may give a better compromise
Relation to Model Choice
Different models typically sit at different points in the tradeoff, although tuning and the data at hand can shift them:
| Model | Bias | Variance |
|---|---|---|
| Linear regression | higher | lower |
| Polynomial regression | medium (falls as degree rises) | medium (rises with degree) |
| Deep regression tree | lower | higher |
| Random forest | lower | lower than one tree |
| Spline | adjustable | adjustable |
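The "lower than one tree" entry for random forests can be checked with a small simulation. The sketch below implements only bagging, that is, averaging fully grown trees fitted to bootstrap resamples, which is the averaging step of a random forest. It omits the random feature selection of a full random forest, which has no effect with a single input variable. The synthetic data, the query point `x0`, the number of trees, and the helper names are assumptions for illustration.

```python
import math
import random


def fit_tree(xs, ys, depth):
    """Greedy 1-D regression tree: a leaf is a float, a split is (threshold, left, right)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    if depth == 0 or pairs[0][0] == pairs[-1][0]:
        return total / n  # leaf predicts the mean of its targets
    best_score, best_i, left_sum = None, None, 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        # Maximising this score is equivalent to minimising squared error.
        score = left_sum ** 2 / i + (total - left_sum) ** 2 / (n - i)
        if best_score is None or score > best_score:
            best_score, best_i = score, i
    threshold = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2
    left, right = pairs[:best_i], pairs[best_i:]
    return (
        threshold,
        fit_tree([x for x, _ in left], [y for _, y in left], depth - 1),
        fit_tree([x for x, _ in right], [y for _, y in right], depth - 1),
    )


def predict(tree, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def bagged_predict(rng, xs, ys, x0, n_trees=25):
    """Average the predictions of deep trees fitted to bootstrap resamples."""
    n = len(xs)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        tree = fit_tree([xs[i] for i in idx], [ys[i] for i in idx], depth=n)
        preds.append(predict(tree, x0))
    return sum(preds) / len(preds)


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(2)
x0 = 1.5  # fixed query point
single_preds, bagged_preds = [], []
for _ in range(200):  # 200 independent synthetic training sets
    xs = [rng.uniform(0.0, 6.0) for _ in range(30)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.5) for x in xs]
    single_preds.append(predict(fit_tree(xs, ys, depth=len(xs)), x0))
    bagged_preds.append(bagged_predict(rng, xs, ys, x0))

var_single = variance(single_preds)
var_bagged = variance(bagged_preds)
print(f"single deep tree : variance {var_single:.3f}")
print(f"25 bagged trees  : variance {var_bagged:.3f}")
```

Each bootstrap tree overfits a slightly different resample, and averaging cancels part of that tree-to-tree fluctuation, so the ensemble's prediction moves less across training sets than a single tree's does.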
Exercises
- What does it mean for a model to underfit?
- What does it mean for a model to overfit?
- Why can a random forest reduce variance compared with one tree?