Random Forests
Definition
A random forest is an ensemble of decision trees.
For regression, it averages the predictions of many regression trees:
\[\hat f(x) = \frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\]
where $B$ is the number of trees.
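As a minimal sketch of this averaging step, assume each fitted tree can be represented by a plain Python function. The three stump-like trees and their thresholds below are made up for illustration.

```python
def forest_predict(trees, x):
    """Average the predictions of all trees at the point x."""
    return sum(tree(x) for tree in trees) / len(trees)

# Three hypothetical one-split regression trees (values are illustrative).
trees = [
    lambda x: 10.0 if x < 5 else 20.0,
    lambda x: 12.0 if x < 4 else 18.0,
    lambda x: 8.0 if x < 6 else 25.0,
]

print(forest_predict(trees, 3))  # (10 + 12 + 8) / 3
print(forest_predict(trees, 7))  # (20 + 18 + 25) / 3
```

Here `len(trees)` plays the role of $B$ and each function plays the role of one $\hat f_b$.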
Main Idea
Random forests improve on single decision trees by using two kinds of randomness:
- bootstrap samples of the data
- random subsets of features at each split
This creates many different trees.
The final prediction is the average of their predictions.
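The two sources of randomness can be sketched with the standard library alone. The helper names `bootstrap_rows` and `split_candidates` and the feature names are hypothetical, chosen only for this illustration. No tree is actually fitted here.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Draw n_rows row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def split_candidates(features, m, rng):
    """Pick m distinct features to consider at one split."""
    return rng.sample(features, m)

rng = random.Random(0)
# Hypothetical feature names, used only for illustration.
features = ["basket_size", "known_items", "frequency", "recency", "month"]

for b in range(3):
    rows = bootstrap_rows(8, rng)                    # randomness 1: rows
    candidates = split_candidates(features, 2, rng)  # randomness 2: features
    print(f"tree {b}: rows={sorted(rows)} split candidates={candidates}")
```

Some row indices repeat and others are missing in each sample, which is what makes the trees differ. In an actual forest a fresh feature subset is drawn at every split, not once per tree.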
Why Random Features Matter
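If every tree could choose from all features at every split, most trees would split on the same few strong predictors and end up very similar.
Restricting each split to a random subset of features makes the trees less correlated.
Averaging less correlated trees reduces the variance of the forest prediction more than averaging similar trees does.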
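One standard way to make this precise is to assume, as a simplification not stated above, that the $B$ tree predictions at a point $x$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$. Under that assumption the variance of their average is:
\[\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2\]
The second term shrinks as $B$ grows, but the first does not, so adding trees alone cannot remove it. Lowering $\rho$ is what the random feature subsets contribute.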
Regression Forest Prediction
Each tree gives a prediction:
\[\hat f_b(x)\]
The forest prediction is:
\[\hat f(x) = \frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\]
Retail Example
A random forest could predict final basket size using:
- current basket size
- number of known items
- customer frequency
- customer recency
- country
- month
- previous average spend
This is useful when basket size depends on nonlinear interactions between customer behavior and basket contents.
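A rough sketch of such a model, assuming scikit-learn and NumPy are available. The data is synthetic and generated on the spot, only three of the listed features are used, and the target formula is an invented interaction, not a finding about real retail data. A categorical feature such as country would need to be encoded numerically first, which this sketch omits.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for three of the features (not real retail data).
current_basket = rng.integers(1, 30, size=n)
frequency = rng.integers(1, 50, size=n)
month = rng.integers(1, 13, size=n)
X = np.column_stack([current_basket, frequency, month])

# Invented target with an interaction: frequent customers add more items.
y = current_basket + 0.2 * current_basket * (frequency > 25) + rng.normal(0, 1, size=n)

forest = RandomForestRegressor(
    n_estimators=200,  # B, the number of trees
    max_features=2,    # features considered at each split
    random_state=0,
)
forest.fit(X[:400], y[:400])          # train on the first 400 rows
predictions = forest.predict(X[400:])  # predict the held-out 100 rows

names = ["current_basket", "frequency", "month"]
for name, importance in zip(names, forest.feature_importances_):
    print(f"{name}: {importance:.2f}")
```

The fitted trees are available as `forest.estimators_`, and averaging their individual predictions reproduces `forest.predict`.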
Strengths
- Usually more accurate than one tree.
- Handles nonlinear relationships.
- Handles interactions automatically.
- More stable than a single tree: small changes in the training data change the predictions less.
- Can estimate feature importance.
Weaknesses
- Less interpretable than a single tree.
- Can still underpredict rare extreme values.
- Has hyperparameters to tune, such as the number of trees and the number of features tried at each split.
- Predictions are averages, so extreme predictions are often damped.
Relation to Basket-Size Prediction
If very large baskets are rare, a random forest may still underestimate them.
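This happens because each tree predicts the average of the training baskets that fall in a leaf, and the forest then averages over trees, which pulls predictions toward more common outcomes.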
This is one reason predicted and actual values may diverge for extreme baskets.
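The damping effect can be illustrated with a toy simulation using only the standard library. The data is synthetic, and each "tree" is a simplified stand-in: a nearest-point lookup on a bootstrap sample, which is roughly how a fully grown tree on a single feature behaves. It is not a full tree-growing algorithm.

```python
import random

random.seed(0)

# Synthetic toy data: x = current basket size, y = final basket size.
xs = list(range(1, 21))
ys = [x + 2 for x in xs]
ys[-1] = 100  # one rare, very large basket at x = 20
data = list(zip(xs, ys))

def fit_tree(sample):
    """Stand-in for a fully grown one-feature tree: predict the target
    of the nearest training point in the bootstrap sample."""
    def predict(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict

B = 200
trees = []
for _ in range(B):
    sample = random.choices(data, k=len(data))  # bootstrap sample
    trees.append(fit_tree(sample))

forest_pred = sum(tree(20) for tree in trees) / B
print(f"actual: 100, forest prediction: {forest_pred:.1f}")
```

Roughly a third of the bootstrap samples do not contain the extreme basket at all, so those trees predict an ordinary value at $x = 20$ and the average falls well below 100.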
Exercises
- Explain why random forests average many trees.
- Why does using random subsets of features help?
- In retail data, why might a random forest underpredict very large baskets?