Cross-Validation
Definition
Cross-validation is a method for estimating how well a model will perform on unseen data.
It repeatedly splits the data into training and validation parts.
Main Idea
Instead of judging a model only on the data used to train it, cross-validation tests the model on data that was held out.
This gives an estimate of generalization error, the expected error on new data drawn from the same distribution.
K-Fold Cross-Validation
In k-fold cross-validation, the data is split into $K$ folds.
The model is trained $K$ times.
Each time, one fold is used for validation and the others are used for training.
The final score is the average of the $K$ validation errors, so every observation is used for validation exactly once.
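The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (`k_fold_indices`, `fit_mean`, `k_fold_cv`), the toy target values, and the "predict the training mean" model are all assumptions chosen to keep the example short, and squared error is used as the per-fold validation error.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_mean(y_train):
    """Toy model (assumption): always predict the mean of the training targets."""
    return sum(y_train) / len(y_train)

def k_fold_cv(y, k=5, seed=0):
    """Return the average validation error and the list of per-fold errors."""
    folds = k_fold_indices(len(y), k, seed)
    fold_errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except fold i.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        prediction = fit_mean([y[j] for j in train_idx])
        # Validate on fold i only.
        mse = sum((y[j] - prediction) ** 2 for j in val_idx) / len(val_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / k, fold_errors

# Made-up target values, for illustration only.
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0, 9.0]
cv_score, fold_errors = k_fold_cv(y, k=5)
print("per-fold errors:", fold_errors)
print("cross-validation score:", cv_score)
```

Shuffling before dealing out the folds is only appropriate when the rows are exchangeable; it is not suitable for time-ordered or grouped data.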
Regression Error
For regression, a common validation error is mean squared error:
\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
Time Series Warning
For time-ordered data, ordinary random cross-validation can leak future information into the past.
For retail transactions, a time-based split is often more realistic:
\[\text{train on past} \rightarrow \text{test on future}\]
Retail Example
For customer return prediction, train on earlier months and test on later months.
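A time-based split of this kind might look like the sketch below. The rows, the field names (`customer`, `date`, `returned`), the cutoff date, and the helper `time_based_split` are all hypothetical and serve only to show the idea: everything before the cutoff is used for training, everything on or after it for testing.

```python
from datetime import date

def time_based_split(rows, cutoff):
    """Rows dated before `cutoff` go to train; rows on or after it go to test."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test

# Made-up records, for illustration only.
rows = [
    {"customer": "A", "date": date(2023, 1, 15), "returned": 0},
    {"customer": "B", "date": date(2023, 2, 3), "returned": 1},
    {"customer": "A", "date": date(2023, 3, 21), "returned": 1},
    {"customer": "C", "date": date(2023, 4, 9), "returned": 0},
    {"customer": "B", "date": date(2023, 5, 17), "returned": 1},
    {"customer": "C", "date": date(2023, 6, 2), "returned": 0},
]

train, test = time_based_split(rows, cutoff=date(2023, 4, 1))
print(len(train), "training rows,", len(test), "test rows")
```

Because every training date precedes every test date, the model never sees information from the period it is evaluated on.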
For basket completion, split by invoice so rows from the same invoice do not appear in both training and test sets.
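Splitting by invoice means assigning whole invoices, not individual rows, to one side of the split. The sketch below assumes each row carries an `invoice` field; the invoice IDs, items, and the helper `split_by_group` are hypothetical.

```python
import random

def split_by_group(rows, group_key, test_fraction=0.25, seed=0):
    """Assign whole groups to train or test so no group appears in both."""
    groups = sorted({r[group_key] for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# Made-up basket rows, for illustration only.
rows = [
    {"invoice": "INV1", "item": "mug"},
    {"invoice": "INV1", "item": "teapot"},
    {"invoice": "INV2", "item": "candle"},
    {"invoice": "INV2", "item": "matches"},
    {"invoice": "INV3", "item": "notebook"},
    {"invoice": "INV3", "item": "pen"},
    {"invoice": "INV4", "item": "lamp"},
    {"invoice": "INV4", "item": "bulb"},
]

train, test = split_by_group(rows, group_key="invoice")
train_invoices = {r["invoice"] for r in train}
test_invoices = {r["invoice"] for r in test}
print("train invoices:", sorted(train_invoices))
print("test invoices:", sorted(test_invoices))
```

A random row-level split could put the mug from one invoice in training and the teapot from the same invoice in test, which would let the model see part of the basket it is asked to complete.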
Strengths
- Gives a more reliable estimate of test performance than training error or a single train/validation split.
- Useful for model selection.
- Helps tune hyperparameters.
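As a rough illustration of the model-selection and tuning uses listed above, the sketch below picks a ridge penalty by comparing cross-validated mean squared error across candidate values. The one-dimensional ridge model without intercept, the data, the candidate grid, and the helper names are all assumptions made for the example.

```python
def fit_ridge_slope(xs, ys, lam):
    """1-D ridge regression without intercept: w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_mse(xs, ys, lam, k):
    """Average validation MSE over k interleaved (deterministic) folds."""
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]
    errors = []
    for val_idx in folds:
        held_out = set(val_idx)
        train_idx = [j for j in range(n) if j not in held_out]
        w = fit_ridge_slope(
            [xs[j] for j in train_idx], [ys[j] for j in train_idx], lam
        )
        errors.append(
            sum((ys[j] - w * xs[j]) ** 2 for j in val_idx) / len(val_idx)
        )
    return sum(errors) / k

# Made-up data that roughly follows y = 2x, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(xs, ys, lam, k=4) for lam in candidates}
best_lam = min(scores, key=scores.get)
print("CV scores:", scores)
print("selected penalty:", best_lam)
```

The same loop works for comparing different model families: compute a cross-validation score for each candidate and keep the one with the lowest score. If the selected model's score is then reported as its expected test error, it tends to be optimistic, so a separate held-out test set is usually kept for the final estimate.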
Weaknesses
- Can be computationally expensive.
- The splitting scheme must respect the structure of the data, such as time order or grouping.
- Random splits can be invalid for temporal or grouped data.
Exercises
- Why is testing on training data misleading?
- Why is random row splitting dangerous for invoice data?
- When should a time-based split be preferred?