Cross-Validation

Definition

Cross-validation is a method for estimating how well a model will perform on unseen data.

It repeatedly splits the data into training and validation parts.

Main Idea

Instead of judging a model only on the data used to train it, cross-validation tests the model on data that was held out.

This helps estimate generalization error.

K-Fold Cross-Validation

In $K$-fold cross-validation, the data is split into $K$ folds of roughly equal size.

The model is trained $K$ times.

Each time, one fold is used for validation and the others are used for training.

The final score is the average of the $K$ validation errors.
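The procedure above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names `k_fold_indices` and `cross_val_mse` are made up for this note, the folds are contiguous blocks, and the "model" is just the mean of the training targets, standing in for whatever model is actually being evaluated.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    # Fold sizes differ by at most one when n is not divisible by k.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def cross_val_mse(y, k):
    """Average validation MSE of a mean-only model (a stand-in for a real model)."""
    fold_errors = []
    for train, val in k_fold_indices(len(y), k):
        prediction = mean(y[i] for i in train)  # "fit" on the K-1 training folds
        fold_errors.append(mean((y[i] - prediction) ** 2 for i in val))
    return mean(fold_errors)  # average over the K validation errors


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
folds = list(k_fold_indices(len(y), 3))
score = cross_val_mse(y, 3)
```

Each observation lands in exactly one validation fold, so every data point is used for validation once and for training $K-1$ times. For data without time or group structure, the rows would normally be shuffled before the folds are cut.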

Regression Error

For regression, a common validation error is mean squared error:

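\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]

Here $y_i$ is the observed value, $\hat{y}_i$ is the model's prediction, and $n$ is the number of validation observations.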
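As a small worked example of the formula, the following sketch computes the mean squared error for three made-up observations. The function name and the numbers are illustrative only.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
mse = mean_squared_error(y_true, y_pred)  # squared errors are 1, 0, 4
```

Because the errors are squared, the single miss of 2 contributes four times as much as the miss of 1.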

Time Series Warning

For time-ordered data, ordinary random cross-validation can leak future information into the training set: the model may be fit on observations that come after the ones it is validated on.

For retail transactions, a time-based split is often more realistic:

\[\text{train on past} \rightarrow \text{test on future}\]
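A time-based split can be sketched as a simple cutoff on the transaction date. The row layout (`date`, `amount`), the cutoff, and the helper name `time_split` are assumptions made for illustration, not part of any particular dataset or library.

```python
from datetime import date


def time_split(rows, cutoff):
    """Return (train, test): rows strictly before the cutoff, rows on or after it."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test


# Hypothetical retail transactions, deliberately not in date order.
rows = [
    {"date": date(2024, 3, 9), "amount": 12.5},
    {"date": date(2024, 1, 15), "amount": 20.0},
    {"date": date(2024, 2, 3), "amount": 7.0},
    {"date": date(2024, 4, 21), "amount": 31.0},
]

train, test = time_split(rows, cutoff=date(2024, 3, 1))
```

Every training row is older than every test row, so the model never sees data from the period it is evaluated on. Repeating this with several successive cutoffs gives a cross-validation-style average while keeping the time order intact.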

Retail Example

For customer return prediction, train on earlier months and test on later months.

For basket completion, split by invoice so rows from the same invoice do not appear in both training and test sets.
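Splitting by invoice means assigning whole invoices, not individual rows, to one side of the split. A minimal sketch follows, assuming each row carries an `invoice` identifier; the field names, the sample rows, and the helper `split_by_invoice` are hypothetical.

```python
import random


def split_by_invoice(rows, test_fraction=0.25, seed=0):
    """Assign whole invoices to train or test so no invoice appears in both."""
    invoices = sorted({r["invoice"] for r in rows})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(invoices)
    n_test = max(1, round(len(invoices) * test_fraction))
    test_ids = set(invoices[:n_test])
    train = [r for r in rows if r["invoice"] not in test_ids]
    test = [r for r in rows if r["invoice"] in test_ids]
    return train, test


# Hypothetical basket rows: several items per invoice.
rows = [
    {"invoice": "A1", "item": "mug"},
    {"invoice": "A1", "item": "tea"},
    {"invoice": "B2", "item": "lamp"},
    {"invoice": "C3", "item": "pen"},
    {"invoice": "C3", "item": "ink"},
    {"invoice": "D4", "item": "bag"},
]

train, test = split_by_invoice(rows)
train_ids = {r["invoice"] for r in train}
test_ids = {r["invoice"] for r in test}
```

A random row-level split could put "mug" in training and "tea" in test for the same invoice, which lets the model partly see the basket it is asked to complete.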

Strengths

  • Gives a more reliable estimate of test performance than training error or a single train/validation split.
  • Useful for model selection.
  • Helps tune hyperparameters.

Weaknesses

  • Can be computationally expensive, since the model is fit $K$ times.
  • Splits must respect the structure of the data, such as time order or grouping.
  • Random splits can be invalid for temporal or grouped data.

Exercises

  1. Why is testing on training data misleading?
  2. Why is random row splitting dangerous for invoice data?
  3. When should a time-based split be preferred?

See

Regression

Bias-Variance Tradeoff

Random Forests

Splines
