Cross-Validation

Definition

Cross-validation is a method for estimating how well a model will perform on unseen data.

It repeatedly splits the data into training and validation parts.

Main Idea

Instead of judging a model only on the data used to train it, cross-validation tests the model on data that was held out.

This helps estimate generalization error.
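The gap between training error and held-out error can be made concrete with a small sketch. It uses a 1-nearest-neighbour predictor, which memorizes its training points; the data values and the helper names `predict_1nn` and `mse` are made up for illustration.

```python
def predict_1nn(train_x, train_y, x):
    """Predict with the target of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def mse(xs, ys, train_x, train_y):
    """Mean squared error of the 1-NN predictor on the points (xs, ys)."""
    sq = [(y - predict_1nn(train_x, train_y, x)) ** 2 for x, y in zip(xs, ys)]
    return sum(sq) / len(sq)

# Illustrative data: y is roughly 2x plus some hand-picked noise.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.3, 1.8, 4.4, 5.9, 8.5, 9.7, 12.2, 14.1]

train_x, train_y = xs[::2], ys[::2]    # rows used to fit the model
held_x, held_y = xs[1::2], ys[1::2]    # rows the model never sees

train_error = mse(train_x, train_y, train_x, train_y)
held_out_error = mse(held_x, held_y, train_x, train_y)
print(train_error, held_out_error)
```

The training error is exactly zero because each training point is its own nearest neighbour, while the held-out error is not. Judged on training data alone, this model would look perfect.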

K-Fold Cross-Validation

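In k-fold cross-validation, the data is split into $K$ folds of roughly equal size.

The model is trained $K$ times.

Each time, a different fold is held out for validation and the remaining $K-1$ folds are used for training.

The final score is the average of the $K$ validation errors:

\[\text{CV}_K = \frac{1}{K}\sum_{k=1}^{K} E_k\]

where $E_k$ is the validation error measured on fold $k$.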
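The procedure can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the `fit`/`predict` callback convention, and the mean-predictor baseline are all assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(xs, ys, k, fit, predict, seed=0):
    """Average validation MSE over k folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the other k-1 folds only.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Score on the fold that was held out.
        sq = [(ys[i] - predict(model, xs[i])) ** 2 for i in held_out]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / k

# Hypothetical baseline model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
score = cross_val_mse(xs, ys, k=5, fit=fit_mean, predict=predict_mean)
print(score)
```

Every row is used for validation exactly once and for training $K-1$ times. Shuffling before dealing out the folds is appropriate only when rows are exchangeable; it is the wrong choice for time-ordered or grouped data.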

Regression Error

For regression, a common validation error is mean squared error:

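\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]

Here $y_i$ is the observed value, $\hat{y}_i$ is the model's prediction, and $n$ is the number of validation observations.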
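As a quick check of the formula, here is a small sketch with made-up numbers. The function name `mean_squared_error` is chosen for the example and does not refer to any particular library.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# Squared errors are 1, 0, 4, 1, so the mean is 6 / 4 = 1.5.
print(mean_squared_error(y_true, y_pred))
```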

Time Series Warning

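For time-ordered data, ordinary random cross-validation can put later observations in the training set and earlier ones in the validation set, so the model is evaluated with knowledge of the future.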

For retail transactions, a time-based split is often more realistic:

\[\text{train on past} \rightarrow \text{test on future}\]
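One way to apply this idea repeatedly is an expanding-window split, sketched below. It assumes the rows are already sorted by time, oldest first; the function name and its parameters are illustrative.

```python
def expanding_window_splits(n, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs in which every training index
    comes before every test index. Assumes rows are sorted oldest first."""
    first_test_start = n - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough rows for the requested splits")
    for s in range(n_splits):
        start = first_test_start + s * test_size
        yield list(range(start)), list(range(start, start + test_size))

for train_idx, test_idx in expanding_window_splits(n=10, n_splits=3, test_size=2):
    print("train up to row", train_idx[-1], "-> test rows", test_idx)
```

Each successive split trains on a longer history and tests on the block that immediately follows it, so no test row is ever older than a training row.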

Retail Example

For customer return prediction, train on earlier months and test on later months.

For basket completion, split by invoice so rows from the same invoice do not appear in both training and test sets.
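A grouped split of this kind can be sketched as follows. The rows, the `invoice` field name, and the `split_by_group` helper are made up for illustration; real transaction data would need its own key.

```python
import random

def split_by_group(rows, group_key, test_fraction=0.25, seed=0):
    """Assign whole groups to train or test so that no group is in both."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# Hypothetical invoice lines: several rows can share one invoice.
rows = [
    {"invoice": "A1", "item": "mug"},
    {"invoice": "A1", "item": "teapot"},
    {"invoice": "B2", "item": "lamp"},
    {"invoice": "C3", "item": "mug"},
    {"invoice": "C3", "item": "candle"},
    {"invoice": "D4", "item": "frame"},
]

train, test = split_by_group(rows, "invoice")
print(sorted({r["invoice"] for r in train}), sorted({r["invoice"] for r in test}))
```

The split is drawn over invoice identifiers rather than rows. A row-level random split could place "mug" from invoice A1 in training and "teapot" from the same invoice in test, which would let the model see part of the basket it is asked to complete.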

Strengths

Gives a better estimate of test performance than training error does.
Useful for model selection.
Helps tune hyperparameters.
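Hyperparameter tuning with cross-validation can be sketched as a loop over candidate values. The example below picks the number of neighbours for a simple nearest-neighbour regressor; the synthetic data, the candidate list, and the function names are assumptions made for illustration.

```python
import random

def knn_predict(train, x, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)

def cv_mse(data, n_neighbors, n_folds=5):
    """K-fold validation MSE for one candidate value of n_neighbors."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    fold_errors = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        sq = [(y - knn_predict(train, x, n_neighbors)) ** 2 for x, y in held_out]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / n_folds

# Synthetic data: a linear trend plus Gaussian noise, shuffled once.
rng = random.Random(0)
data = [(x, 0.5 * x + rng.gauss(0, 1)) for x in range(40)]
rng.shuffle(data)

candidates = [1, 3, 5, 9, 15]
scores = {k: cv_mse(data, k) for k in candidates}
best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```

The same folds are reused for every candidate so the scores are comparable. The winning score tends to be optimistic because it was selected as the minimum, so a separate test set is still useful for the final estimate.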

Weaknesses

Can be computationally expensive, because the model is refit once per fold.
Splits must respect the structure of the data.
Random splits can be invalid for temporal or grouped data.

Exercises

Why is testing on training data misleading?
Why is random row splitting dangerous for invoice data?
When should a time-based split be preferred?

See Also

Regression

Bias-Variance Tradeoff

Random Forests

Splines
