Kernel Regression
Definition
Kernel regression is a nonparametric regression method that predicts a target value by taking a weighted average of nearby observations.
The most common form is the Nadaraya-Watson estimator:
\[\hat m(x) = \frac{\sum_{i=1}^{n} K\left(\frac{x-x_i}{h}\right)y_i}{\sum_{i=1}^{n} K\left(\frac{x-x_i}{h}\right)}\]
Components
- $x$ is the input value where we want a prediction.
- $x_i$ is a historical input value.
- $y_i$ is the observed output for case $i$.
- $K$ is the kernel function.
- $h$ is the bandwidth.
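The estimator can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `gaussian_kernel` and `nadaraya_watson`, the choice of a Gaussian kernel, and the toy data are assumptions made for this example, not part of the definition.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel K(u) = exp(-u^2 / 2) (assumed kernel choice)."""
    return math.exp(-0.5 * u * u)

def nadaraya_watson(x, xs, ys, h, kernel=gaussian_kernel):
    """Weighted average of ys with weights K((x - x_i) / h)."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    weights = [kernel((x - xi) / h) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        # Every observation is too far from x for this bandwidth.
        raise ValueError("all kernel weights are zero; try a larger h")
    return sum(w * yi for w, yi in zip(weights, ys)) / total

# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
prediction = nadaraya_watson(3.0, xs, ys, h=1.0)
print(prediction)  # close to 6.0, because the data are symmetric around x = 3
```

Any constant factor in the kernel appears in both the numerator and the denominator, so it cancels; this is why the sketch can use an unnormalized kernel.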
Main Idea
Observations closer to $x$ receive more weight.
Observations far from $x$ receive less weight.
The prediction is a local weighted average.
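To make the weighting concrete, the sketch below computes the normalized weight of each observation for one query point. It assumes a Gaussian kernel, and the helper name `normalized_weights` and the input values are hypothetical.

```python
import math

def normalized_weights(x, xs, h):
    """Return w_i = K((x - x_i) / h) / sum_j K((x - x_j) / h), Gaussian K assumed."""
    raw = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    total = sum(raw)
    return [r / total for r in raw]

xs = [0.0, 1.0, 2.0, 5.0]  # hypothetical inputs
weights = normalized_weights(1.0, xs, h=1.0)

for xi, w in zip(xs, weights):
    print(f"x_i = {xi:.1f}  weight = {w:.3f}")
```

The weights sum to 1, the observation at the query point gets the largest weight, and the distant observation at 5.0 contributes almost nothing.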
Kernel Function
A kernel function controls how similarity decreases with distance.
A common example is the Gaussian kernel:
\[K(u) = e^{-\frac{u^2}{2}}\]
where:
\[u = \frac{x-x_i}{h}\]
Bandwidth
The bandwidth $h$ controls smoothness.
Small $h$:
- Uses only very close observations.
- More flexible (lower bias).
- Noisier (higher variance).
Large $h$:
- Uses many observations.
- Smoother (lower variance).
- More biased.
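The two extremes can be illustrated with a small sketch. It assumes a Gaussian kernel; the helper `nw` and the deliberately jagged data are hypothetical.

```python
import math

def nw(x, xs, ys, h):
    """Nadaraya-Watson estimate with a Gaussian kernel (assumed)."""
    w = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 5.0, 2.0, 6.0, 3.0]  # hypothetical, deliberately jagged
global_mean = sum(ys) / len(ys)

small_h = nw(2.0, xs, ys, h=0.1)    # dominated by the single point at x_i = 2
large_h = nw(2.0, xs, ys, h=100.0)  # nearly equal weights for every point

print(f"h = 0.1   -> {small_h:.4f}")
print(f"h = 100.0 -> {large_h:.4f}")
print(f"global mean = {global_mean:.4f}")
```

With a very small bandwidth the estimate essentially reproduces the nearest observed value, noise included. With a very large bandwidth it approaches the global mean of the outputs, whatever the value of $x$.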
Basket Size Example
Let:
- $x$ = current size of the basket being predicted.
- $x_i$ = current size recorded for historical basket $i$.
- $y_i$ = final basket size of historical basket $i$.
Kernel regression predicts final basket size by giving high weight to historical baskets with similar current size.
For example, if $x=10$, baskets with current size 9, 10, or 11 may receive high weight, while baskets with current size 100 receive very low weight.
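A minimal sketch of this example follows. The basket records, the bandwidth `h = 2.0`, and the Gaussian kernel are all assumptions chosen for illustration; they are not real data.

```python
import math

# Hypothetical historical baskets: (current size, final size).
baskets = [(9, 14), (10, 15), (11, 16), (100, 120)]

x = 10   # current size of the basket we want to predict
h = 2.0  # assumed bandwidth

# Gaussian kernel weight for each historical basket.
weights = [math.exp(-0.5 * ((x - size) / h) ** 2) for size, _ in baskets]
prediction = sum(w * final for w, (_, final) in zip(weights, baskets)) / sum(weights)

for (size, final), w in zip(baskets, weights):
    print(f"current size {size:>3}: weight {w:.4f}, final size {final}")
print(f"predicted final size: {prediction:.2f}")
```

The basket with current size 100 gets a weight that is zero to machine precision, so the prediction is driven entirely by the baskets of size 9, 10, and 11.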
Difference From Simple Conditional Mean Estimation
Simple conditional mean estimation may use only exact matches, that is, observations with:
\[x_i = x\]
Kernel regression instead uses approximate matches, weighted by their distance from $x$.
This is useful when exact matches are rare.
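The contrast can be sketched as follows. The function names, the data, and the Gaussian kernel are assumptions for illustration; the exact-match estimator here simply returns `None` when no observation matches.

```python
import math

def exact_match_mean(x, xs, ys):
    """Mean of y_i over observations with x_i == x, or None if there are none."""
    matches = [yi for xi, yi in zip(xs, ys) if xi == x]
    return sum(matches) / len(matches) if matches else None

def kernel_estimate(x, xs, ys, h):
    """Nadaraya-Watson estimate with a Gaussian kernel (assumed)."""
    w = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

# Hypothetical data with no observation at x = 10.
xs = [8, 9, 11, 12]
ys = [12, 13, 15, 16]

exact = exact_match_mean(10, xs, ys)
smooth = kernel_estimate(10, xs, ys, h=1.0)

print(exact)   # None: there is no exact match to average
print(smooth)  # close to 14.0, borrowed from the neighbors on both sides
```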
Difference From k-NN Regression
Both methods use nearby observations.
| Method | Neighborhood size | Weights |
|---|---|---|
| Kernel regression | Controlled by bandwidth $h$ | Smooth distance weights |
| k-NN regression | Controlled by number of neighbors $k$ | Often equal weights |
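The difference in weighting can be seen in a small side-by-side sketch. It assumes equal-weight k-NN and a Gaussian kernel; the function names and the data are hypothetical.

```python
import math

def knn_regression(x, xs, ys, k):
    """Equal-weight average of the k nearest observations."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def kernel_regression(x, xs, ys, h):
    """Nadaraya-Watson estimate with a Gaussian kernel (assumed)."""
    w = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

# Hypothetical data where y equals x.
xs = [1.0, 2.0, 3.0, 10.0]
ys = [1.0, 2.0, 3.0, 10.0]

knn_pred = knn_regression(2.5, xs, ys, k=3)
kernel_pred = kernel_regression(2.5, xs, ys, h=1.0)

print(f"k-NN (k = 3):   {knn_pred:.3f}")
print(f"kernel (h = 1): {kernel_pred:.3f}")
```

Here k-NN averages the points at 1, 2, and 3 equally, while the kernel estimate gives the point at 1 less weight than the two closer points, so it lands nearer to the query value 2.5.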
Strengths
- Flexible.
- Smooth predictions.
- Uses nearby data instead of exact matches only.
- Good for one-dimensional or low-dimensional problems.
Weaknesses
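- Results depend heavily on the bandwidth $h$, which must be chosen carefully (for example, by cross-validation).
- Performs poorly in high dimensions, because few observations lie close to any given point.
- Can be biased near boundaries, where observations lie on only one side of $x$.
- Sparse regions produce unstable estimates, because few observations receive meaningful weight.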
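Boundary bias is easy to reproduce in a small sketch. It assumes a Gaussian kernel and hypothetical noise-free data with $y_i = x_i$, so any gap between the estimate and $x$ comes from the estimator rather than from noise.

```python
import math

def nw(x, xs, ys, h):
    """Nadaraya-Watson estimate with a Gaussian kernel (assumed)."""
    w = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

# Hypothetical noise-free data on the line y = x, for x = 0, 1, ..., 10.
xs = [float(i) for i in range(11)]
ys = list(xs)

interior = nw(5.0, xs, ys, h=1.0)  # neighbors on both sides balance out
boundary = nw(0.0, xs, ys, h=1.0)  # all neighbors lie to the right

print(f"estimate at x = 5: {interior:.3f} (true value 5)")
print(f"estimate at x = 0: {boundary:.3f} (true value 0)")
```

At the left edge every neighbor has a larger output than the true value, so the weighted average is pulled upward.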
Exercises
- Explain the role of the kernel function.
- Explain the role of the bandwidth $h$.
- In basket-size prediction, why might kernel regression be better than exact matching on basket size?