
k-NN Regression

Definition

k-nearest-neighbors regression, or k-NN regression, is a nonparametric method that predicts a target value by averaging the target values of the $k$ most similar observations.

The prediction is:

\[\hat m(x) = \frac{1}{k}\sum_{i \in N_k(x)} y_i\]

where:

  • $N_k(x)$ is the set of the $k$ nearest observations to $x$.
  • $y_i$ is the target value of neighbor $i$.
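The averaging rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming a single numeric feature and absolute difference as the distance; the function name knn_predict and the data are made up for the example.

```python
def knn_predict(x, xs, ys, k):
    """Average the targets of the k training points closest to x (1-D inputs)."""
    # Rank training indices by absolute distance to the query point.
    order = sorted(range(len(xs)), key=lambda i: abs(x - xs[i]))
    neighbors = order[:k]  # plays the role of N_k(x)
    return sum(ys[i] for i in neighbors) / k

# Made-up illustrative data, not taken from any real dataset.
xs = [1.0, 2.0, 3.0, 6.0, 7.0]
ys = [2.0, 4.0, 6.0, 12.0, 14.0]

# The two nearest points to 2.5 are x=2.0 and x=3.0, so the result is (4 + 6) / 2.
prediction = knn_predict(2.5, xs, ys, k=2)
```

Ties in distance are broken here by position in the data, which is only one of several reasonable conventions.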

Main Idea

To predict what happens for a new observation, look at the most similar historical observations and average their outcomes.

Distance

The word nearest depends on a distance function.

For one variable:

\[d(x, x_i) = |x - x_i|\]

For multiple variables:

\[d(x, x_i) = \sqrt{\sum_{j=1}^{p}(x_j - x_{ij})^2}\]

This is Euclidean distance.
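As a small sketch, the Euclidean distance can be computed directly from its definition. The helper name euclidean is an assumption for illustration; it treats each observation as a list of p numbers.

```python
import math

def euclidean(x, xi):
    """Euclidean distance between two observations with the same number of features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xi)))

# Two features: differences of 3 and 4 give a distance of 5.
d_two_features = euclidean([0.0, 0.0], [3.0, 4.0])

# One feature: the formula reduces to the absolute difference |x - x_i|.
d_one_feature = euclidean([2.0], [5.0])
```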

Basket Size Example

Suppose:

  • $x$ = current state of a basket.
  • $y_i$ = final basket size of historical basket $i$.

A simple one-dimensional version uses the current item count as the only feature:

\[x = \text{current item count}\]

The model finds the $k$ historical baskets whose recorded item counts are closest to the current basket's item count and averages their final basket sizes.
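A worked version of this one-dimensional basket example might look like the sketch below. The historical records are invented for illustration, and each is assumed to be a pair (item count when observed, final basket size).

```python
# Hypothetical history: (item count when observed, final basket size).
history = [(1, 4), (2, 5), (3, 9), (3, 7), (5, 10), (6, 12)]

def predict_final_size(current_count, history, k):
    """Average the final sizes of the k baskets whose item count is closest."""
    ranked = sorted(history, key=lambda rec: abs(rec[0] - current_count))
    nearest = ranked[:k]
    return sum(final for _, final in nearest) / k

# For a basket with 3 items and k = 3, the nearest records are
# (3, 9), (3, 7) and (2, 5), so the prediction is (9 + 7 + 5) / 3.
predicted = predict_final_size(3, history, k=3)
```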

Choosing k

The value of $k$ controls smoothness.

Small $k$:

  • More local.
  • More flexible.
  • Higher variance.

Large $k$:

  • More stable.
  • Smoother.
  • Higher bias.
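The two extremes can be seen on a tiny made-up dataset. In this sketch (one feature, absolute-difference distance, hypothetical helper knn_predict), $k = 1$ reproduces every training target exactly, while $k$ equal to the sample size predicts the overall mean everywhere.

```python
def knn_predict(x, xs, ys, k):
    """Plain k-NN average for one numeric feature."""
    order = sorted(range(len(xs)), key=lambda i: abs(x - xs[i]))
    return sum(ys[i] for i in order[:k]) / k

# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 2.0, 6.0, 8.0]

# k = 1: each training point is its own nearest neighbor, so the fit
# follows every fluctuation in the data (low bias, high variance).
fit_k1 = [knn_predict(x, xs, ys, 1) for x in xs]

# k = n: every prediction is the global mean, a flat line
# (low variance, high bias).
fit_kn = [knn_predict(x, xs, ys, len(xs)) for x in xs]
```

In practice $k$ is usually chosen between these extremes, for instance by comparing prediction error on held-out data.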

Weighted k-NN

A weighted version gives closer neighbors more importance:

\[\hat m(x) = \frac{\sum_{i \in N_k(x)} w_i y_i}{\sum_{i \in N_k(x)} w_i}\]

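where $w_i$ is the weight of neighbor $i$, with larger weights assigned to closer points. A common choice is inverse distance, $w_i = 1 / d(x, x_i)$ for $d(x, x_i) > 0$.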
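The weighted average can be sketched as follows. The inverse-distance weights used here are one assumed choice among many, and the small constant eps is an illustrative guard against division by zero when the query coincides with a training point.

```python
def weighted_knn_predict(x, xs, ys, k, eps=1e-9):
    """Inverse-distance weighted k-NN average for one numeric feature."""
    # Indices of the k nearest training points by absolute distance.
    order = sorted(range(len(xs)), key=lambda i: abs(x - xs[i]))[:k]
    # Assumed weighting scheme: w_i = 1 / (distance + eps).
    weights = [1.0 / (abs(x - xs[i]) + eps) for i in order]
    return sum(w * ys[i] for w, i in zip(weights, order)) / sum(weights)

# Made-up illustrative data.
xs = [1.0, 4.0, 9.0]
ys = [10.0, 20.0, 50.0]

# The query 2.0 is twice as close to x=1.0 as to x=4.0, so the target 10.0
# receives twice the weight of 20.0.
weighted = weighted_knn_predict(2.0, xs, ys, k=2)
unweighted = (ys[0] + ys[1]) / 2
```

Compared with the unweighted average of the same two neighbors, the weighted prediction is pulled toward the target of the closer point.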

Strengths

  • Simple to understand.
  • No training phase beyond storing the data.
  • Can model nonlinear relationships.
  • Works well when similarity is meaningful.

Weaknesses

  • Slow prediction on large datasets.
  • Sensitive to irrelevant features.
  • Sensitive to feature scaling.
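  • Performs poorly in high dimensions, where distances between points become less informative.
  • Rare extreme cases tend to be poorly predicted, because few similar observations exist and averaging pulls predictions toward typical values.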

Feature Scaling

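k-NN depends on distance, so features must be on comparable scales.

For example, if one feature is measured in euros and another in item counts, the feature with the larger numeric range (often the euro amount) dominates the distance unless the features are standardized first.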
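One common way to put features on comparable scales is z-score standardization: subtract the feature's mean and divide by its standard deviation. The sketch below uses made-up euro and item-count columns; the helper name standardize is an assumption for illustration.

```python
import statistics

def standardize(column):
    """Rescale a feature to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)  # population standard deviation
    # Note: a constant column has sd == 0 and would need special handling.
    return [(v - mean) / sd for v in column]

# Made-up columns: basket value in euros and number of items.
spend_eur = [10.0, 20.0, 30.0, 40.0]
item_count = [1.0, 2.0, 3.0, 4.0]

z_spend = standardize(spend_eur)
z_items = standardize(item_count)
```

Before scaling, a gap between two baskets in euros is ten times the corresponding gap in items; after scaling, both features contribute to the distance on the same footing.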

Relation to Conditional Mean Estimation

k-NN regression estimates:

\[E[Y \mid X=x]\]

by averaging the observed $Y$ values near $x$.

Exercises

  1. Explain why k-NN regression is nonparametric.
  2. What happens if $k$ is too small?
  3. What happens if $k$ is too large?
  4. For basket prediction, list three possible features for measuring similarity between baskets.

See

Nonparametric Regression

Conditional Mean Estimation

Kernel Regression
