Regression




Definition
Regression is the task of predicting a numerical target variable from one or more input variables.
The target variable is usually written as:
\[Y\]
The input variables are usually written as:
\[X\]
The goal is to learn a function:
\[\hat y = f(x)\]
that predicts $Y$ from $X=x$.
Central Quantity
The central theoretical object in regression is the conditional mean:
\[m(x) = E[Y \mid X=x]\]
This is the expected value of $Y$ given that the input is $x$.
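When the input takes only a few distinct values, the conditional mean can be estimated directly by averaging the observed targets within each input value. The following is a minimal sketch of that idea; the function name `conditional_means` and the data are made up for illustration and are not part of the original notes.

```python
from collections import defaultdict


def conditional_means(xs, ys):
    """Estimate m(x) = E[Y | X = x] by averaging y over observations that share x."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


# Made-up data: x = items in the basket so far, y = final basket size.
xs = [1, 1, 2, 2, 2, 3]
ys = [2.0, 4.0, 3.0, 5.0, 4.0, 6.0]

m_hat = conditional_means(xs, ys)
print(m_hat)  # {1: 3.0, 2: 4.0, 3: 6.0}
```

This only works when each input value is observed several times. With continuous inputs, exact repeats are rare, which is why the methods on the rest of this page either assume a model form or average over nearby inputs.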
A regression model estimates this function:
\[\hat m(x) \approx m(x)\]
Parametric Regression
In Parametric Regression, we assume a fixed model form with a finite number of parameters.
Examples:
- Linear Regression
- Polynomial Regression
- Logistic Regression
- Poisson Regression
- Exponential Regression
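As an illustration of a fixed model form with finitely many parameters, the sketch below fits simple linear regression, $\hat y = b_0 + b_1 x$, by ordinary least squares using the standard closed-form solution. The helper name `fit_line` and the data are assumptions chosen for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = sample covariance of (x, y) divided by sample variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

b0, b1 = fit_line(xs, ys)
print(b0, b1)  # 1.0 2.0
```

The whole fitted function is described by the two numbers `b0` and `b1`, no matter how many observations are used. That is what makes the model parametric.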
Nonparametric Regression
In Nonparametric Regression, we do not assume one fixed global shape for the regression function.
Examples:
- Conditional Mean Estimation
- Kernel Regression
- k-NN Regression
- Local Smoothing
- Splines
- Regression Trees
- Random Forests
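To make the contrast with parametric models concrete, here is a minimal sketch of k-NN regression for a single numeric input: the prediction at a point is the average target of the `k` closest training inputs. The function name `knn_predict` and the data are illustrative assumptions.

```python
def knn_predict(x0, xs, ys, k=3):
    """Predict at x0 by averaging the targets of the k nearest training inputs."""
    # Indices of the training points, ordered by distance to x0.
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / len(nearest)


# Made-up data following a curved relationship (y = x squared).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0]

pred = knn_predict(3.1, xs, ys, k=3)
print(pred)  # average of 9.0, 16.0 and 4.0
```

No global shape is assumed here. The prediction is a local average, so the fitted function can follow any pattern in the data, at the cost of needing enough observations near each query point.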
Tree-Based and Ensemble Regression
Tree-based methods are flexible nonparametric regression tools.
Prediction Error
A common regression error measure is squared error:
\[(y_i - \hat y_i)^2\]
The mean squared error is:
\[MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat y_i)^2\]
Among all functions of $x$, the conditional mean $m(x) = E[Y \mid X=x]$ minimizes the expected squared error, so it is the best prediction under squared-error loss.
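The sketch below computes the mean squared error and checks the simplest version of the optimality claim: when the same constant is predicted for every observation, the sample mean gives a lower MSE than other constants. The function name `mse` and the numbers are made up for illustration.

```python
def mse(ys, preds):
    """Mean squared error between observed targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up targets with sample mean 5.0.
ys = [2.0, 4.0, 9.0]
mean_y = sum(ys) / len(ys)

# Compare constant predictions: below the mean, at the mean, above the mean.
errors = {c: mse(ys, [c] * len(ys)) for c in (3.0, mean_y, 6.0)}
for c, err in errors.items():
    print(c, round(err, 3))

best = min(errors, key=errors.get)
print(best)  # 5.0
```

The conditional-mean statement is the same fact applied separately at each input value $x$: among all possible predictions for that $x$, the mean of $Y$ given $X=x$ has the smallest expected squared error.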
Retail Examples
Regression can be used to predict:
- Final basket size from partial basket size.
- Customer spending from recency and frequency.
- Future demand from past demand.
- Customer lifetime value from purchasing history.
Classification Connection
Some methods called regression are used for classification.
For example, Logistic Regression predicts:
\[P(Y=1 \mid X=x)\]
This works because for binary targets coded as $Y \in \{0, 1\}$:
\[E[Y \mid X=x] = 0 \cdot P(Y=0 \mid X=x) + 1 \cdot P(Y=1 \mid X=x) = P(Y=1 \mid X=x)\]
Model Evaluation
Exercises
- Explain why regression is connected to conditional expectation.
- Give one retail example of a regression problem.
- Explain the difference between parametric and nonparametric regression.