Linear Regression
Definition
Linear regression is a parametric regression method that models the target variable as a linear function of one or more input variables.
For one input variable:
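\[Y = \beta_0 + \beta_1X + \varepsilon\]
For multiple input variables:
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p + \varepsilon\]
Here $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients, and $\varepsilon$ is a random error term.
Prediction Function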
The fitted prediction rule is:
\[\hat Y = \hat\beta_0 + \hat\beta_1X_1 + \hat\beta_2X_2 + \cdots + \hat\beta_pX_p\]
where $\hat\beta_j$ are estimated from the data.
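The prediction rule is just an intercept plus a dot product, which a short NumPy sketch can make concrete. The function name `predict` and the coefficient values below are hypothetical, chosen only for illustration.

```python
import numpy as np

def predict(X, intercept, coefs):
    """Return intercept + X @ coefs for an (n, p) feature matrix X."""
    X = np.asarray(X, dtype=float)
    coefs = np.asarray(coefs, dtype=float)
    return intercept + X @ coefs

# Hypothetical coefficients for p = 2 inputs (illustrative values only).
X = np.array([[1.0, 2.0],
              [3.0, 0.5]])
y_hat = predict(X, intercept=0.5, coefs=[2.0, -1.0])
print(y_hat)
```

Each row of `X` is one observation, so the result has one prediction per row.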
Core Idea
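Linear regression finds the line (one input), plane (two inputs), or hyperplane (more inputs) that best fits the observed pairs of $X$ and $Y$.
The most common fitting method is ordinary least squares, which chooses the coefficients that minimize the sum of squared errors over the $n$ observations: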
\[\sum_{i=1}^{n}(y_i - \hat y_i)^2\]
Residuals
The residual for observation $i$ is:
\[e_i = y_i - \hat y_i\]
It is the signed gap between the observed value and the prediction: positive when the model underpredicts, negative when it overpredicts.
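An ordinary least squares fit and its residuals can be sketched with NumPy's least-squares solver. The helper name `fit_ols` and the data are made up for illustration; this is one possible way to compute the fit, not the only one.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (intercept, coefs)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D input as a single feature column
    y = np.asarray(y, dtype=float)
    # A leading column of ones lets the solver estimate the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

# Made-up data lying close to y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

intercept, coefs = fit_ols(x, y)
y_hat = intercept + coefs[0] * x   # fitted values
residuals = y - y_hat              # e_i = y_i - y_hat_i
sse = np.sum(residuals ** 2)       # the quantity least squares minimizes
```

For this data the fit gives an intercept of about 1.06 and a slope of about 1.97. Because the model includes an intercept, the residuals sum to zero up to floating-point error.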
Interpretation of Coefficients
In simple linear regression:
\[\hat Y = \hat\beta_0 + \hat\beta_1X\]
- $\hat\beta_0$ is the intercept: the predicted value of $Y$ when $X = 0$.
- $\hat\beta_1$ is the slope: the change in the predicted $Y$ for a one-unit increase in $X$.
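For reference, when simple linear regression is fit by ordinary least squares, both estimates have a closed form. Here $\bar x$ and $\bar y$ denote the sample means of the inputs and targets (notation introduced for this note only):
\[\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}\]
\[\hat\beta_0 = \bar y - \hat\beta_1\bar x\]
The second equation shows that the fitted line always passes through the point $(\bar x, \bar y)$.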
Relation to Conditional Mean
Linear regression estimates the conditional mean under a linear assumption:
\[E[Y \mid X=x] = \beta_0 + \beta_1x\]
So linear regression is a restricted form of Conditional Mean Estimation.
Assumptions
The classical assumptions are:
- The relationship is approximately linear.
- Errors have mean zero.
- Errors have constant variance.
- Observations are independent.
- There is no serious multicollinearity among predictors.
Example: Basket Size Prediction
Let:
- $X$ = number of observed items so far.
- $Y$ = final item count in the basket.
A linear model would be:
\[\hat Y = \hat\beta_0 + \hat\beta_1X\]
If $\hat\beta_1 = 1.8$, then each additional observed item is associated with an increase of about 1.8 items in the predicted final count, on average.
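A minimal sketch of this example, assuming a small made-up dataset (not real basket data) that was chosen so the fitted slope comes out at 1.8. The variable names are hypothetical.

```python
import numpy as np

# Made-up observations, for illustration only.
observed_items = np.array([1, 2, 3, 4, 5, 6], dtype=float)  # X
final_items = np.array([3, 5, 6, 9, 10, 12], dtype=float)   # Y

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(observed_items, final_items, 1)

# Predicted final basket size for a shopper with 10 observed items.
predicted_final = intercept + slope * 10
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted_final:.1f}")
```

On this data the fit is $\hat Y = 1.2 + 1.8X$, so a basket with 10 observed items gets a predicted final count of 19.2.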
Strengths
- Simple.
- Interpretable.
- Fast.
- Good baseline model.
- Useful for explaining relationships.
Weaknesses
- Cannot represent curved relationships unless transformed features (for example, polynomial terms) are added.
- Sensitive to outliers.
- Can underpredict extreme values.
- Assumes a constant marginal effect.
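The sensitivity to outliers can be illustrated with a small sketch. The data are synthetic: five points lying exactly on $y = 2x$, then the same points with one value corrupted.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_clean = 2.0 * x            # points exactly on y = 2x
y_outlier = y_clean.copy()
y_outlier[-1] += 20.0        # corrupt a single observation

# Degree-1 polyfit returns [slope, intercept]; only the slope is kept here.
slope_clean, _ = np.polyfit(x, y_clean, 1)
slope_outlier, _ = np.polyfit(x, y_outlier, 1)
print(f"clean slope={slope_clean:.2f}, slope with outlier={slope_outlier:.2f}")
```

Here one corrupted point moves the fitted slope from 2 to 6, because squaring the errors gives large deviations a disproportionate weight.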
Diagnostics
Useful checks include:
- Residual plot.
- Predicted vs actual plot.
- Error distribution.
- $R^2$.
- Mean absolute error.
- Root mean squared error.
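The three numeric diagnostics can be computed directly from observed and predicted values. The sketch below uses standard definitions; the function name `regression_metrics` and the sample values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, mean absolute error, and root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(errors)),
        "rmse": np.sqrt(np.mean(errors ** 2)),
    }

# Made-up values for illustration.
y_true = np.array([3.0, 5.0, 6.0, 9.0])
y_pred = np.array([2.0, 5.0, 7.0, 9.0])
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE and RMSE are in the same units as $Y$, while $R^2$ is unitless. RMSE penalizes large errors more heavily than MAE, so a wide gap between the two suggests a few large misses.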
Exercises
- Fit a linear regression model where $X$ is current basket size and $Y$ is final basket size.
- Plot residuals against actual final basket size.
- Explain why linear regression may underestimate very large baskets.