Linear Regression

Definition

Linear regression is a parametric regression method that models the target variable as a linear function of one or more input variables.

For one input variable:

\[Y = \beta_0 + \beta_1X + \varepsilon\]

For multiple input variables:

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p + \varepsilon\]

Prediction Function

The fitted prediction rule is:

\[\hat Y = \hat\beta_0 + \hat\beta_1X_1 + \hat\beta_2X_2 + \cdots + \hat\beta_pX_p\]

where $\hat\beta_j$ are estimated from the data.
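
As a minimal sketch of the prediction rule, the function below computes $\hat Y$ as the intercept plus a weighted sum of the inputs. The function name `predict` and the coefficient values are hypothetical and chosen only for illustration; they are not estimated from any data.

```python
def predict(intercept, coefs, x):
    """Return intercept + sum_j coefs[j] * x[j] for one observation x."""
    if len(coefs) != len(x):
        raise ValueError("coefs and x must have the same length")
    return intercept + sum(b * v for b, v in zip(coefs, x))


# Hypothetical fitted coefficients for a model with two inputs.
beta0_hat = 2.0
beta_hat = [0.5, -1.0]

y_hat = predict(beta0_hat, beta_hat, [4.0, 1.0])
print(y_hat)  # 2.0 + 0.5*4.0 - 1.0*1.0 = 3.0
```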

Core Idea

Linear regression fits the line (one input), plane (two inputs), or hyperplane (more than two inputs) that best predicts $Y$ from $X$ under a chosen fitting criterion.

The most common fitting method is ordinary least squares, which chooses the coefficients that minimize the sum of squared residuals:

\[\sum_{i=1}^{n}(y_i - \hat y_i)^2\]
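
For one input variable, the least-squares minimizer has a well-known closed form: the slope is the ratio of the sample covariance of $X$ and $Y$ to the sample variance of $X$, and the intercept makes the line pass through the point of means. The sketch below implements that closed form in plain Python. The function names and the small data set are made up for illustration.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x must take at least two distinct values")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum over i of (y_i - y_hat_i)^2 for the line intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_simple_ols(xs, ys)
print(round(b0, 2), round(b1, 2))  # 1.09 1.97
```

Moving either coefficient away from the fitted value can only increase `sum_squared_residuals`, which is a quick way to sanity-check a fit.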

Residuals

The residual for observation $i$ is:

\[e_i = y_i - \hat y_i\]

It measures how far the prediction is from the observed value.
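
A small sketch of the residual computation, using made-up observed and predicted values. With the sign convention $e_i = y_i - \hat y_i$, a positive residual means the model under-predicted and a negative residual means it over-predicted.

```python
def residuals(ys, y_hats):
    """Return e_i = y_i - y_hat_i for each observation."""
    if len(ys) != len(y_hats):
        raise ValueError("ys and y_hats must have the same length")
    return [y - y_hat for y, y_hat in zip(ys, y_hats)]


# Made-up illustrative values.
observed = [3.0, 5.0, 8.0]
predicted = [3.5, 5.0, 7.0]

res = residuals(observed, predicted)
print(res)  # [-0.5, 0.0, 1.0]
```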

Interpretation of Coefficients

In simple linear regression:

\[\hat Y = \hat\beta_0 + \hat\beta_1X\]
  • $\hat\beta_0$ is the predicted value of $Y$ when $X = 0$.
  • $\hat\beta_1$ is the estimated change in the mean of $Y$ for a one-unit increase in $X$.

Relation to Conditional Mean

Linear regression estimates the conditional mean under a linear assumption:

\[E[Y \mid X=x] = \beta_0 + \beta_1x\]

So linear regression is a restricted form of Conditional Mean Estimation.

Assumptions

The classical assumptions are:

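  • The relationship between the predictors and the mean of $Y$ is approximately linear.
  • Errors have mean zero given the predictors.
  • Errors have constant variance (homoscedasticity).
  • Observations are independent.
  • There is no serious multicollinearity among predictors: no predictor is an exact linear combination of the others, and strong correlations between predictors are avoided because they make individual coefficient estimates unstable.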

Example: Basket Size Prediction

Let:

  • $X$ = number of observed items so far.
  • $Y$ = final item count in the basket.

A linear model would be:

\[\hat Y = \hat\beta_0 + \hat\beta_1X\]

If $\hat\beta_1 = 1.8$, then each additional observed item is associated with about 1.8 more items in the predicted final basket.
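
The arithmetic can be sketched as follows. Only the slope of 1.8 comes from the example; the intercept of 2.0 is an assumed value chosen purely for illustration, and the function name is hypothetical.

```python
# Slope from the example above; the intercept is an assumed illustrative value.
SLOPE = 1.8
INTERCEPT = 2.0


def predict_final_items(observed_items):
    """Predicted final item count given the number of items observed so far."""
    return INTERCEPT + SLOPE * observed_items


at_five = predict_final_items(5)
at_six = predict_final_items(6)
difference = at_six - at_five  # equals the slope, up to floating-point rounding

print(at_five, round(difference, 2))  # 11.0 1.8
```

Whatever the intercept is, the difference between predictions at $X$ and $X + 1$ is always the slope, which is what "constant marginal effect" means for this model.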

Strengths

  • Simple.
  • Interpretable.
  • Fast.
  • Good baseline model.
  • Useful for explaining relationships.

Weaknesses

  • Cannot naturally represent curved relationships.
  • Sensitive to outliers.
  • Can underpredict extreme values.
  • Assumes a constant marginal effect.

Diagnostics

Useful checks include:

  • Residual plot.
  • Predicted vs actual plot.
  • Error distribution.
  • $R^2$.
  • Mean absolute error.
  • Root mean squared error.
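
The three numeric checks at the end of the list can be computed directly from observed and predicted values. The sketch below uses the standard definitions: $R^2 = 1 - SS_{res}/SS_{tot}$, the mean of absolute errors, and the square root of the mean squared error. The function name and the sample values are made up for illustration.

```python
import math


def regression_metrics(ys, y_hats):
    """Return R^2, mean absolute error, and root mean squared error."""
    n = len(ys)
    if n == 0 or n != len(y_hats):
        raise ValueError("ys and y_hats must be non-empty and of equal length")
    errors = [y - y_hat for y, y_hat in zip(ys, y_hats)]
    ss_res = sum(e * e for e in errors)
    y_mean = sum(ys) / n
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Made-up illustrative values.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]

metrics = regression_metrics(observed, predicted)
print(metrics)  # r2 = 0.975, mae = 0.25, rmse is about 0.354
```

MAE and RMSE are in the units of $Y$, while $R^2$ is unitless, so they answer different questions and are usually reported together.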

Exercises

  1. Fit a linear regression model where $X$ is current basket size and $Y$ is final basket size.
  2. Plot residuals against actual final basket size.
  3. Explain why linear regression may underestimate very large baskets.

See

Parametric Regression

Polynomial Regression

Conditional Mean Estimation
