6  The Simple Linear Regression Model

From Scatter Plot to Population Model

Regression
SLR
Assumptions
Author

Jake Anderson

Published

March 21, 2026

Modified

March 26, 2026

Abstract

This chapter builds the simple linear regression model from scratch. Starting with a scatter plot of food expenditure vs income, we motivate the need for an error term, derive the population regression function from the zero conditional mean assumption, state the six assumptions SR1 through SR6, and distinguish between population parameters and sample estimates.

Note: Prerequisites

You should be comfortable with expected value, variance, and the normal distribution from Chapters 2–4.

6.1 The Food Expenditure Data

In Chapter 1, we saw a scatter plot of weekly food expenditure versus weekly household income for 40 Australian households. A positive relationship is visible, but at any given income level, different households spend very different amounts. A deterministic model like \(y = 83 + 10x\) fails immediately: it predicts the same spending for all households at the same income, which is clearly wrong.

6.2 Adding the Error Term

What explains the variation in food expenditure beyond income? Household composition, dietary preferences, location, whether the household eats out or cooks at home, impulse shopping, and seasonal effects. Collectively, these factors are represented by a single random variable \(e\):

Definition 6.1 (Simple Linear Regression Model) \[ y_i = \beta_1 + \beta_2 x_i + e_i, \quad i = 1, \ldots, N \tag{6.1}\]

where \(y_i\) is the dependent variable, \(x_i\) is the explanatory variable, \(\beta_1\) (intercept) and \(\beta_2\) (slope) are unknown population parameters, and \(e_i\) is the random error term.

The model splits each observation into a systematic component (\(\beta_1 + \beta_2 x_i\), the part of \(y_i\) that depends on \(x_i\) through the model) and a random component (\(e_i\), everything else affecting \(y_i\)). The parameters are fixed population constants that we never observe directly.

The error term \(e_i\) is a catch-all for everything the model leaves out. It is not merely measurement error; it reflects genuine heterogeneity among households at the same income level.
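The decomposition into a systematic and a random component can be made concrete with a small simulation. This is only a sketch: the parameter values (\(\beta_1 = 83\), \(\beta_2 = 10\), error standard deviation 45) are hypothetical numbers chosen to mimic the food expenditure example, not estimates from the data.

```python
# Sketch of Equation 6.1: y_i = beta1 + beta2*x_i + e_i.
# All parameter values below are hypothetical, for illustration only.
import numpy as np

rng = np.random.default_rng(42)
beta1, beta2, sigma = 83.0, 10.0, 45.0

x = rng.uniform(5, 30, size=40)        # weekly income, in $100 units
e = rng.normal(0, sigma, size=40)      # random component: everything else
y = beta1 + beta2 * x + e              # observed food expenditure

systematic = beta1 + beta2 * x         # the part of y explained by x
# Every observation splits exactly into the two components:
print(np.allclose(y, systematic + e))
```

At the same income level, households differ only through \(e_i\), which is exactly the heterogeneity the scatter plot shows.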

6.3 The Conditional Mean and the Regression Function

At any income level \(x\), there is a distribution of food expenditure values across households. Some spend more, some less. The center of this distribution is \(E(y \mid x)\): the conditional mean of food expenditure given income \(x\).

Suppose we assume the error term has zero conditional mean: \(E(e_i \mid x_i) = 0\). Taking the conditional expectation of both sides of Equation 6.1, and noting that \(\beta_1\) and \(\beta_2\) are constants while \(x_i\) is fixed once we condition on it, we get \(E(y_i \mid x_i) = \beta_1 + \beta_2 x_i + E(e_i \mid x_i)\), and the last term vanishes:

Theorem 6.1 (Population Regression Function) \[ E(y_i \mid x_i) = \beta_1 + \beta_2 x_i \tag{6.2}\]

The average value of \(y\) given \(x\) is a linear function of \(x\).

The slope \(\beta_2 = \Delta E(y \mid x) / \Delta x\) is the change in expected food expenditure per unit change in \(x\) (here, per $100 increase in weekly income). The intercept \(\beta_1\) is the expected value of \(y\) when \(x = 0\); it is often just a mathematical anchor for the line, not an economically meaningful quantity.

Conditional vs unconditional: \(E(y)\) averages over all households. \(E(y \mid x = 20)\) averages only over households with income $2,000/week. The regression line traces out \(E(y \mid x)\) as \(x\) varies.
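The distinction can be checked by simulation. In the sketch below (hypothetical parameter values, same as before), the unconditional mean averages over all incomes, while restricting to households near \(x = 20\) recovers the conditional mean \(E(y \mid x = 20) = 83 + 10 \cdot 20 = 283\).

```python
# Sketch: conditional vs unconditional means in a simulated population
# satisfying E(y | x) = beta1 + beta2*x. Parameter values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2 = 83.0, 10.0
x = rng.uniform(5, 30, size=200_000)
y = beta1 + beta2 * x + rng.normal(0, 45.0, size=x.size)

print(y.mean())               # E(y): averages over households at all incomes
near_20 = np.abs(x - 20) < 0.5
print(y[near_20].mean())      # ~ E(y | x = 20) = 83 + 10*20 = 283
```

As the window around \(x = 20\) narrows (with more data), the subgroup average converges to the point on the population regression line.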

6.4 Assumptions SR1 Through SR6

The model \(y_i = \beta_1 + \beta_2 x_i + e_i\) is just notation. Without assumptions about \(e_i\) and \(x_i\), we cannot guarantee that our estimates are correct on average, calculate how precise they are, or construct confidence intervals. We state six assumptions:

Table 6.1: Assumptions of the simple linear regression model.

| Label | Name | Statement |
|-------|------|-----------|
| SR1 | Linear model | \(y_i = \beta_1 + \beta_2 x_i + e_i\) |
| SR2 | Zero conditional mean | \(E(e_i \mid x_i) = 0\) |
| SR3 | Homoskedasticity | \(\operatorname{Var}(e_i \mid x_i) = \sigma^2\) (constant) |
| SR4 | Uncorrelated errors | \(\operatorname{Cov}(e_i, e_j \mid x_i, x_j) = 0\) for \(i \neq j\) |
| SR5 | \(x\) varies | \(x_i\) takes at least two distinct values |
| SR6 | Normality (optional) | \(e_i \mid x_i \sim N(0, \sigma^2)\) |
Warning: SR2 is the most consequential assumption

It says that knowing the value of income tells you nothing about the average error. If SR2 fails (for example, because an omitted variable like “ability” is correlated with education in a wage regression), then OLS is biased: on average, the estimated slope does not equal the true slope. This bias does not disappear as the sample grows; OLS is also inconsistent.

SR3 (homoskedasticity) says the spread of the error is the same at every income level. If it fails (higher-income households have more variable food spending), OLS is still unbiased, but standard errors are wrong, making confidence intervals and hypothesis tests unreliable.

SR4 (uncorrelated errors) says knowing that household \(i\) spent unusually much on food tells you nothing about household \(j\). This fails in time-series data (serial correlation) and spatial/clustered data.

SR5 (variation in \(x\)) is a technical requirement: to estimate a slope, you need at least two distinct \(x\) values. More variation in \(x\) is better, because it pins down the slope more precisely.

SR6 (normality) is optional. The error \(e_i\) is the sum of many small, unrelated factors, so the CLT (Theorem 5.2) suggests such sums tend toward a normal distribution. When \(N\) is large enough, the CLT also makes the estimators approximately normal without SR6.
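The CLT argument behind SR6 can be illustrated directly. In this sketch, each error is the sum of 50 hypothetical small shocks, each uniformly distributed (so decidedly non-normal on its own); the sum nonetheless looks close to normal.

```python
# Sketch of the CLT reasoning behind SR6: a sum of many small, unrelated,
# non-normal shocks is approximately normal. The 50 uniform shocks are
# purely illustrative, not part of the model.
import numpy as np

rng = np.random.default_rng(1)
shocks = rng.uniform(-1, 1, size=(100_000, 50))   # 50 small non-normal factors
e = shocks.sum(axis=1)                            # one error per "household"

print(round(e.mean(), 2))    # ~ 0
print(round(e.std(), 2))     # ~ sqrt(50/3), about 4.08

# Skewness ~ 0 and excess kurtosis ~ 0 are the normality benchmarks:
z = (e - e.mean()) / e.std()
print(round((z**3).mean(), 2), round((z**4).mean() - 3, 2))
```

Both moments sit near their normal-distribution values, even though each individual shock is uniform.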

Interactive: assumption violation visualizer

Select an assumption to see what a scatter plot looks like when that assumption is violated.

viewof assumption = Inputs.radio(
  ["SR1: Linearity", "SR2: Zero conditional mean", "SR3: Homoskedasticity", "SR4: Uncorrelated errors", "SR5: No x variation", "SR6: Normality"],
  {label: "Violated assumption", value: "SR3: Homoskedasticity"}
)

assumption_data = {
  const rng = d3.randomLcg(77);
  const normal = d3.randomNormal.source(rng)(0, 1);
  const n = 120;
  const data = [];

  for (let i = 0; i < n; i++) {
    let x, y;
    if (assumption === "SR1: Linearity") {
      x = 1 + 9 * rng();
      y = 20 + 3 * x - 0.4 * x * x + 15 * normal();
    } else if (assumption === "SR2: Zero conditional mean") {
      x = 1 + 9 * rng();
      const omitted = 0.5 * x + 3 * normal();
      y = 20 + 2 * x + 3 * omitted + 10 * normal();
    } else if (assumption === "SR3: Homoskedasticity") {
      x = 1 + 9 * rng();
      y = 20 + 5 * x + (2 * x) * normal();  // fan shape
    } else if (assumption === "SR4: Uncorrelated errors") {
      x = 1 + 9 * (i / n);
      const e_prev = i > 0 ? data[i - 1].y - (20 + 5 * data[i - 1].x) : 0;
      y = 20 + 5 * x + 0.8 * e_prev + 8 * normal();  // serial correlation
    } else if (assumption === "SR5: No x variation") {
      x = 5 + 0.01 * normal();
      y = 20 + 5 * x + 15 * normal();
    } else {
      x = 1 + 9 * rng();
      // heavy-tailed errors (t-distribution approximation)
      const u = normal();
      const heavyTail = u * (1 + 0.5 * Math.abs(u));
      y = 20 + 5 * x + 15 * heavyTail;
    }
    data.push({x, y});
  }
  return data;
}

Plot.plot({
  width: 640,
  height: 400,
  x: {label: "x", domain: [0, 11]},
  y: {label: "y"},
  marks: [
    Plot.dot(assumption_data, {x: "x", y: "y", fill: "#1E5A96", opacity: 0.5, r: 3.5}),
    Plot.linearRegressionY(assumption_data, {x: "x", y: "y", stroke: "#C41E3A", strokeWidth: 2}),
    Plot.ruleY([0])
  ],
  caption: `Showing: ${assumption} violated`
})
Figure 6.1: What regression data look like when different assumptions fail. Select an assumption to see the violation.

SR3 (heteroskedasticity): Look for the “fan shape” where the spread of points widens as \(x\) increases. This is the most common violation in cross-section data.

flowchart TD
    A["SR2 fails:<br/>E(e|x) ≠ 0"] --> B["OLS is BIASED<br/>and INCONSISTENT"]
    C["SR3 fails:<br/>Heteroskedasticity"] --> D["OLS is unbiased, but<br/>standard errors are WRONG"]
    E["SR4 fails:<br/>Correlated errors"] --> D
    F["SR6 fails:<br/>Non-normal errors"] --> G["Small-sample inference<br/>is unreliable<br/>(CLT rescues large samples)"]

    style A fill:#C41E3A,color:#fff
    style B fill:#C41E3A,color:#fff
    style C fill:#D4A84B,color:#fff
    style E fill:#D4A84B,color:#fff
    style D fill:#D4A84B,color:#fff
    style F fill:#888,color:#fff
    style G fill:#888,color:#fff
Figure 6.2: What goes wrong when each assumption fails.

6.5 Parameters vs Estimates

Population parameters (\(\beta_1, \beta_2, \sigma^2\)) are fixed but unknown constants. Estimates (\(b_1, b_2, \hat{\sigma}^2\)) are numbers we compute from a particular sample; different samples give different estimates. Estimators (\(b_1, b_2\) as formulas, not numbers) are random variables whose properties (bias, variance) we can study. Each sample produces a different fitted line \(\hat{y} = b_1 + b_2 x\), and no single estimate equals the true parameter exactly.

Parameters vs estimates vs estimators: \(\beta_2\) = population slope (fixed, unknown). \(b_2 = 10.21\) = one sample’s estimate (fixed, known). \(b_2\) as a formula = estimator (random variable, has a distribution).
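The three-way distinction can be seen by repeated sampling. The sketch below assumes a hypothetical population with true slope \(\beta_2 = 10\); each simulated sample of 40 households yields a different estimate \(b_2\), and the estimates cluster around the truth.

```python
# Sketch: b2 as a formula is an estimator (a random variable); each sample
# gives a different estimate. The population (beta2 = 10) is hypothetical.
import numpy as np

rng = np.random.default_rng(7)
beta1, beta2, sigma, n = 83.0, 10.0, 45.0, 40

def ols_slope(x, y):
    # b2 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

estimates = []
for _ in range(2000):
    x = rng.uniform(5, 30, size=n)
    y = beta1 + beta2 * x + rng.normal(0, sigma, size=n)
    estimates.append(ols_slope(x, y))

estimates = np.array(estimates)
print(round(estimates.mean(), 2))   # near the true slope of 10
print(round(estimates.std(), 2))    # sampling variability across samples
```

No single estimate equals 10 exactly, but their average is close to it: this is the unbiasedness property developed in the next chapter.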

\(\implies\) The central question of the next several chapters: how close are our estimates to the truth, and how can we quantify that uncertainty?

6.6 Practice

A researcher estimates \(\text{WAGE}_i = \beta_1 + \beta_2 \, \text{EDUC}_i + e_i\). The error \(e_i\) contains ability, motivation, and family connections. Is SR2 (\(E(e_i \mid x_i) = 0\)) plausible? Why or why not?

SR2 requires that knowing a person’s education level tells you nothing about their average unobserved ability. But people with higher ability tend to get more education, so the error (which contains ability) is likely positive for high-education individuals: \(E(e_i \mid \text{EDUC}_i) > 0\) when EDUC is high. \(\implies\) SR2 fails. The OLS slope \(b_2\) absorbs the effect of ability and overestimates the true return to education. This is omitted variable bias.
