8  Properties of OLS and Gauss-Markov

Why Should We Trust These Estimates?

Regression
OLS
Gauss-Markov
Author

Jake Anderson

Published

March 21, 2026

Modified

March 26, 2026

Abstract

Different samples give different OLS estimates. Is OLS systematically correct on average? How much does the slope bounce around? And is there a better linear estimator? This chapter answers all three questions: OLS is unbiased under SR1–SR2, its variance depends on noise, x-spread, and sample size, and the Gauss-Markov theorem proves OLS has the smallest variance among all linear unbiased estimators.

Note: Prerequisites

You should be familiar with OLS estimation from Chapter 6 and the regression assumptions from Chapter 5.

8.1 Sampling Variation

We estimated \(b_2 = 10.21\) from 40 Australian households. If we surveyed a different group of 40 households, we would almost certainly get a different number. The estimate depends on the specific households in the sample. \(\implies\) \(b_2\) is a random variable with a distribution, a mean, and a variance.

8.2 \(b_2\) as a Weighted Sum

To study the properties of \(b_2\), rewrite it in a form where we can take expectations. Define weights \(w_i = (x_i - \bar{x}) / \sum(x_j - \bar{x})^2\), which depend only on the \(x\)-values (treated as fixed). Then \(b_2 = \sum w_i y_i\): a weighted sum of the \(y_i\) values. This makes \(b_2\) a linear estimator. The weights satisfy two properties: \(\sum w_i = 0\) and \(\sum w_i x_i = 1\).

Linear estimator: \(b_2 = \sum w_i y_i\) is a linear function of the data. The weights \(w_i\) depend on the \(x\)-values, which are treated as fixed. All the randomness in \(b_2\) comes from the \(y_i\)’s (and therefore the \(e_i\)’s).
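The two weight properties are easy to check numerically. A minimal sketch in plain JavaScript (the numbers below are illustrative, not the chapter's household data):

```javascript
// Illustrative data (not the Australian household sample).
const x = [2, 4, 6, 8, 10];
const y = [5, 9, 14, 18, 21];

const xbar = x.reduce((a, b) => a + b, 0) / x.length;
const ssx = x.reduce((a, xi) => a + (xi - xbar) ** 2, 0);

// OLS weights: w_i = (x_i - xbar) / sum((x_j - xbar)^2)
const w = x.map(xi => (xi - xbar) / ssx);

const sumW = w.reduce((a, b) => a + b, 0);               // property 1: 0
const sumWx = w.reduce((a, wi, i) => a + wi * x[i], 0);  // property 2: 1

// b2 as a weighted sum of y, versus the usual covariance/variance formula
const b2weighted = w.reduce((a, wi, i) => a + wi * y[i], 0);
const ybar = y.reduce((a, b) => a + b, 0) / y.length;
const b2formula =
  x.reduce((a, xi, i) => a + (xi - xbar) * (y[i] - ybar), 0) / ssx;

console.log(sumW, sumWx);           // ~0 and ~1 (up to floating point)
console.log(b2weighted, b2formula); // identical
```

The two ways of computing \(b_2\) agree exactly: the weighted sum is just an algebraic rearrangement of the familiar OLS formula.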

Substituting the model \(y_i = \beta_1 + \beta_2 x_i + e_i\) into \(b_2 = \sum w_i y_i\) and using the weight properties:

\[ b_2 = \beta_2 + \sum_{i=1}^{N} w_i e_i \tag{8.1}\]

The estimator equals the true parameter plus a weighted sum of random errors. All the randomness in \(b_2\) comes from the \(e_i\)’s.

8.3 Unbiasedness: \(E(b_2) = \beta_2\)

Take the expected value of Equation 8.1:

Theorem 8.1 (Unbiasedness of OLS) \[ E(b_2) = \beta_2 + \sum_{i=1}^{N} w_i \, E(e_i) = \beta_2 + 0 = \beta_2 \tag{8.2}\]

Under SR1 and SR2, the OLS slope estimator is unbiased: on average, across all possible samples, \(b_2\) hits the true slope \(\beta_2\).

The last step uses SR2: \(E(e_i) = 0\) for every \(i\). The same argument shows \(E(b_1) = \beta_1\).

Unbiasedness means the procedure is correct on average; it does not mean any single estimate is close to \(\beta_2\). It is a property of the procedure, not of any one number.

Warning: When does unbiasedness fail?

If SR2 fails (for example, because an important variable correlated with \(x\) is omitted), then \(E(e_i \mid x_i) \neq 0\) and \(E(b_2) \neq \beta_2\). This is omitted variable bias. The OLS estimator systematically overshoots or undershoots the true slope.

Proof of Theorem 8.1. Start from \(b_2 = \sum w_i y_i\), where \(w_i = (x_i - \bar{x}) / \sum(x_j - \bar{x})^2\).

Substitute \(y_i = \beta_1 + \beta_2 x_i + e_i\): \[b_2 = \sum w_i (\beta_1 + \beta_2 x_i + e_i) = \beta_1 \sum w_i + \beta_2 \sum w_i x_i + \sum w_i e_i\]

Using \(\sum w_i = 0\) and \(\sum w_i x_i = 1\): \[b_2 = \beta_2 + \sum w_i e_i\]

Take expectations (treating \(x_i\) and therefore \(w_i\) as fixed): \[E(b_2) = \beta_2 + \sum w_i E(e_i) = \beta_2 + 0 = \beta_2\]

where the last equality uses SR2: \(E(e_i) = 0\). \(\square\)
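Unbiasedness is a statement about the average across repeated samples, so it can be checked by simulation. A sketch in plain JavaScript with a seeded generator for reproducibility; the parameter values (\(\beta_1 = 83\), \(\beta_2 = 10\), \(\sigma = 5\), a fixed grid of 20 \(x\)-values) are illustrative choices:

```javascript
// Seeded LCG so the run is reproducible (illustrative, not cryptographic).
function makeLcg(seed) {
  let s = seed >>> 0;
  return () => (s = (1664525 * s + 1013904223) >>> 0) / 4294967296;
}
const rng = makeLcg(12345);

// Box-Muller transform: two uniforms -> one normal draw with sd = sd
const randNormal = sd => {
  const u = 1 - rng(), v = rng();
  return sd * Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
};

const beta1 = 83, beta2 = 10, sigma = 5;           // assumed true values
const x = Array.from({length: 20}, (_, i) => i + 1); // fixed across samples
const xbar = x.reduce((a, b) => a + b, 0) / x.length;
const ssx = x.reduce((a, xi) => a + (xi - xbar) ** 2, 0);

const reps = 2000;
let sumB2 = 0;
for (let r = 0; r < reps; r++) {
  // One sample from the model y_i = beta1 + beta2 * x_i + e_i
  const y = x.map(xi => beta1 + beta2 * xi + randNormal(sigma));
  const ybar = y.reduce((a, b) => a + b, 0) / y.length;
  const b2 = x.reduce((a, xi, i) => a + (xi - xbar) * (y[i] - ybar), 0) / ssx;
  sumB2 += b2;
}
const meanB2 = sumB2 / reps;
console.log(meanB2); // close to the true beta2 = 10
```

Individual estimates scatter around 10, but their average sits right at the true slope, which is exactly what \(E(b_2) = \beta_2\) claims.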

8.4 Variance of \(b_2\)

The variance of \(b_2\) tells us how much it bounces around across samples. From \(b_2 = \beta_2 + \sum w_i e_i\), using SR3 (homoskedasticity: \(\operatorname{Var}(e_i) = \sigma^2\)) and SR4 (uncorrelated errors: \(\operatorname{Cov}(e_i, e_j) = 0\)):

Theorem 8.2 (Variance of the OLS Slope) \[ \operatorname{Var}(b_2) = \frac{\sigma^2}{\sum_{i=1}^{N}(x_i - \bar{x})^2} \tag{8.3}\]

Three factors control precision:

  1. Error variance \(\sigma^2\) (numerator). More noise in the data generating process \(\implies\) harder to detect the signal \(\implies\) larger \(\operatorname{Var}(b_2)\). You cannot control \(\sigma^2\); it is a feature of the population.

  2. Spread of \(x\)-values (denominator). \(\sum(x_i - \bar{x})^2\) measures total variation in the explanatory variable. More spread in \(x\) \(\implies\) larger denominator \(\implies\) smaller \(\operatorname{Var}(b_2)\). A line is easier to estimate when the \(x\)-values span a wide range.

  3. Sample size \(N\). More observations \(\implies\) more terms in \(\sum(x_i - \bar{x})^2\) \(\implies\) the denominator grows \(\implies\) more data reduces \(\operatorname{Var}(b_2)\).

In practice: The most reliable way to improve precision is to collect more data. If you can design the study, spreading the \(x\)-values over a wide range also helps.
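Equation 8.3 can also be checked by simulation: the variance of \(b_2\) across many samples should match \(\sigma^2 / \sum(x_i - \bar{x})^2\). A reproducible sketch with illustrative parameter values:

```javascript
// Seeded LCG + Box-Muller, as before, so the run is reproducible.
function makeLcg(seed) {
  let s = seed >>> 0;
  return () => (s = (1664525 * s + 1013904223) >>> 0) / 4294967296;
}
const rng = makeLcg(2024);
const randNormal = sd => {
  const u = 1 - rng(), v = rng();
  return sd * Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
};

const beta1 = 83, beta2 = 10, sigma = 5;             // illustrative values
const x = Array.from({length: 20}, (_, i) => i + 1); // fixed across samples
const xbar = x.reduce((a, b) => a + b, 0) / x.length;
const ssx = x.reduce((a, xi) => a + (xi - xbar) ** 2, 0);

// Theoretical variance from Equation 8.3
const varTheory = sigma ** 2 / ssx;

const reps = 4000;
let sum = 0, sumSq = 0;
for (let r = 0; r < reps; r++) {
  const y = x.map(xi => beta1 + beta2 * xi + randNormal(sigma));
  const ybar = y.reduce((a, b) => a + b, 0) / y.length;
  const b2 = x.reduce((a, xi, i) => a + (xi - xbar) * (y[i] - ybar), 0) / ssx;
  sum += b2;
  sumSq += b2 * b2;
}
// Empirical variance of b2 across the simulated samples
const varEmp = sumSq / reps - (sum / reps) ** 2;
console.log(varTheory, varEmp); // should agree closely
```

The empirical and theoretical variances agree to within simulation noise; doubling \(\sigma\) quadruples both, and widening the \(x\)-grid shrinks both, just as the formula says.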

Interactive: repeated sampling simulator

This widget simulates the repeated sampling process. Each “draw” generates a new random sample of size \(N\), estimates the slope \(b_2\), and adds it to the histogram. The histogram converges to the sampling distribution of \(b_2\).

viewof rs_n = Inputs.range([20, 500], {value: 40, step: 10, label: "Sample size N"})
viewof rs_sigma = Inputs.range([10, 200], {value: 80, step: 5, label: "Error std dev σ"})
viewof rs_draw = Inputs.button("Draw a new sample", {label: ""})

rs_true_beta = 10

rs_estimates = {
  rs_draw;  // react to button
  const rng = d3.randomLcg(Date.now());
  const normal = d3.randomNormal.source(rng)(0, rs_sigma);
  const ndraws = 500;
  const estimates = [];

  for (let r = 0; r < ndraws; r++) {
    const data = [];
    for (let i = 0; i < rs_n; i++) {
      const x = 5 + 30 * rng();
      const y = 83 + rs_true_beta * x + normal();
      data.push({x, y});
    }
    const xbar = d3.mean(data, d => d.x);
    const ybar = d3.mean(data, d => d.y);
    const num = d3.sum(data, d => (d.x - xbar) * (d.y - ybar));
    const den = d3.sum(data, d => (d.x - xbar) ** 2);
    estimates.push({b2: num / den});
  }
  return estimates;
}

rs_stats = {
  const vals = rs_estimates.map(d => d.b2);
  return {
    mean: d3.mean(vals).toFixed(2),
    sd: d3.deviation(vals).toFixed(2)
  };
}

rs_histogram = {
  const vals = rs_estimates.map(d => d.b2);
  const lo = d3.min(vals);
  const hi = d3.max(vals);
  const nBins = 40;
  const binWidth = (hi - lo) / nBins;
  const bins = d3.bin().domain([lo, hi]).thresholds(nBins)(vals);
  return bins.map(b => ({
    x0: b.x0,
    x1: b.x1,
    density: b.length / (vals.length * binWidth)
  }));
}

Plot.plot({
  width: 640,
  height: 380,
  x: {label: "Estimated slope b₂"},
  y: {label: "Density"},
  marks: [
    Plot.rectY(rs_histogram, {x1: "x0", x2: "x1", y: "density", fill: "#1E5A96", fillOpacity: 0.5}),
    Plot.ruleX([rs_true_beta], {stroke: "#C41E3A", strokeWidth: 2.5, strokeDasharray: "6 3"}),
    Plot.ruleY([0]),
    Plot.text([`Mean = ${rs_stats.mean}, SD = ${rs_stats.sd}`], {x: rs_true_beta, y: 0, dy: -10, fill: "#1E5A96", fontWeight: "bold", fontSize: 12, textAnchor: "middle"})
  ],
  caption: `500 samples of N = ${rs_n} with σ = ${rs_sigma}. Red dashed line = true β₂ = ${rs_true_beta}.`
})
Figure 8.1: Repeated sampling simulator. Each draw generates a new sample and estimates b₂. The histogram builds up the sampling distribution, centered at the true β₂ = 10.

Try it: Increase \(N\) from 40 to 200 and watch the histogram narrow. Then increase \(\sigma\) from 80 to 200 and watch it widen. These are the two forces in Equation 8.3.

8.5 The Gauss-Markov Theorem

OLS is not the only linear unbiased estimator. For example, you could estimate the slope from just the two observations with the smallest and largest \(x\)-values. That estimator is linear and unbiased, but far less precise. Can we prove OLS is always tighter?

Theorem 8.3 (Gauss-Markov Theorem) Under assumptions SR1 through SR4, the OLS estimators \(b_1\) and \(b_2\) have the smallest variance of all linear and unbiased estimators of \(\beta_1\) and \(\beta_2\).

OLS is the Best Linear Unbiased Estimator: BLUE. “Best” means smallest variance (most precise). “Linear” means a weighted sum of the \(y_i\)’s. “Unbiased” means \(E(b_2) = \beta_2\).

Write any alternative linear unbiased estimator as \(\tilde{b}_2 = \sum (w_i + d_i) y_i\) where \(d_i\) is the departure from OLS weights. For unbiasedness, \(\sum d_i = 0\) and \(\sum d_i x_i = 0\). The variance of the alternative is:

\[\operatorname{Var}(\tilde{b}_2) = \sigma^2 \sum (w_i + d_i)^2 = \sigma^2 \sum w_i^2 + \sigma^2 \sum d_i^2 + 2\sigma^2 \sum w_i d_i\]

The cross-term \(\sum w_i d_i = 0\) (using the unbiasedness conditions on \(d_i\)). So:

\[\operatorname{Var}(\tilde{b}_2) = \operatorname{Var}(b_2) + \sigma^2 \sum d_i^2 \ge \operatorname{Var}(b_2)\]

Any departure from OLS weights adds variance. \(\square\)

\(\implies\) You cannot do better than OLS without either (a) giving up linearity, (b) accepting bias, or (c) violating one of SR1 through SR4. If SR3 fails (heteroskedasticity), OLS is no longer BLUE; Generalized Least Squares (GLS) is better.
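The endpoint estimator mentioned above makes the theorem concrete. It uses weights \(-1/(x_N - x_1)\) on the first observation and \(1/(x_N - x_1)\) on the last (zero elsewhere), which satisfy \(\sum w_i = 0\) and \(\sum w_i x_i = 1\), so it is unbiased; its variance is \(\sigma^2 \sum w_i^2 = 2\sigma^2/(x_N - x_1)^2\). A quick comparison against Equation 8.3, using an illustrative \(x\)-grid and error variance:

```javascript
const sigma2 = 25; // illustrative error variance
const x = Array.from({length: 20}, (_, i) => i + 1); // x = 1, ..., 20, sorted

const xbar = x.reduce((a, b) => a + b, 0) / x.length;
const ssx = x.reduce((a, xi) => a + (xi - xbar) ** 2, 0);

// OLS: Var(b2) = sigma^2 / sum((x_i - xbar)^2)   (Equation 8.3)
const varOLS = sigma2 / ssx;

// Endpoint estimator: Var = sigma^2 * sum(w_i^2) = 2 sigma^2 / (xN - x1)^2
const range = x[x.length - 1] - x[0];
const varEndpoint = 2 * sigma2 / range ** 2;

console.log(varOLS.toFixed(4));               // 0.0376
console.log(varEndpoint.toFixed(4));          // 0.1385
console.log((varEndpoint / varOLS).toFixed(2)); // 3.68
```

Throwing away the middle 18 observations more than triples the variance, even though both estimators are linear and unbiased. This is Gauss-Markov at work: any departure \(d_i\) from the OLS weights adds \(\sigma^2 \sum d_i^2\) to the variance.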

flowchart TD
    A["SR1: Linear model"] --> B["b₂ = β₂ + Σwᵢeᵢ<br/>(decomposition)"]
    A --> C
    C["SR1 + SR2"] --> D["E(b₂) = β₂<br/>(unbiased)"]
    C --> E
    E["SR1−SR4"] --> F["Var(b₂) = σ²/Σ(xᵢ−x̄)²"]
    E --> G["OLS is BLUE<br/>(Gauss-Markov)"]

    style A fill:#1E5A96,color:#fff
    style B fill:#1E5A96,color:#fff
    style C fill:#D4A84B,color:#fff
    style D fill:#D4A84B,color:#fff
    style E fill:#2E8B57,color:#fff
    style F fill:#2E8B57,color:#fff
    style G fill:#2E8B57,color:#fff
Figure 8.2: Building up OLS properties: each result requires additional assumptions.
Summary of OLS properties.

| Property | Result | Requires |
|---|---|---|
| Linearity | \(b_2 = \sum w_i y_i\) | Definition of OLS |
| Decomposition | \(b_2 = \beta_2 + \sum w_i e_i\) | SR1 |
| Unbiasedness | \(E(b_2) = \beta_2\) | SR1, SR2 |
| Variance formula | \(\operatorname{Var}(b_2) = \sigma^2 / \sum(x_i - \bar{x})^2\) | SR1–SR4 |
| BLUE | Smallest variance among linear unbiased estimators | SR1–SR4 |

8.6 Practice

Suppose \(\sigma^2 = 8{,}000\) and the \(x\)-values are uniformly spaced so that \(\sum(x_i - \bar{x})^2 \approx 50 \cdot N\) for a sample of size \(N\). Compute \(\operatorname{Var}(b_2)\) for \(N = 40\) and \(N = 200\). By what factor does precision improve?

For \(N = 40\): \(\operatorname{Var}(b_2) = 8{,}000 / (50 \times 40) = 8{,}000 / 2{,}000 = 4.0\). For \(N = 200\): \(\operatorname{Var}(b_2) = 8{,}000 / (50 \times 200) = 8{,}000 / 10{,}000 = 0.8\). The variance drops by a factor of 5 (the ratio of sample sizes), so the standard deviation drops by \(\sqrt{5} \approx 2.24\). Quintupling the sample size cuts the standard error by a factor of about 2.2, a little better than halving it.
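The same calculation as code, plugging the exercise's numbers into Equation 8.3:

```javascript
const sigma2 = 8000;                   // error variance from the exercise
const varB2 = N => sigma2 / (50 * N);  // ssx ≈ 50 * N per the exercise

const v40 = varB2(40);                 // 4
const v200 = varB2(200);               // 0.8
const sdRatio = Math.sqrt(v40 / v200); // sqrt(5) ≈ 2.236

console.log(v40, v200, sdRatio);
```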
