12  Prediction and Goodness of Fit

How Precise Are Our Predictions, and How Well Does the Model Fit?

Prediction
Goodness of Fit
Simple Regression
Author

Jake Anderson

Published

March 21, 2026

Modified

March 26, 2026

Abstract

A point prediction is incomplete without a measure of uncertainty. This chapter develops prediction intervals for individual outcomes, introduces the sum of squares decomposition SST = SSR + SSE, defines the coefficient of determination \(R^2\), and explains why statistical conclusions are invariant to data scaling.

12.1 Prediction Intervals vs. Confidence Intervals

In earlier chapters we built confidence intervals for parameters like \(\beta_1\) and \(\beta_2\). Those intervals describe where the true population parameter sits. Now we ask a different question: if a household earns $2,000/week, how much will it actually spend on food?

The point prediction is straightforward: plug \(x_0\) into the fitted line to get \(\hat{y}_0 = b_1 + b_2 x_0\). For the food expenditure data, this gives \(\hat{y}_0 = 83.42 + 10.21 \times 20 = 287.62\). But two sources of uncertainty surround that number. First, we estimated the regression line from a sample, so the line itself is imprecise. Second, even if we knew the true line perfectly, any single household scatters around it by an individual error \(e_0\).

Two sources of uncertainty: (1) estimation error from sampling the regression line, and (2) individual noise from the person-specific error term \(e_0\).

The variance of the forecast error captures both components:

Theorem 12.1 (Forecast Error Variance) For a prediction at \(x_0\), the variance of the forecast error is:

\[ \text{Var}(f) = \sigma^2 \left[\underbrace{1}_{\text{individual error}} + \underbrace{\frac{1}{N} + \frac{(x_0 - \bar{x})^2}{\sum(x_i - \bar{x})^2}}_{\text{estimation uncertainty}}\right] \tag{12.1}\]

Compare this to the variance used for a confidence interval for the conditional mean \(E(y \mid x_0)\), which omits the leading “1.” That extra \(\sigma^2\) is the irreducible noise from individual behavior; it does not shrink with more data.
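The comparison can be made concrete with a small Python sketch (the chapter's interactive figures use Observable JS; plain Python is used here for a static check). It evaluates both variances at a few prediction points, using the food expenditure quantities reported in the practice exercise in Section 12.5 (\(\hat{\sigma}^2 = 8013\), \(N = 40\), \(\bar{x} = 19.60\), \(\sum(x_i - \bar{x})^2 = 1462\)):

```python
import math

# Quantities from the practice exercise (Section 12.5)
sigma2, N, xbar, Sxx = 8013.0, 40, 19.60, 1462.0

def se_mean(x0):
    """Std. error for the CI of E(y | x0): estimation uncertainty only."""
    return math.sqrt(sigma2 * (1/N + (x0 - xbar)**2 / Sxx))

def se_forecast(x0):
    """Std. error of the forecast: adds the leading 1 for individual noise."""
    return math.sqrt(sigma2 * (1 + 1/N + (x0 - xbar)**2 / Sxx))

# Evaluate at the sample mean, at x0 = 20, and far from the mean
for x0 in (19.60, 20.0, 30.0):
    print(x0, round(se_mean(x0), 2), round(se_forecast(x0), 2))
```

At every \(x_0\) the forecast standard error dominates, and moving \(x_0\) away from \(\bar{x} = 19.60\) widens both.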

\(\implies\) Prediction intervals are always wider than confidence intervals for the mean, and they never collapse to a point.

As the sample grows, the estimation-uncertainty terms shrink toward zero, but the individual noise remains. The prediction interval therefore approaches \(\hat{y}_0 \pm t_c \sigma\), no matter how large \(N\) becomes.

Note: Predictions are most precise near \(\bar{x}\)

Both intervals are narrowest at \(x_0 = \bar{x}\) and widen as \(x_0\) moves away from the center of the data. The term \((x_0 - \bar{x})^2 / \sum(x_i - \bar{x})^2\) drives this: predicting far from the sample mean is less reliable.

Interactive: Prediction Interval vs. Confidence Interval

Use the slider to move the prediction point \(x_0\). Observe how both the confidence interval for \(E(y \mid x_0)\) and the prediction interval widen as \(x_0\) moves away from \(\bar{x}\), and notice that the prediction interval is always wider. The \(R^2\) gauge shows the fraction of variation explained by the model.

viewof x0_pred = Inputs.range([5, 35], {value: 20, step: 0.5, label: "Prediction point x₀"})

prediction_data = {
  // Simulated food expenditure data
  const rng = d3.randomLcg(42);
  const rnorm = d3.randomNormal.source(rng)(0, 90);
  const N = 40;
  const b1_true = 83.42, b2_true = 10.21;
  const xs = Array.from({length: N}, () => 5 + rng() * 30);
  const ys = xs.map(x => b1_true + b2_true * x + rnorm());

  const xbar = d3.mean(xs);
  const ybar = d3.mean(ys);
  const Sxx = d3.sum(xs.map(x => (x - xbar) ** 2));
  const Sxy = d3.sum(xs.map((x, i) => (x - xbar) * (ys[i] - ybar)));
  const b2 = Sxy / Sxx;
  const b1 = ybar - b2 * xbar;

  const residuals = xs.map((x, i) => ys[i] - (b1 + b2 * x));
  const sigmaHat2 = d3.sum(residuals.map(r => r * r)) / (N - 2);
  const SST = d3.sum(ys.map(y => (y - ybar) ** 2));
  const SSE = d3.sum(residuals.map(r => r * r));
  const R2 = 1 - SSE / SST;

  const tc = 2.024; // t(0.975, 38)

  // Generate bands across x range
  const xRange = d3.range(3, 37, 0.5);
  const bands = xRange.map(x => {
    const varCI = sigmaHat2 * (1/N + (x - xbar)**2 / Sxx);
    const varPI = sigmaHat2 * (1 + 1/N + (x - xbar)**2 / Sxx);
    const yhat = b1 + b2 * x;
    return {
      x: x,
      yhat: yhat,
      ci_lo: yhat - tc * Math.sqrt(varCI),
      ci_hi: yhat + tc * Math.sqrt(varCI),
      pi_lo: yhat - tc * Math.sqrt(varPI),
      pi_hi: yhat + tc * Math.sqrt(varPI)
    };
  });

  const yhat0 = b1 + b2 * x0_pred;
  const varCI0 = sigmaHat2 * (1/N + (x0_pred - xbar)**2 / Sxx);
  const varPI0 = sigmaHat2 * (1 + 1/N + (x0_pred - xbar)**2 / Sxx);

  return {
    xs, ys, b1, b2, xbar, bands, R2, yhat0,
    ci_width: 2 * tc * Math.sqrt(varCI0),
    pi_width: 2 * tc * Math.sqrt(varPI0),
    se_f: Math.sqrt(varPI0)
  };
}

Plot.plot({
  width: 700,
  height: 420,
  marginLeft: 60,
  x: {label: "x (weekly income, $100s)", domain: [3, 37]},
  y: {label: "y (food expenditure, $)", domain: [-50, 550]},
  marks: [
    Plot.areaY(prediction_data.bands, {x: "x", y1: "pi_lo", y2: "pi_hi", fill: "#1E5A96", fillOpacity: 0.1}),
    Plot.areaY(prediction_data.bands, {x: "x", y1: "ci_lo", y2: "ci_hi", fill: "#1E5A96", fillOpacity: 0.25}),
    Plot.line(prediction_data.bands, {x: "x", y: "yhat", stroke: "#1E5A96", strokeWidth: 2}),
    Plot.dot(prediction_data.xs.map((x, i) => ({x, y: prediction_data.ys[i]})), {x: "x", y: "y", r: 3, fill: "#666"}),
    Plot.ruleX([x0_pred], {stroke: "red", strokeDasharray: "4,4", strokeWidth: 1.5}),
    Plot.dot([{x: x0_pred, y: prediction_data.yhat0}], {x: "x", y: "y", r: 6, fill: "red", symbol: "diamond"}),
    Plot.ruleX([prediction_data.xbar], {stroke: "#2E8B57", strokeDasharray: "2,6", strokeWidth: 1})
  ]
})
html`<div style="display:flex; gap:2em; flex-wrap:wrap; margin-top:0.5em">
  <div><strong>ŷ₀ = ${prediction_data.yhat0.toFixed(2)}</strong></div>
  <div>CI width: <span style="color:#1E5A96">${prediction_data.ci_width.toFixed(2)}</span></div>
  <div>PI width: <span style="color:#C41E3A">${prediction_data.pi_width.toFixed(2)}</span></div>
  <div>se(f) = ${prediction_data.se_f.toFixed(2)}</div>
  <div>R² = <strong>${(prediction_data.R2 * 100).toFixed(1)}%</strong></div>
  <div style="color:#2E8B57">x̄ = ${prediction_data.xbar.toFixed(1)}</div>
</div>`
Figure 12.1: Prediction interval (outer band) vs. confidence interval for the mean (inner band). Both widen as x₀ moves away from x̄. The R² gauge shows overall model fit.

12.2 The Sum of Squares Decomposition

How much of the variation in \(y\) does the model actually explain? To answer this, decompose each observation’s deviation from the mean:

\[ \underbrace{y_i - \bar{y}}_{\text{total}} = \underbrace{(\hat{y}_i - \bar{y})}_{\text{explained}} + \underbrace{\hat{e}_i}_{\text{unexplained}} \]

Squaring and summing over all observations yields the fundamental identity:

Theorem 12.2 (Sum of Squares Decomposition) \[ \underbrace{\sum(y_i - \bar{y})^2}_{SST} = \underbrace{\sum(\hat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum \hat{e}_i^2}_{SSE} \tag{12.2}\]

The Total Sum of Squares (SST) splits into the Regression Sum of Squares (SSR, variation explained by the model) and the Error Sum of Squares (SSE, variation left unexplained). This identity requires an intercept in the model; without one, the cross-product term does not vanish.

SST = total variation in \(y\). SSR = variation explained by \(x\). SSE = variation left over. The decomposition requires an intercept.
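The identity is easy to verify numerically. Here is a minimal Python sketch with made-up illustrative data (any data set works, provided the fitted model includes an intercept):

```python
# Numerical check of SST = SSR + SSE for an OLS fit with an intercept,
# on a small made-up data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b2 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar)**2 for x in xs)
b1 = ybar - b2 * xbar
yhat = [b1 + b2 * x for x in xs]

SST = sum((y - ybar)**2 for y in ys)
SSR = sum((yh - ybar)**2 for yh in yhat)
SSE = sum((y - yh)**2 for y, yh in zip(ys, yhat))

# The cross-product term vanishes because the model has an intercept
assert abs(SST - (SSR + SSE)) < 1e-9
```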

flowchart LR
    SST["SST<br/>Total variation<br/>in y"] --> SSR["SSR<br/>Explained by<br/>the model"]
    SST --> SSE["SSE<br/>Unexplained<br/>(residuals)"]
    SSR --> R2["R² = SSR / SST"]
    SSE --> R2

    style SST fill:#1E5A96,color:#fff
    style SSR fill:#2E8B57,color:#fff
    style SSE fill:#C41E3A,color:#fff
    style R2 fill:#D4A84B,color:#fff
Figure 12.2: The sum of squares decomposition splits total variation into explained and unexplained components.

12.3 \(R^2\): The Coefficient of Determination

Definition 12.1 (Coefficient of Determination) The coefficient of determination is the fraction of total variation explained:

\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \tag{12.3}\]

It is bounded between 0 and 1. An \(R^2\) of 1 means every observation falls exactly on the fitted line (\(SSE = 0\)); an \(R^2\) of 0 means \(x\) explains none of the variation in \(y\). In simple regression, \(R^2 = r_{xy}^2\), the squared sample correlation between \(x\) and \(y\). More generally, \(R^2 = \text{Corr}(y_i, \hat{y}_i)^2\), which extends to multiple regression.
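The equivalence \(R^2 = r_{xy}^2\) can be checked directly. A short Python sketch using the same kind of small illustrative data:

```python
import math

# Check that R^2 = 1 - SSE/SST equals the squared sample correlation
# r_xy^2 in simple regression (illustrative data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar)**2 for x in xs)
Syy = sum((y - ybar)**2 for y in ys)          # Syy is SST
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b2 = Sxy / Sxx
b1 = ybar - b2 * xbar
SSE = sum((y - (b1 + b2 * x))**2 for x, y in zip(xs, ys))

R2 = 1 - SSE / Syy
r_xy = Sxy / math.sqrt(Sxx * Syy)
assert abs(R2 - r_xy**2) < 1e-9
```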

\(R^2\) does not tell you whether the model is correctly specified (a high \(R^2\) with the wrong functional form still gives misleading predictions), whether the coefficients are statistically significant (a model can have \(R^2 = 0.90\) from a time trend while the regressor of interest is insignificant), or whether the model predicts well out of sample. Evaluate a model by its coefficients, residual plots, and theory, not by \(R^2\) alone.

What counts as “good”? Cross-sectional microdata: 0.10 to 0.40. Time series: 0.60 to 0.95. The answer depends entirely on the setting.

What counts as a “good” \(R^2\) depends entirely on context. Cross-sectional microdata (wages, spending) typically produce \(R^2\) values between 0.10 and 0.40 because individual behavior is noisy. Time series data (GDP, yields) often reach 0.60 to 0.95.

| Application | Typical \(R^2\) | Why |
|---|---|---|
| Cross-section wage equations | 0.15 to 0.35 | Individual behavior is highly variable |
| Consumption functions (cross-section) | 0.05 to 0.30 | Household spending is noisy |
| Time series macro (GDP, inflation) | 0.80 to 0.98 | Strong trends dominate |
| Asset pricing (CAPM betas) | 0.30 to 0.70 | Market explains a lot, but not everything |
| Claim | Problem |
|---|---|
| “Higher \(R^2\) = better model” | A quadratic in time gives \(R^2 \approx 1\) for GDP, but tells us nothing about causes |
| “\(R^2 = 0.15\) means the model is bad” | In micro studies, 0.15 is typical and the coefficients can still be well-estimated |
| “Adding variables always improves the model” | \(R^2\) can only rise, but precision can fall; see adjusted \(R^2\) |

12.4 Effects of Data Scaling

Rescaling \(x\) by a constant \(c\) (for example, changing income from hundreds of dollars to dollars) divides the slope and its standard error by \(c\), but leaves \(t\)-statistics, \(p\)-values, and \(R^2\) unchanged. Rescaling \(y\) multiplies all coefficients and standard errors by \(c\), again leaving \(t\)-statistics and \(R^2\) untouched.

Warning: Scaling preserves all inferential conclusions

\(t\)-statistics, \(p\)-values, confidence interval coverage, \(R^2\), and \(F\)-statistics are all invariant to linear rescaling of the data. Only the units of the coefficients change. If someone asks whether your results change if you measure income in thousands instead of dollars, the answer is no.

Scaling rule: multiply \(x\) by \(c\) \(\implies\) divide \(b_2\) and \(\text{se}(b_2)\) by \(c\). Multiply \(y\) by \(c\) \(\implies\) multiply all coefficients and SEs by \(c\). Ratios (\(t\), \(F\), \(R^2\)) stay the same.

\(\implies\) Scaling is a change in units, not a change in the relationship. Statistical conclusions are invariant to the choice of units.
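A quick numerical demonstration of the scaling rules in Python (illustrative data; the `ols` helper below is written for this sketch, not part of the chapter's code):

```python
import math

def ols(xs, ys):
    """Simple regression y = b1 + b2*x: returns slope, se(slope), t, R^2."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    Sxx = sum((x - xbar)**2 for x in xs)
    Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b2 = Sxy / Sxx
    b1 = ybar - b2 * xbar
    sse = sum((y - (b1 + b2 * x))**2 for x, y in zip(xs, ys))
    sst = sum((y - ybar)**2 for y in ys)
    se_b2 = math.sqrt(sse / (n - 2) / Sxx)
    return b2, se_b2, b2 / se_b2, 1 - sse / sst

xs = [10, 12, 15, 18, 22, 25, 30]
ys = [150, 180, 210, 260, 300, 330, 400]
c = 100                                   # change of units for x

b2, se, t, R2 = ols(xs, ys)
b2c, sec, tc_, R2c = ols([x * c for x in xs], ys)

# Slope and its standard error are divided by c; t and R^2 are unchanged
assert abs(b2c - b2 / c) < 1e-8
assert abs(sec - se / c) < 1e-8
assert abs(tc_ - t) < 1e-8
assert abs(R2c - R2) < 1e-8
```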

12.5 Practice

A researcher estimates the food expenditure model (\(N = 40\)) and reports \(\hat{\sigma}^2 = 8013\), \(\bar{x} = 19.60\), and \(\sum(x_i - \bar{x})^2 = 1462\). Compute the standard error of the forecast at \(x_0 = 20\) and construct a 95% prediction interval given \(\hat{y}_0 = 287.62\) and \(t_{0.975, 38} = 2.024\).

Plug into Equation 12.1:

\[ \widehat{\text{Var}}(f) = 8013 \left[1 + \frac{1}{40} + \frac{(20 - 19.60)^2}{1462}\right] = 8013 \times 1.02511 = 8214.2 \]

\[ \text{se}(f) = \sqrt{8214.2} = 90.63 \]

The 95% prediction interval is \(287.62 \pm 2.024 \times 90.63 = [104.18, \; 471.06]\). This range is enormous: we predict somewhere between $104 and $471 for a household earning $2,000/week. The width reflects the large individual-level noise (\(\hat{\sigma}^2 = 8013\)), which no amount of data eliminates.
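The arithmetic can be reproduced in a few lines of Python:

```python
import math

# Step-by-step check of the practice calculation
sigma2, N, xbar, Sxx = 8013.0, 40, 19.60, 1462.0
x0, yhat0, tc = 20.0, 287.62, 2.024

var_f = sigma2 * (1 + 1/N + (x0 - xbar)**2 / Sxx)   # Equation 12.1
se_f = math.sqrt(var_f)
lo = yhat0 - tc * se_f
hi = yhat0 + tc * se_f
print(round(se_f, 2), round(lo, 2), round(hi, 2))
```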
