How Precise Are Our Predictions, and How Well Does the Model Fit?
Prediction
Goodness of Fit
Simple Regression
Author
Jake Anderson
Published
March 21, 2026
Modified
March 26, 2026
Abstract
A point prediction is incomplete without a measure of uncertainty. This chapter develops prediction intervals for individual outcomes, introduces the sum of squares decomposition SST = SSR + SSE, defines the coefficient of determination \(R^2\), and explains why statistical conclusions are invariant to data scaling.
12.1 Prediction Intervals vs. Confidence Intervals
In earlier chapters we built confidence intervals for parameters like \(\beta_1\) and \(\beta_2\). Those intervals describe where the true population parameter sits. Now we ask a different question: if a household earns $2,000/week, how much will it actually spend on food?
The point prediction is straightforward: plug \(x_0\) into the fitted line to get \(\hat{y}_0 = b_1 + b_2 x_0\). For the food expenditure data, this gives \(\hat{y}_0 = 83.42 + 10.21 \times 20 = 287.62\). But two sources of uncertainty surround that number. First, we estimated the regression line from a sample, so the line itself is imprecise. Second, even if we knew the true line perfectly, any single household scatters around it by an individual error \(e_0\).
Two sources of uncertainty: (1) estimation error from sampling the regression line, and (2) individual noise from the person-specific error term \(e_0\).
The variance of the forecast error captures both components:
Theorem 12.1 (Forecast Error Variance) For a prediction at \(x_0\), the variance of the forecast error \(f = y_0 - \hat{y}_0\) is:
\[
\text{var}(f) = \sigma^2 \left[ 1 + \frac{1}{N} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \right]
\]
In practice \(\sigma^2\) is replaced by its estimate \(\hat{\sigma}^2\), and \(\text{se}(f) = \sqrt{\widehat{\text{var}}(f)}\).
Compare this to the variance used for a confidence interval for the conditional mean \(E(y \mid x_0)\), which omits the leading “1.” That extra \(\sigma^2\) is the irreducible noise from individual behavior; it does not shrink with more data.
\(\implies\) Prediction intervals are always wider than confidence intervals for the mean, and they never collapse to a point. As \(N \to \infty\), the prediction interval converges to \(\hat{y}_0 \pm t_c \cdot \sigma\).
As the sample grows, estimation uncertainty vanishes, but individual noise remains. The prediction interval approaches \(\hat{y}_0 \pm t_c \sigma\); it never collapses to a point.
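The limit claim can be checked numerically. The sketch below, a rough illustration rather than anything from the text, holds \((x_0 - \bar{x})^2\) fixed at the food-data value of 0.16 and assumes the squared \(x\)-deviations accumulate at roughly 36.55 per observation (so that \(N = 40\) reproduces the chapter's \(\sum(x_i - \bar{x})^2 = 1462\)):

```python
import math

# Illustrative sketch: as N grows, the forecast-error variance
# sigma^2 * (1 + 1/N + (x0 - x_bar)^2 / ssx) approaches sigma^2,
# because 1/N -> 0 and ssx grows with N. ssx_per_obs is an assumption
# chosen so N = 40 matches the chapter's ssx = 1462.
sigma2 = 8013.0
x0_dev2 = 0.16          # (x0 - x_bar)^2, held fixed
ssx_per_obs = 36.55     # assumed average squared x-deviation per observation

for N in (40, 400, 4000, 40000):
    var_f = sigma2 * (1 + 1 / N + x0_dev2 / (ssx_per_obs * N))
    print(N, round(math.sqrt(var_f), 2))
# The forecast standard error falls toward sqrt(8013) ≈ 89.52, never below it.
```

The standard error shrinks from 90.63 at \(N = 40\) toward 89.52, but the floor \(\sqrt{\sigma^2}\) is never crossed: individual noise does not average away.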
Note: Predictions are most precise near \(\bar{x}\)
Both intervals are narrowest at \(x_0 = \bar{x}\) and widen as \(x_0\) moves away from the center of the data. The term \((x_0 - \bar{x})^2 / \sum(x_i - \bar{x})^2\) drives this: predicting far from the sample mean is less reliable.
Interactive: Prediction Interval vs. Confidence Interval
Use the slider to move the prediction point \(x_0\). Observe how both the confidence interval for \(E(y \mid x_0)\) and the prediction interval widen as \(x_0\) moves away from \(\bar{x}\), and notice that the prediction interval is always wider. The \(R^2\) gauge shows the fraction of variation explained by the model.
Figure 12.1: Prediction interval (outer band) vs. confidence interval for the mean (inner band). Both widen as x₀ moves away from x̄. The R² gauge shows overall model fit.
12.2 The Sum of Squares Decomposition
How much of the variation in \(y\) does the model actually explain? To answer this, decompose each observation’s deviation from the mean into an explained part and a residual:
\[
y_i - \bar{y} = (\hat{y}_i - \bar{y}) + \hat{e}_i
\]
Squaring and summing over the sample (the cross-product term vanishes when the model includes an intercept) gives
\[
\underbrace{\sum_i (y_i - \bar{y})^2}_{SST} = \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum_i \hat{e}_i^2}_{SSE}
\]
The Total Sum of Squares (SST) splits into the Regression Sum of Squares (SSR, variation explained by the model) and the Error Sum of Squares (SSE, variation left unexplained). This identity requires an intercept in the model; without one, the cross-product term does not vanish.
SST = total variation in \(y\). SSR = variation explained by \(x\). SSE = variation left over. The decomposition requires an intercept.
```mermaid
flowchart LR
    SST["SST<br/>Total variation<br/>in y"] --> SSR["SSR<br/>Explained by<br/>the model"]
    SST --> SSE["SSE<br/>Unexplained<br/>(residuals)"]
    SSR --> R2["R² = SSR / SST"]
    SSE --> R2
    style SST fill:#1E5A96,color:#fff
    style SSR fill:#2E8B57,color:#fff
    style SSE fill:#C41E3A,color:#fff
    style R2 fill:#D4A84B,color:#fff
```
Figure 12.2: The sum of squares decomposition splits total variation into explained and unexplained components.
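The decomposition is easy to verify numerically. The sketch below fits a simple regression (with an intercept) to synthetic data, so all numbers are illustrative; the point is only that SST equals SSR + SSE to machine precision:

```python
import numpy as np

# Sketch: verify SST = SSR + SSE on synthetic data. The identity holds
# because the fitted model includes an intercept column.
rng = np.random.default_rng(0)
x = rng.uniform(10, 30, size=40)                      # illustrative regressor
y = 83.42 + 10.21 * x + rng.normal(0, 50, size=40)    # illustrative truth

X = np.column_stack([np.ones_like(x), x])             # intercept + x
b, *_ = np.linalg.lstsq(X, y, rcond=None)             # OLS coefficients
y_hat = X @ b

sst = np.sum((y - y.mean()) ** 2)       # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained by the model
sse = np.sum((y - y_hat) ** 2)          # left in the residuals
assert np.isclose(sst, ssr + sse)       # fails if the intercept is dropped
```

Dropping the column of ones from `X` breaks the identity, which is exactly the intercept requirement noted above.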
12.3 \(R^2\): The Coefficient of Determination
Definition 12.1 (Coefficient of Determination) The coefficient of determination is the fraction of total variation explained by the model:
\[
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
\]
It is bounded between 0 and 1. An \(R^2\) of 1 means every observation falls exactly on the fitted line (\(SSE = 0\)); an \(R^2\) of 0 means \(x\) explains none of the variation in \(y\). In simple regression, \(R^2 = r_{xy}^2\), the squared sample correlation between \(x\) and \(y\). More generally, \(R^2 = \text{Corr}(y_i, \hat{y}_i)^2\), which extends to multiple regression.
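The three characterizations of \(R^2\) can be checked against each other. This sketch uses synthetic data (all values illustrative) and confirms that the sum-of-squares ratio, the squared correlation of \(x\) and \(y\), and the squared correlation of \(y\) and \(\hat{y}\) coincide in simple regression:

```python
import numpy as np

# Sketch: in simple regression, R^2 = 1 - SSE/SST equals both the squared
# sample correlation of x and y, and the squared correlation of y with
# the fitted values (the version that carries over to multiple regression).
rng = np.random.default_rng(1)
x = rng.normal(20, 5, size=200)
y = 80 + 10 * x + rng.normal(0, 40, size=200)   # illustrative data

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

r2_ss = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r2_xy = np.corrcoef(x, y)[0, 1] ** 2            # squared corr(x, y)
r2_yyhat = np.corrcoef(y, y_hat)[0, 1] ** 2     # squared corr(y, y_hat)
assert np.isclose(r2_ss, r2_xy) and np.isclose(r2_ss, r2_yyhat)
```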
Warning: Limitations of \(R^2\)
\(R^2\) does not tell you whether the model is correctly specified (a high \(R^2\) with the wrong functional form still gives misleading predictions), whether the coefficients are statistically significant (a model can have \(R^2 = 0.90\) from a time trend while the regressor of interest is insignificant), or whether the model predicts well out of sample. Evaluate a model by its coefficients, residual plots, and theory, not by \(R^2\) alone.
What counts as “good”? Cross-sectional microdata: 0.10 to 0.40. Time series: 0.60 to 0.95. The answer depends entirely on the setting.
What counts as a “good” \(R^2\) depends entirely on context. Cross-sectional microdata (wages, spending) typically produce \(R^2\) values between 0.10 and 0.40 because individual behavior is noisy. Time series data (GDP, yields) often reach 0.60 to 0.95.
| Misconception | Reality |
|---|---|
| “A high \(R^2\) means a good model” | A quadratic in time gives \(R^2 \approx 1\) for GDP, but tells us nothing about causes |
| “\(R^2 = 0.15\) means the model is bad” | In micro studies, 0.15 is typical and the coefficients can still be well-estimated |
| “Adding variables always improves the model” | \(R^2\) can only rise, but precision can fall; see adjusted \(R^2\) |
12.4 Effects of Data Scaling
Multiplying \(x\) by a constant \(c\) (for example, changing income from hundreds of dollars to dollars, so \(c = 100\)) divides the slope and its standard error by \(c\), but leaves \(t\)-statistics, \(p\)-values, and \(R^2\) unchanged. Multiplying \(y\) by \(c\) multiplies all coefficients and standard errors by \(c\), again leaving \(t\)-statistics and \(R^2\) untouched.
Warning: Scaling preserves all inferential conclusions
\(t\)-statistics, \(p\)-values, confidence interval coverage, \(R^2\), and \(F\)-statistics are all invariant to linear rescaling of the data. Only the units of the coefficients change. If someone asks whether your results change if you measure income in thousands instead of dollars, the answer is no.
Scaling rule: multiply \(x\) by \(c\)\(\implies\) divide \(b_2\) and \(\text{se}(b_2)\) by \(c\). Multiply \(y\) by \(c\)\(\implies\) multiply all coefficients and SEs by \(c\). Ratios (\(t\), \(F\), \(R^2\)) stay the same.
\(\implies\) Scaling is a change in units, not a change in the relationship. Statistical conclusions are invariant to the choice of units.
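The invariance is mechanical, so it is easy to demonstrate. The sketch below fits a simple regression by hand on synthetic data (illustrative values throughout), rescales \(x\) by \(c = 100\), and checks the scaling rule: the slope and its standard error shrink by \(c\) while the \(t\)-statistic and \(R^2\) are untouched:

```python
import numpy as np

def ols_simple(x, y):
    # Simple-regression OLS by the textbook formulas:
    # slope, its standard error, t-statistic, and R^2.
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b1 = y.mean() - b2 * x.mean()
    resid = y - b1 - b2 * x
    sigma2_hat = np.sum(resid ** 2) / (n - 2)        # unbiased error variance
    se_b2 = np.sqrt(sigma2_hat / sxx)
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return b2, se_b2, b2 / se_b2, r2

rng = np.random.default_rng(2)
x = rng.uniform(10, 30, size=40)                     # income, hundreds of $
y = 83 + 10 * x + rng.normal(0, 80, size=40)         # illustrative spending

b2, se, t, r2 = ols_simple(x, y)
b2c, sec, tc, r2c = ols_simple(100 * x, y)           # income now in dollars

assert np.isclose(b2c, b2 / 100) and np.isclose(sec, se / 100)
assert np.isclose(tc, t) and np.isclose(r2c, r2)     # inference unchanged
```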
12.5 Practice
A researcher estimates the food expenditure model (\(N = 40\)) and reports \(\hat{\sigma}^2 = 8013\), \(\bar{x} = 19.60\), and \(\sum(x_i - \bar{x})^2 = 1462\). Compute the standard error of the forecast at \(x_0 = 20\) and construct a 95% prediction interval given \(\hat{y}_0 = 287.62\) and \(t_{0.975, 38} = 2.024\).
The 95% prediction interval is \(287.62 \pm 2.024 \times 90.63 = [104.18, \; 471.06]\). This range is enormous: we predict somewhere between $104 and $471 for a household earning $2,000/week. The width reflects the large individual-level noise (\(\hat{\sigma}^2 = 8013\)), which no amount of data eliminates.
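The computation can be checked from the reported quantities alone (any difference in the last digit comes from rounding the standard error to two decimals):

```python
import math

# Check of the practice computation, using only the reported quantities.
sigma2, N, x_bar, ssx = 8013.0, 40, 19.60, 1462.0
x0, y0_hat, t_c = 20.0, 287.62, 2.024

# Theorem 12.1: var(f) = sigma^2 * (1 + 1/N + (x0 - x_bar)^2 / ssx)
se_f = math.sqrt(sigma2 * (1 + 1 / N + (x0 - x_bar) ** 2 / ssx))
half = t_c * se_f
print(round(se_f, 2), round(y0_hat - half, 2), round(y0_hat + half, 2))
# → 90.63 104.18 471.06
```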