19  Model Specification, Multicollinearity, and Model Selection

What Happens When You Leave Something Out, Put Too Much In, or Can’t Tell Them Apart

Model Specification
Multicollinearity
Model Selection
Author

Jake Anderson

Published

March 21, 2026

Modified

March 26, 2026

Abstract

Omitting a relevant variable causes bias; including an irrelevant variable inflates standard errors. This chapter revisits the OVB formula with numerical examples, introduces the RESET test for misspecification, develops the Variance Inflation Factor (VIF) for diagnosing collinearity, and presents AIC and BIC for choosing among competing models.

19.1 Omitted Variable Bias Revisited

We introduced Omitted Variable Bias (OVB) in Chapter 13. Here we add a numerical example and emphasize that OVB does not vanish with more data.

If the true model is \(y = \beta_1 + \beta_2 x + \beta_3 z + e\) but we omit \(z\), the OLS estimate of \(\beta_2\) converges to the wrong value:

\[ b_2^* \xrightarrow{p} \beta_2 + \beta_3 \frac{\text{Cov}(x, z)}{\text{Var}(x)} \tag{19.1}\]


The direction of bias depends on two signs: the effect of the omitted variable on \(y\) (\(\beta_3\)) and its correlation with the included regressor. Using Koop-Tobias wage data (\(N = 1{,}057\)): without an ability proxy, the estimated return to education is 7.3%; with an ability proxy (test scores), it falls to 5.9%. Here both signs are positive (ability raises wages and is positively correlated with education), so the bias is upward: the omitted ability variable inflated the education coefficient by about 24%.

Two conditions eliminate OVB: \(\beta_3 = 0\) (the omitted variable does not affect \(y\)) or \(\text{Cov}(x, z) = 0\) (the omitted variable is uncorrelated with the included regressor). An omitted variable is only a problem when both conditions fail.
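The convergence in Equation 19.1 can be checked by simulation. The sketch below (Python, with purely hypothetical parameter values) generates data in which ability \(z\) is correlated with education \(x\), omits \(z\) from the regression, and confirms that the short-regression slope settles on \(\beta_2 + \beta_3\,\text{Cov}(x,z)/\text{Var}(x)\) rather than \(\beta_2\):

```python
# OVB simulation (hypothetical numbers, not the Koop-Tobias data).
import random, statistics

random.seed(1)
N = 50_000
beta2, beta3 = 0.06, 0.5                   # true returns to education and ability
x, z, y = [], [], []
for _ in range(N):
    xi = random.gauss(12, 2)               # education
    zi = 0.3 * xi + random.gauss(0, 1)     # ability, correlated with education
    yi = 1.0 + beta2 * xi + beta3 * zi + random.gauss(0, 0.5)
    x.append(xi); z.append(zi); y.append(yi)

def cov(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

b2_short = cov(x, y) / cov(x, x)               # slope from the short regression (z omitted)
plim = beta2 + beta3 * cov(x, z) / cov(x, x)   # Eq. (19.1) prediction
print(round(b2_short, 3), round(plim, 3))      # both near 0.06 + 0.5*0.3 = 0.21
```

Doubling \(N\) tightens the sampling noise around 0.21, not around the true 0.06: more data makes the estimator converge to the wrong value more precisely.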

19.2 Including Irrelevant Variables

The opposite mistake is including a variable that does not belong in the model. If \(\beta_4 = 0\) in the true model but you include shoe size alongside education and ability, the estimates of \(\beta_1\), \(\beta_2\), and \(\beta_3\) remain unbiased. The cost is larger standard errors: the variance formula includes a factor \(1/(1 - R_k^2)\) from the auxiliary regression of \(x_k\) on all other regressors, and adding shoe size raises \(R_k^2\) at least slightly, inflating \(\text{Var}(b_k)\).

\(\implies\) Given a choice between the two mistakes, including an irrelevant variable is the lesser evil: you pay with precision, not with bias. But this does not mean “throw everything in.” Every unnecessary variable costs precision and widens confidence intervals. The goal is a well-specified model.

| Mistake | Consequence for \(\hat{\beta}\) | Consequence for SE |
|---|---|---|
| Omit relevant variable | Biased and inconsistent | Can be larger or smaller |
| Include irrelevant variable | Unbiased and consistent | Larger (inflated) |

Bias cannot be fixed with more data; imprecision can. Omitting a relevant variable is almost always the more serious mistake. The exception is when including too many correlated regressors makes standard errors so large that nothing is significant; see Section 19.4 below.

19.3 The RESET Test

The Regression Specification Error Test (RESET) checks for functional form misspecification and omitted variables. The idea: if the model is correctly specified, nonlinear functions of the fitted values \(\hat{y}\) should not add explanatory power.

RESET logic: If the model is right, \(\hat{y}^2\) and \(\hat{y}^3\) should be noise. If they are significant, the model is missing nonlinearities or variables.

The procedure is: (1) estimate the model and compute \(\hat{y}\); (2) estimate the augmented model \(y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \gamma_1 \hat{y}^2 + \gamma_2 \hat{y}^3 + e\); (3) test \(H_0: \gamma_1 = \gamma_2 = 0\) with an F-test.

Rejecting \(H_0\) signals misspecification, but RESET does not tell you what is wrong, only that something is. Failure to reject is not conclusive evidence that the model is correct; RESET has limited power against certain alternatives. It is a general-purpose diagnostic, not a targeted test.
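A minimal version of the three-step procedure can be sketched in Python with simulated data (for brevity only \(\hat{y}^2\) is added, whereas the version in the text also adds \(\hat{y}^3\)). The true relationship here is quadratic, so fitting a straight line should make RESET reject:

```python
# RESET sketch on simulated data: the true model is quadratic, the fitted one linear.
import random

random.seed(0)
N = 500
x = [random.uniform(0, 10) for _ in range(N)]
y = [1 + 0.5 * xi + 0.2 * xi**2 + random.gauss(0, 1) for xi in x]

def ols(X, y):
    """Solve the normal equations (X'X)b = X'y by Gaussian elimination; return (b, SSE)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for i in range(k):                                  # forward elimination, partial pivot
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    b = [0.0] * k
    for i in reversed(range(k)):                        # back substitution
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    sse = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, yi in zip(X, y))
    return b, sse

# Step 1: estimate the (misspecified) linear model and compute fitted values.
br, sse_r = ols([[1.0, xi] for xi in x], y)
yhat = [br[0] + br[1] * xi for xi in x]

# Step 2: augment with yhat^2.
_, sse_u = ols([[1.0, xi, yh**2] for xi, yh in zip(x, yhat)], y)

# Step 3: F-test of the augmentation term (J = 1 restriction, K = 3 in the augmented model).
J, K = 1, 3
F = ((sse_r - sse_u) / J) / (sse_u / (N - K))
print(f"RESET F = {F:.1f}")   # far above the 5% critical value (about 3.86): misspecified
```

With the quadratic term restored to the model, the same test statistic would collapse toward its null distribution.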

flowchart TD
    A["Model specification<br/>choices"] --> B["Omit relevant<br/>variable"]
    A --> C["Include irrelevant<br/>variable"]
    A --> D["Wrong functional<br/>form"]
    B --> E["Biased, inconsistent<br/>(does not vanish with N)"]
    C --> F["Unbiased, but<br/>larger standard errors"]
    D --> G["Biased predictions<br/>RESET test detects this"]

    style B fill:#C41E3A,color:#fff
    style C fill:#D4A84B,color:#fff
    style D fill:#C41E3A,color:#fff
    style E fill:#C41E3A,color:#fff
    style F fill:#D4A84B,color:#fff
Figure 19.1: Two types of specification error and their consequences. Omission causes bias; inclusion causes inefficiency.

19.4 Multicollinearity and the Variance Inflation Factor

When regressors are highly correlated, OLS remains unbiased and BLUE, but the estimates become imprecise. The Variance Inflation Factor (VIF) quantifies how much collinearity inflates the variance of a coefficient.

To compute VIF for regressor \(x_2\): regress \(x_2\) on all other regressors and record \(R_2^2\). Then:

Definition 19.1 (Variance Inflation Factor) \[ \text{VIF} = \frac{1}{1 - R_2^2} \tag{19.2}\]

A VIF of 1 means no collinearity (baseline variance). A VIF of 10 means the variance is 10 times the baseline (\(R_2^2 = 0.9\)). The rule of thumb is that VIF above 10 signals problematic collinearity.
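With a single other regressor, the auxiliary \(R_2^2\) in Equation 19.2 is just the squared sample correlation between the two regressors, so the VIF can be computed directly. A small simulated sketch (all numbers hypothetical):

```python
# VIF for x2 when the model has one other regressor x3: VIF = 1/(1 - r^2).
import math, random

random.seed(3)
N = 500
target_r = 0.95
x3 = [random.gauss(0, 1) for _ in range(N)]
x2 = [target_r * v + math.sqrt(1 - target_r**2) * random.gauss(0, 1) for v in x3]

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

R2 = corr(x2, x3) ** 2        # auxiliary R^2 of x2 on x3
VIF = 1 / (1 - R2)            # Eq. (19.2)
print(f"R2 = {R2:.3f}, VIF = {VIF:.1f}")   # near 1/(1 - 0.95^2), about 10
```

With more than two regressors the auxiliary \(R_2^2\) comes from a full regression of \(x_2\) on all the others, but the formula is unchanged.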

Warning: VIF above 10 does not mean “drop the variable”

High VIF means imprecise individual estimates, not biased ones. If theory says the variable belongs, keep it. Dropping it causes OVB. The correct response is to report joint significance (via the \(F\)-test) and acknowledge the imprecision.

Symptoms of collinearity include a high overall \(R^2\) with individually insignificant \(t\)-statistics, large standard errors, coefficients that change substantially when a single observation is added or removed, and joint significance (via the \(F\)-test) alongside individual insignificance. The critical point: do not drop a variable merely because its \(t\)-statistic is insignificant; if it belongs in the model theoretically, removing it causes OVB.

Interactive: VIF Explorer

Drag the slider to increase the correlation between \(x_2\) and \(x_3\). Watch the VIF climb and the sampling distribution of \(\hat{\beta}_2\) widen as multicollinearity increases.

viewof rho_vif = Inputs.range([0, 0.99], {value: 0.5, step: 0.01, label: "Corr(x₂, x₃)"})

vif_data = {
  const rho = rho_vif;
  const VIF = 1 / (1 - rho * rho);
  const beta2_true = 2.0;
  const sigma2 = 10;
  const Sxx = 100; // sum of squared deviations of x2

  // Variance of b2
  const varB2 = sigma2 / ((1 - rho * rho) * Sxx);
  const seB2 = Math.sqrt(varB2);

  // Baseline (rho = 0)
  const varB2_base = sigma2 / Sxx;
  const seB2_base = Math.sqrt(varB2_base);

  // Generate sampling distribution
  const xRange = d3.range(beta2_true - 4 * seB2, beta2_true + 4 * seB2, 0.02);
  const dist = xRange.map(x => ({
    x,
    y: Math.exp(-0.5 * ((x - beta2_true) / seB2) ** 2) / (seB2 * Math.sqrt(2 * Math.PI))
  }));

  const baseDist = xRange.map(x => ({
    x,
    y: Math.exp(-0.5 * ((x - beta2_true) / seB2_base) ** 2) / (seB2_base * Math.sqrt(2 * Math.PI))
  }));

  return {VIF, rho, varB2, seB2, dist, baseDist, beta2_true};
}

Plot.plot({
  width: 600, height: 300,
  x: {label: "β̂₂", domain: [vif_data.beta2_true - 3, vif_data.beta2_true + 3]},
  y: {label: "Density"},
  marks: [
    Plot.areaY(vif_data.baseDist, {x: "x", y: "y", fill: "#2E8B57", fillOpacity: 0.15}),
    Plot.line(vif_data.baseDist, {x: "x", y: "y", stroke: "#2E8B57", strokeWidth: 1.5, strokeDasharray: "4,4"}),
    Plot.areaY(vif_data.dist, {x: "x", y: "y", fill: "#1E5A96", fillOpacity: 0.25}),
    Plot.line(vif_data.dist, {x: "x", y: "y", stroke: "#1E5A96", strokeWidth: 2}),
    Plot.ruleX([vif_data.beta2_true], {stroke: "#C41E3A", strokeDasharray: "6,4", strokeWidth: 1.5})
  ]
})
html`<div style="display:flex; gap:2em; flex-wrap:wrap; margin-top:0.5em">
  <div><strong>Corr(x₂, x₃):</strong> ${vif_data.rho.toFixed(2)}</div>
  <div><strong>VIF:</strong> ${vif_data.VIF.toFixed(1)}</div>
  <div><strong>se(β̂₂):</strong> ${vif_data.seB2.toFixed(3)}</div>
  <div style="color:#2E8B57">Dashed = baseline (ρ = 0)</div>
  <div style="color:#C41E3A">Red line = true β₂</div>
</div>
<div style="margin-top:0.3em; font-style:italic">
${vif_data.VIF > 10 ? "VIF > 10: problematic collinearity. Standard errors are severely inflated." :
  vif_data.VIF > 5 ? "VIF between 5 and 10: moderate collinearity. Standard errors are noticeably larger." :
  "VIF < 5: collinearity is mild. Precision is near baseline."}
</div>`
Figure 19.2: VIF explorer. As the correlation between regressors increases, VIF rises and the sampling distribution of β̂₂ spreads out. OLS remains unbiased (centered on the true value), but precision deteriorates.

Both distributions are centered on the true \(\beta_2\) (red dashed line). Collinearity does not cause bias; it causes imprecision. The blue curve spreads as \(\rho\) increases.

Note: Remedies for collinearity

Getting more data can help if the new observations provide independent variation. Imposing restrictions from economic theory (e.g., constant returns to scale) reduces the number of free parameters and narrows confidence intervals, but introduces bias if the restriction is wrong.

19.5 Model Selection: AIC and BIC

When choosing among competing models, \(R^2\) is inadequate because it always increases with more variables. Three criteria that penalize complexity are commonly used.

The adjusted \(R^2\) (\(\bar{R}^2\)) increases only when the added variable’s \(|t| > 1\), corresponding to a significance level of about 32%. This threshold is too lenient for most applications.

The Akaike Information Criterion (AIC) and the Schwarz (Bayesian) Information Criterion (BIC) provide sharper penalties:

Definition 19.2 (AIC and BIC) \[ \text{AIC} = \ln\!\left(\frac{SSE}{N}\right) + \frac{2K}{N} \qquad \text{BIC} = \ln\!\left(\frac{SSE}{N}\right) + \frac{K \ln(N)}{N} \tag{19.3}\]

Both have the same structure: a goodness-of-fit term plus a complexity penalty. Smaller values are better.

BIC penalizes more heavily than AIC whenever \(\ln(N) > 2\), which holds for any sample size \(N \geq 8\). In practice, BIC is the more conservative criterion: it tends to select simpler models.
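Equation 19.3 and this ranking can be illustrated with hypothetical fit results: an extra regressor that lowers SSE only slightly is accepted by AIC but rejected by the more heavily penalizing BIC.

```python
# AIC and BIC per Eq. (19.3); the fit results below are hypothetical.
import math

def aic(sse, N, K):
    return math.log(sse / N) + 2 * K / N

def bic(sse, N, K):
    return math.log(sse / N) + K * math.log(N) / N

N = 50
sse_a, k_a = 100.0, 3      # model A
sse_b, k_b = 94.0, 4       # model B: one extra regressor, slightly lower SSE

print(f"AIC: A={aic(sse_a, N, k_a):.4f}  B={aic(sse_b, N, k_b):.4f}")  # AIC prefers B
print(f"BIC: A={bic(sse_a, N, k_a):.4f}  B={bic(sse_b, N, k_b):.4f}")  # BIC prefers A
```

The disagreement arises exactly when the fit improvement \(\ln(SSE_A/SSE_B)\) falls between the two penalties, \(2/N\) and \(\ln(N)/N\).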

| Criterion | Penalty per parameter | Tends to select | When to use |
|---|---|---|---|
| \(\bar{R}^2\) | \(\approx \lvert t\rvert > 1\) threshold | Larger models | Quick comparison; least conservative |
| AIC | \(2K/N\) | Moderate models | Forecasting; minimizing prediction error |
| BIC | \(K\ln(N)/N\) | Smaller models | Consistent model selection; large \(N\) |

You can only compare AIC and BIC across models with the same dependent variable. Comparing a model for \(y\) against a model for \(\ln(y)\) using these criteria is not valid. Use the generalized \(R^2\) instead (see Chapter 12).


19.6 Practice

A researcher estimates a Cobb-Douglas production function \(\ln(\text{PROD}) = \beta_1 + \beta_2 \ln(\text{AREA}) + \beta_3 \ln(\text{LABOR}) + \beta_4 \ln(\text{FERT}) + e\) and finds that \(\ln(\text{AREA})\) and \(\ln(\text{LABOR})\) are both individually insignificant (\(p = 0.198\) and \(p = 0.465\)), but jointly significant (\(p = 0.002\)). The VIF for \(\ln(\text{LABOR})\) is 17.9. What should the researcher do?

The high VIF (17.9 > 10) confirms problematic collinearity between AREA and LABOR: on Philippine rice farms, bigger farms use more labor, so the two variables move together. Dropping either variable would cause OVB because both belong in the production function. Instead, the researcher should (1) report the joint \(F\)-test result to demonstrate that area and labor are jointly significant, and (2) consider getting more data (e.g., combining 1993 and 1994 data) to increase independent variation.
