15  Interpreting MR and Assumptions

What ‘Holding All Else Constant’ Actually Means

Categories: Multiple Regression, Assumptions, Multicollinearity

Author: Jake Anderson
Published: March 21, 2026
Modified: March 26, 2026

Abstract

Every multiple regression coefficient is a partial effect. This chapter explains what “holding constant” means computationally via the Frisch-Waugh-Lovell theorem, walks through assumptions MR1 through MR6, and distinguishes between perfect multicollinearity (a showstopper) and near multicollinearity (a nuisance).

15.1 Ceteris Paribus: Partial Effects

In the model \(E(y \mid x_2, \ldots, x_K) = \beta_1 + \beta_2 x_2 + \cdots + \beta_K x_K\), each slope coefficient is a partial derivative:

Definition 15.1 (Partial Effect) \[ \beta_k = \frac{\partial\, E(y \mid x_2, \ldots, x_K)}{\partial\, x_k} \tag{15.1}\]

This is the change in \(E(y)\) when \(x_k\) increases by one unit, holding all other explanatory variables fixed.

Compare this to simple regression, where the slope captures the total association. In multiple regression, the qualifier “holding \(x_3, x_4, \ldots\) constant” is essential; without it, the interpretation is incomplete.

Total vs. partial effect: In SLR, \(\beta_2\) captures everything that moves with \(x\). In MR, \(\beta_k\) captures only the part of \(x_k\)’s association with \(y\) that is independent of the other regressors.

If you omit a variable from the model, you cannot claim to hold it constant. The omitted variable’s influence leaks into the included coefficients via omitted variable bias.
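A quick numerical sketch makes this leak concrete. The simulated data below are hypothetical (not the chapter's dataset): experience is built to be correlated with education, so omitting it shifts the education coefficient by exactly the omitted-variable-bias formula (true effect plus the omitted coefficient times the slope of experience on education).

```python
# Sketch: omitted variable bias with simulated (hypothetical) data.
# True model: y = 3 + 1.5*educ + 0.8*exper + e, with educ and exper correlated.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
educ = rng.normal(13, 2, n)
exper = 20 - 0.6 * educ + rng.normal(0, 3, n)   # exper moves with educ
y = 3 + 1.5 * educ + 0.8 * exper + rng.normal(0, 2, n)

# Short (misspecified) regression: y on educ only — exper leaks in
X_short = np.column_stack([np.ones(n), educ])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

# Long (correct) regression: y on educ and exper
X_long = np.column_stack([np.ones(n), educ, exper])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

print(b_short[1])  # ≈ 1.5 + 0.8*(-0.6) = 1.02 — biased
print(b_long[1])   # ≈ 1.5 — partial effect recovered
```

The short regression's slope lands near 1.02, not 1.5: the education coefficient has absorbed part of experience's effect, exactly because experience was not held constant.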

15.2 The Frisch-Waugh-Lovell Theorem

How does OLS actually “hold other variables constant”? The Frisch-Waugh-Lovell (FWL) theorem provides a precise answer. To isolate \(b_3\) (the advertising coefficient) in a model with both price and advertising, FWL says: (1) regress SALES on PRICE and save the residuals \(\widetilde{\text{SALES}}\); (2) regress ADVERT on PRICE and save the residuals \(\widetilde{\text{ADVERT}}\); (3) regress \(\widetilde{\text{SALES}}\) on \(\widetilde{\text{ADVERT}}\). The slope from step 3 is exactly \(b_3\) from the full regression.

FWL in one sentence: OLS “holds PRICE constant” by stripping out what PRICE explains from both sides, then measuring the leftover association.

The residuals from steps 1 and 2 represent the parts of SALES and ADVERT that cannot be predicted by PRICE. Step 3 then asks: once we strip out everything that PRICE explains about both variables, does the leftover variation in advertising still predict the leftover variation in sales?

The partialled-out regression does not account for parameters estimated in the earlier steps, so its standard errors are too small. Always use the full model for inference.
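The three FWL steps can be verified numerically. This is a sketch on simulated data (the variable names echo the text's SALES/PRICE/ADVERT example, but the numbers are made up): the residual-on-residual slope matches the full-model coefficient to floating-point precision.

```python
# Sketch: verifying the FWL theorem on simulated (hypothetical) data.
import numpy as np

rng = np.random.default_rng(0)
n = 500
price = rng.uniform(4, 7, n)
advert = 0.5 * price + rng.normal(0, 0.5, n)            # correlated with price
sales = 100 - 8 * price + 2 * advert + rng.normal(0, 3, n)

# Full regression: SALES on constant, PRICE, ADVERT
X = np.column_stack([np.ones(n), price, advert])
b_full = np.linalg.lstsq(X, sales, rcond=None)[0]

# Steps 1-2: residualize SALES and ADVERT on (constant, PRICE)
Z = np.column_stack([np.ones(n), price])
sales_t = sales - Z @ np.linalg.lstsq(Z, sales, rcond=None)[0]
advert_t = advert - Z @ np.linalg.lstsq(Z, advert, rcond=None)[0]

# Step 3: slope of residuals-on-residuals equals the full-model b3
b3_fwl = (advert_t @ sales_t) / (advert_t @ advert_t)
print(b3_fwl, b_full[2])   # identical up to floating-point error
```

The equality is exact algebra, not an approximation, which is why FWL is a theorem rather than a heuristic.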

Interactive: Ceteris Paribus Visualizer

This visualizer shows wage data colored by experience level. Use the slider to fix experience at different values and see how the education-wage relationship looks when experience is held constant. Observations near the chosen experience level are highlighted; others fade out.

viewof fixedExper = Inputs.range([1, 40], {value: 15, step: 1, label: "Fix experience at"})
viewof bandWidth = Inputs.range([2, 10], {value: 5, step: 1, label: "Experience window (±)"})

cp_data = {
  const rng = d3.randomLcg(55);
  const rnorm = d3.randomNormal.source(rng)(0, 1);
  const N = 300;

  const educ = Array.from({length: N}, () => 6 + rng() * 14);
  const exper = Array.from({length: N}, () => 1 + rng() * 39);
  const wage = educ.map((e, i) =>
    3 + 1.5 * e + 0.8 * exper[i] - 0.01 * exper[i]**2 + rnorm() * 5
  );

  // Classify by experience level
  const points = educ.map((e, i) => ({
    educ: e,
    wage: wage[i],
    exper: exper[i],
    near: Math.abs(exper[i] - fixedExper) <= bandWidth,
    experGroup: exper[i] < 10 ? "0-10" : exper[i] < 20 ? "10-20" : exper[i] < 30 ? "20-30" : "30-40"
  }));

  // Fit regression line through "near" points only
  const nearPts = points.filter(p => p.near);
  if (nearPts.length < 5) return {points, slope: null, intercept: null, nearPts};

  const xbar = d3.mean(nearPts, d => d.educ);
  const ybar = d3.mean(nearPts, d => d.wage);
  const Sxx = d3.sum(nearPts.map(d => (d.educ - xbar)**2));
  const Sxy = d3.sum(nearPts.map(d => (d.educ - xbar)*(d.wage - ybar)));
  const slope = Sxy / Sxx;
  const intercept = ybar - slope * xbar;

  return {points, slope, intercept, nearPts, nNear: nearPts.length};
}

Plot.plot({
  width: 650,
  height: 400,
  marginLeft: 50,
  x: {label: "Education (years)", domain: [5, 22]},
  y: {label: "Wage ($/hr)"},
  color: {legend: true, label: "Experience group"},
  marks: [
    Plot.dot(cp_data.points.filter(p => !p.near), {
      x: "educ", y: "wage", fill: "experGroup", r: 2.5, fillOpacity: 0.15
    }),
    Plot.dot(cp_data.points.filter(p => p.near), {
      x: "educ", y: "wage", fill: "experGroup", r: 4, fillOpacity: 0.8,
      stroke: "#333", strokeWidth: 0.5
    }),
    cp_data.slope !== null ? Plot.line(
      [{x: 6, y: cp_data.intercept + cp_data.slope * 6},
       {x: 20, y: cp_data.intercept + cp_data.slope * 20}],
      {x: "x", y: "y", stroke: "#C41E3A", strokeWidth: 2.5}
    ) : null
  ]
})
html`<div style="margin-top:0.5em">
  <strong>Showing observations with experience ∈ [${fixedExper - bandWidth}, ${fixedExper + bandWidth}]</strong>
  (${cp_data.nNear || 0} observations)
  ${cp_data.slope !== null ? html`<br/>Estimated slope of education (holding experience ≈ ${fixedExper}): <strong>${cp_data.slope.toFixed(3)}</strong>` : ""}
</div>`
Figure 15.1: Ceteris paribus visualizer. Fixing experience at different levels reveals the partial effect of education on wages. Points near the chosen experience level are highlighted.

As you move the experience slider, the highlighted subset changes and so does the fitted line. The slope of education should remain roughly stable if the model is correctly specified; this is the partial effect.

15.3 Assumptions MR1 through MR6

The assumptions generalize directly from simple regression. Five carry over unchanged; only one is genuinely new.

Summary of MR assumptions.

| Assumption | Statement | What it buys you |
|---|---|---|
| MR1 | \(y_i = \beta_1 + \beta_2 x_{i2} + \cdots + \beta_K x_{iK} + e_i\) | Correct model specification |
| MR2 | \(E(e_i \mid \mathbf{X}) = 0\) (strict exogeneity) | Unbiasedness: \(E(b_k) = \beta_k\) |
| MR3 | \(\text{Var}(e_i \mid \mathbf{X}) = \sigma^2\) (homoskedasticity) | Correct standard errors |
| MR4 | \(\text{Cov}(e_i, e_j \mid \mathbf{X}) = 0\) for \(i \neq j\) | No serial correlation |
| MR5 | No exact linear relationship among the \(x\)'s | OLS can be computed |
| MR6 | \(e_i \mid \mathbf{X} \sim N(0, \sigma^2)\) (optional) | Exact \(t_{(N-K)}\) inference |
What failure of each assumption looks like in practice:
  • MR1 fails: Wrong functional form (e.g., linear when the true relationship is quadratic) or omitted relevant variables. Residual plots show systematic patterns.
  • MR2 fails: Endogeneity. Coefficients are biased; no fix within OLS.
  • MR3 fails: Heteroskedasticity. Coefficients are unbiased, but standard errors and \(t\)-tests are wrong.
  • MR4 fails: Autocorrelation. Same consequences as MR3.
  • MR5 fails: Perfect multicollinearity. Software throws an error or drops a variable.
  • MR6 fails: Non-normal errors. In large samples (\(N > 30\)), the CLT covers you. In small samples, exact \(t\)- and \(F\)-distributions are not valid.

Under MR1 through MR5, the Gauss-Markov theorem guarantees OLS is Best Linear Unbiased (BLUE). Adding MR6 gives exact \(t\)- and \(F\)-distributions in finite samples. Without MR6, the Central Limit Theorem still provides approximate normality in large samples.
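Unbiasedness under MR1 and MR2 can be seen directly by Monte Carlo. This sketch (simulated design, arbitrary coefficients of my choosing) redraws the errors many times with the regressors held fixed; the average of the OLS estimates converges to the true \(\beta\)'s.

```python
# Sketch: Monte Carlo check that OLS is unbiased under MR1-MR5.
# Design matrix X is held fixed; only the errors are redrawn each replication.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 5000
beta = np.array([1.0, 2.0, -0.5])                       # arbitrary true values
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 5, n)])

draws = np.empty((reps, 3))
for r in range(reps):
    e = rng.normal(0, 2, n)    # MR2-MR4: mean zero, homoskedastic, uncorrelated
    y = X @ beta + e           # MR1: the fitted model is the true model
    draws[r] = np.linalg.lstsq(X, y, rcond=None)[0]

print(draws.mean(axis=0))      # ≈ [1.0, 2.0, -0.5]
```

Note the errors here happen to be normal, but unbiasedness does not use MR6; redrawing `e` from any mean-zero distribution gives the same result.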

flowchart TD
    A["MR1: Correct specification"] --> B["MR2: Exogeneity"]
    B --> C["MR1-MR2: OLS is unbiased"]
    C --> D["MR3: Homoskedasticity<br/>MR4: No autocorrelation<br/>MR5: No perfect collinearity"]
    D --> E["MR1-MR5: OLS is BLUE<br/>(Gauss-Markov)"]
    E --> F["MR6: Normal errors"]
    F --> G["MR1-MR6: Exact t and F<br/>distributions"]

    style C fill:#2E8B57,color:#fff
    style E fill:#1E5A96,color:#fff
    style G fill:#D4A84B,color:#fff
Figure 15.2: Hierarchy of MR assumptions. Each level builds on the previous one, adding stronger guarantees.

15.4 Perfect vs. Near Multicollinearity

MR5 requires that no regressor is an exact linear function of the others. Classic violations include including both \(\text{age}\) and \(\text{birth\_year}\) (since one is a deterministic function of the other), including all \(g\) category dummies plus an intercept (the dummy variable trap), or including budget shares that sum to 1. When MR5 fails, OLS cannot be computed: the normal equations have no unique solution.

Warning: Perfect vs. near collinearity are different problems entirely

Perfect collinearity violates MR5 and is a showstopper; OLS cannot run. Near collinearity is a nuisance; OLS runs and remains BLUE, but standard errors inflate. Do not confuse the two.

Near multicollinearity is different. Regressors are highly correlated but not perfectly so. MR5 is not violated, OLS is still BLUE, and the coefficients are still unbiased. The problem is purely about precision. The variance formula for \(b_2\) in a three-variable model is:

Theorem 15.1 (Variance Under Collinearity) \[ \text{Var}(b_2 \mid \mathbf{X}) = \frac{\sigma^2}{(1 - r_{23}^2) \sum(x_{i2} - \bar{x}_2)^2} \tag{15.2}\]

As \(|r_{23}| \to 1\), the factor \((1 - r_{23}^2)\) shrinks toward zero and the variance explodes. At \(r_{23} = 0.9\), the variance is 5.3 times what it would be with uncorrelated regressors. At \(r_{23} = 0.99\), it is 50 times larger.

| \(|r_{23}|\) | \(1/(1 - r_{23}^2)\) | Interpretation |
|---|---|---|
| 0.0 | 1.0 | Baseline (no collinearity) |
| 0.5 | 1.3 | Mild; barely noticeable |
| 0.8 | 2.8 | Moderate; variance nearly tripled (SEs about 1.7x baseline) |
| 0.9 | 5.3 | Problematic; SEs more than double |
| 0.95 | 10.3 | Severe; \(t\)-tests have little power |
| 0.99 | 50.3 | Extreme; coefficients nearly unidentifiable |
Classic warning signs of near multicollinearity:
  • High overall \(R^2\) but individually insignificant \(t\)-statistics
  • Large standard errors that seem out of proportion
  • Coefficients that flip sign or change magnitude when one observation is dropped
  • Joint significance (via \(F\)-test) alongside individual insignificance

Collinearity is a data problem, not a model problem. The model is fine; the data do not contain enough independent variation to pin down individual coefficients precisely.

In short: near collinearity does not bias OLS, but it inflates standard errors and makes individual coefficients hard to pin down. We cover diagnosis (via the Variance Inflation Factor) and remedies in Model Specification.
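Equation 15.2 predicts that the standard error of \(b_2\) scales with \(1/\sqrt{1 - r_{23}^2}\). The sketch below (simulated data; `r` is the designed correlation between the two regressors) checks this by Monte Carlo: at \(r = 0.9\) the empirical SE ratio should land near \(\sqrt{5.3} \approx 2.3\).

```python
# Sketch: SE inflation under near collinearity, compared to Equation 15.2.
import numpy as np

rng = np.random.default_rng(7)
n, reps, sigma = 200, 2000, 1.0

def sd_of_b2(r):
    """Monte Carlo standard deviation of b2 when corr(x2, x3) = r."""
    cov = np.array([[1.0, r], [r, 1.0]])
    draws = []
    for _ in range(reps):
        x = rng.multivariate_normal([0.0, 0.0], cov, n)
        y = 1 + 0.5 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(0, sigma, n)
        X = np.column_stack([np.ones(n), x])
        draws.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.std(draws)

ratio = sd_of_b2(0.9) / sd_of_b2(0.0)
print(ratio)   # ≈ sqrt(1 / (1 - 0.9**2)) ≈ 2.29
```

The coefficient estimates stay centered on 0.5 in both cases; only their spread changes, which is the sense in which near collinearity is a precision problem rather than a bias problem.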

15.5 Practice

A researcher estimates \(\text{wage}_i = \beta_1 + \beta_2 \text{educ}_i + \beta_3 \text{exper}_i + \beta_4 \text{female}_i + e_i\) using cross-sectional data. Which MR assumption is most likely violated, and why?

MR2 (strict exogeneity) is the hardest to defend. The error term \(e_i\) contains unobserved factors like ability and motivation. If ability is correlated with education (more able people get more schooling), then \(\text{Cov}(\text{educ}_i, e_i) \neq 0\), and the education coefficient absorbs part of the ability effect. This is omitted variable bias. MR3 might also be suspect (wage variability could differ by education level), but MR2 is the most consequential violation because it causes bias, not just inefficiency.
