20  Indicator Variables

Putting Qualitative Information into a Regression

Indicator Variables
Dummy Variables
Chow Test
Author

Jake Anderson

Published

March 21, 2026

Modified

March 26, 2026

Abstract

Gender, region, and industry are not numbers on a continuous scale, but they affect outcomes. Indicator (dummy) variables encode categorical information as 0/1 regressors. This chapter develops intercept and slope indicators, the dummy variable trap, the Chow test for structural difference across groups, and the exact percentage interpretation of indicator coefficients in log-linear models.

20.1 Intercept Indicators: Parallel Shifts

An indicator variable equals 1 if a characteristic is present and 0 otherwise. Putting \(\text{female}_i\) into a wage regression:

Definition 20.1 (Intercept Dummy Model) \[ \text{wage}_i = \beta_1 + \delta\,\text{female}_i + \beta_2\,\text{educ}_i + e_i \tag{20.1}\]

For men (\(\text{female} = 0\)): \(E(\text{wage}) = \beta_1 + \beta_2 \text{educ}\). For women (\(\text{female} = 1\)): \(E(\text{wage}) = (\beta_1 + \delta) + \beta_2 \text{educ}\). The model produces two parallel lines with the same slope but different intercepts. The coefficient \(\delta\) is the wage difference between women and men, holding education constant.
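The parallel-lines structure is easy to verify numerically. Here is a minimal sketch (simulated data, not the textbook's dataset; the true values \(\beta_1 = 5\), \(\delta = -3\), \(\beta_2 = 1.8\) are chosen for illustration) that fits Equation 20.1 by OLS and prints the two fitted lines:

```python
import numpy as np

# Illustrative simulation: wages generated from the intercept-dummy model.
rng = np.random.default_rng(0)
n = 2000
female = rng.integers(0, 2, n)          # indicator: 1 = female, 0 = male
educ = rng.uniform(8, 20, n)
wage = 5.0 - 3.0 * female + 1.8 * educ + rng.normal(0, 2, n)

# OLS via least squares on [1, female, educ]
X = np.column_stack([np.ones(n), female, educ])
b1, delta, b2 = np.linalg.lstsq(X, wage, rcond=None)[0]

# The two fitted lines share the slope b2; they differ only by delta.
print(f"male line:   wage = {b1:.2f} + {b2:.2f}*educ")
print(f"female line: wage = {b1 + delta:.2f} + {b2:.2f}*educ")
```

Because `female` enters only through the intercept, the estimated slope on `educ` is identical for both groups by construction.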

Reference group: The group coded 0. All indicator coefficients are measured relative to this group. Changing the reference group changes the sign and magnitude of the coefficients, but not the model’s predictions.

With \(\text{female}_i\), males are the reference (base) group. The sign of \(\delta\) depends on the coding: defining \(\text{male}_i = 1 - \text{female}_i\) instead gives \(\delta^* = -\delta\), but the fitted values for each group are identical either way.

20.2 Multiple Categories and the Dummy Variable Trap

For \(g\) categories (such as four U.S. regions), include \(g - 1\) indicator variables. One category must be omitted as the reference group. If you include all \(g\) indicators plus an intercept, the sum of the indicator columns equals the intercept column for every observation. This creates exact collinearity, violating MR5, and OLS cannot run. This is the dummy variable trap.

Caution: The dummy variable trap

With \(g\) categories, include exactly \(g - 1\) indicators. Including all \(g\) plus an intercept creates perfect collinearity. Most software will silently drop one variable; it is better to choose deliberately which group is the reference.

Each coefficient \(\delta_j\) measures the difference between group \(j\) and the omitted reference group, holding other variables constant. To compare two non-reference groups, take the difference of their coefficients and use a linear combination test for inference.
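The collinearity at the heart of the trap can be seen directly in the rank of the design matrix. A sketch with a hypothetical four-category region variable (simulated, not from the text):

```python
import numpy as np

# With g = 4 region dummies AND an intercept, the dummy columns sum to the
# intercept column for every observation, so the matrix loses a rank.
rng = np.random.default_rng(1)
region = rng.integers(0, 4, 100)                      # hypothetical 4-category variable
D = (region[:, None] == np.arange(4)).astype(float)   # all g = 4 indicators
ones = np.ones((100, 1))

X_trap = np.hstack([ones, D])        # intercept + all 4 dummies: rank deficient
X_ok = np.hstack([ones, D[:, 1:]])   # intercept + g - 1 = 3 dummies: full rank

print(np.linalg.matrix_rank(X_trap))  # 4, not 5 -> perfect collinearity
print(np.linalg.matrix_rank(X_ok))    # 4 = number of columns -> OLS runs
```

Dropping the first dummy makes region 0 the reference group; each remaining coefficient measures that region's difference from region 0.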

flowchart LR
    A["g categories<br/>(e.g., 4 regions)"] --> B{"Include all g<br/>dummies + intercept?"}
    B -->|Yes| C["Perfect collinearity<br/>(OLS fails)"]
    B -->|No| D["Include g - 1 dummies<br/>+ intercept"]
    D --> E["One category is the<br/>reference group"]
    E --> F["Each δⱼ = group j<br/>minus reference"]

    style C fill:#C41E3A,color:#fff
    style D fill:#2E8B57,color:#fff
    style F fill:#1E5A96,color:#fff
Figure 20.1: The dummy variable trap. With g categories, using g - 1 indicators avoids perfect collinearity.

20.3 Slope Dummies: Different Slopes for Different Groups

The intercept-dummy model forces the same slope for both groups. To allow different returns to education by gender, add an interaction term (a slope indicator):

\[ \text{wage}_i = \beta_1 + \delta\,\text{female}_i + \beta_2\,\text{educ}_i + \gamma\,(\text{female}_i \times \text{educ}_i) + e_i \tag{20.2}\]

Now each group has its own intercept and slope. The coefficient \(\gamma\) measures how much the return to education differs for women relative to men. The full wage gap at education level \(e\) is \(\delta + \gamma \cdot e\), which is not a single number; you must specify at what education level to evaluate it.
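A small sketch makes the education-dependence of the gap concrete. The estimates below (\(\delta = -2.0\), \(\gamma = -0.3\)) are hypothetical numbers chosen for illustration, not results from the text:

```python
# Hypothetical slope-dummy estimates: delta = intercept shift, gamma = slope shift.
delta, gamma = -2.0, -0.3

def wage_gap(educ):
    """Expected female-minus-male wage difference at a given education level."""
    return delta + gamma * educ

# The gap is not a single number; it widens with education here.
for e in (8, 12, 16):
    print(f"educ = {e:2d}: gap = {wage_gap(e):+.1f} $/hr")
```

With these values the gap is \(-4.4\) at 8 years of education but \(-6.8\) at 16 years, which is why any reported gap must state the education level at which it is evaluated.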

The slope dummy is the interaction term from the previous chapter applied to an indicator variable. The mechanics of computing marginal effects and testing significance carry over directly.

Interactive: Dummy Variable Regression Visualizer

Use the sliders to adjust the intercept shift (\(\delta\)) and slope shift (\(\gamma\)). Toggle between “intercept only” (parallel lines) and “intercept + slope” (different slopes) models. Watch the two regression lines update.

viewof delta_dummy = Inputs.range([-8, 4], {value: -3.2, step: 0.1, label: "δ (intercept shift)"})
viewof gamma_dummy = Inputs.range([-2, 2], {value: -0.4, step: 0.05, label: "γ (slope shift)"})
viewof model_type = Inputs.radio(["Intercept only", "Intercept + slope"], {value: "Intercept only", label: "Model type"})

dummy_plot = {
  const b1 = 5.0, b2 = 1.8;
  const delta = delta_dummy;
  const gamma = model_type === "Intercept + slope" ? gamma_dummy : 0;

  const educRange = d3.range(6, 21, 0.5);

  const refLine = educRange.map(e => ({educ: e, wage: b1 + b2 * e, group: "Reference (male)"}));
  const indLine = educRange.map(e => ({
    educ: e,
    wage: (b1 + delta) + (b2 + gamma) * e,
    group: "Indicator (female)"
  }));

  // Simulated data
  const rng = d3.randomLcg(88);
  const rnorm = d3.randomNormal.source(rng)(0, 3);
  const N = 120;
  const data = Array.from({length: N}, () => {
    const female = rng() > 0.5 ? 1 : 0;
    const educ = 8 + rng() * 12;
    const wage = b1 + b2 * educ + delta * female + gamma * female * educ + rnorm();
    return {educ, wage, group: female ? "Indicator (female)" : "Reference (male)"};
  });

  const slopeRef = b2;
  const slopeInd = b2 + gamma;
  const interceptRef = b1;
  const interceptInd = b1 + delta;

  return {refLine, indLine, data, slopeRef, slopeInd, interceptRef, interceptInd, gamma, delta};
}

Plot.plot({
  width: 650, height: 380,
  marginLeft: 50,
  x: {label: "Education (years)", domain: [6, 21]},
  y: {label: "Wage ($/hr)"},
  color: {domain: ["Reference (male)", "Indicator (female)"], range: ["#1E5A96", "#C41E3A"], legend: true},
  marks: [
    Plot.dot(dummy_plot.data, {x: "educ", y: "wage", fill: "group", r: 2.5, fillOpacity: 0.3}),
    Plot.line(dummy_plot.refLine, {x: "educ", y: "wage", stroke: "#1E5A96", strokeWidth: 2.5}),
    Plot.line(dummy_plot.indLine, {x: "educ", y: "wage", stroke: "#C41E3A", strokeWidth: 2.5})
  ]
})
html`<div style="margin-top:0.5em">
  <strong>Reference:</strong> intercept = ${dummy_plot.interceptRef.toFixed(1)}, slope = ${dummy_plot.slopeRef.toFixed(2)}<br/>
  <strong>Indicator:</strong> intercept = ${dummy_plot.interceptInd.toFixed(1)}, slope = ${dummy_plot.slopeInd.toFixed(2)}<br/>
  <strong>δ:</strong> ${dummy_plot.delta.toFixed(1)} &nbsp;|&nbsp;
  <strong>γ:</strong> ${dummy_plot.gamma.toFixed(2)}
  ${model_type === "Intercept only" ? html`<br/><em>Lines are parallel (intercept shift only). Switch to "Intercept + slope" to allow different slopes.</em>` : ""}
</div>`
Figure 20.2: Dummy variable regression visualizer. Adjust δ (intercept shift) and γ (slope shift) to see how indicator variables change the regression lines for each group.

20.4 The Chow Test

Do we actually need separate regressions for men and women, or is a single pooled regression adequate? The Chow test answers this with a joint \(F\)-test on all the indicator-related coefficients:

\[ H_0: \delta = 0 \text{ and } \gamma = 0 \qquad H_1: \text{at least one is nonzero} \tag{20.3}\]

Estimate the restricted model (pooled, no indicators) to get \(SSE_R\), and the unrestricted model (with all indicators and interactions) to get \(SSE_U\). The \(F\)-statistic follows the usual restricted-versus-unrestricted formula, \[ F = \frac{(SSE_R - SSE_U)/J}{SSE_U/(N-K)}, \] where \(J\) is the number of restrictions. The Chow test requires equal error variances across groups; if variances differ, the standard \(F\)-test is not valid.

WarningChow test assumes equal error variances

If \(\text{Var}(e_i \mid \text{male}) \neq \text{Var}(e_i \mid \text{female})\), the \(F\)-test distribution is wrong. In that case, use robust standard errors or the Goldfeld-Quandt variant.

Equivalently, \(SSE_U\) equals the sum of \(SSE\)s from running separate regressions for each group: \(SSE_U = SSE_{\text{male}} + SSE_{\text{female}}\).

Chow test = F-test on all group indicators. It tests whether the regression is structurally different across groups. If it rejects, you need separate slopes and/or intercepts.
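The equivalence \(SSE_U = SSE_{\text{male}} + SSE_{\text{female}}\) is worth checking numerically. A sketch on simulated data (not the textbook's dataset), comparing the fully interacted regression against two separate group regressions:

```python
import numpy as np

def sse(X, y):
    """Sum of squared OLS residuals."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.sum((y - X @ b) ** 2))

# Simulated data with different intercepts and slopes by group.
rng = np.random.default_rng(2)
n = 500
female = rng.integers(0, 2, n)
educ = rng.uniform(8, 20, n)
wage = 5 + 1.8 * educ - 3 * female - 0.2 * female * educ + rng.normal(0, 2, n)

# Unrestricted (fully interacted) model: intercept and slope dummies.
ones = np.ones(n)
sse_u = sse(np.column_stack([ones, female, educ, female * educ]), wage)

# Separate regressions for each group give the same total SSE.
m, f = female == 0, female == 1
sse_m = sse(np.column_stack([ones[m], educ[m]]), wage[m])
sse_f = sse(np.column_stack([ones[f], educ[f]]), wage[f])

print(sse_u, sse_m + sse_f)  # equal up to floating-point rounding
```

The fully interacted model places no cross-group restrictions, so it fits each group exactly as well as running the two regressions separately.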

20.5 Indicators in Log-Linear Models

When the dependent variable is in logs, the indicator coefficient has a percentage interpretation. In the model \(\ln(\text{wage}_i) = \beta_1 + \beta_2 \text{educ}_i + \delta\,\text{female}_i + e_i\), the rough approximation is that women's wages differ from men's by about \(100\delta\)% (for \(\delta < 0\), about \(100|\delta|\)% lower). For larger coefficients (\(|\delta| > 0.10\)), use the exact formula:

Theorem 20.1 (Exact Percentage Interpretation) \[ \text{Percentage difference} = 100(e^{\delta} - 1)\% \tag{20.4}\]

For example, \(\hat{\delta} = -0.178\) gives the rough approximation \(-17.8\)% and the exact calculation \(100(e^{-0.178} - 1) = -16.3\)%.

| \(\hat{\delta}\) | Rough: \(100\delta\)% | Exact: \(100(e^{\delta} - 1)\)% | Error |
|---|---|---|---|
| \(-0.05\) | \(-5.0\)% | \(-4.9\)% | 0.1 pp |
| \(-0.10\) | \(-10.0\)% | \(-9.5\)% | 0.5 pp |
| \(-0.20\) | \(-20.0\)% | \(-18.1\)% | 1.9 pp |
| \(-0.50\) | \(-50.0\)% | \(-39.3\)% | 10.7 pp |
| \(+0.30\) | \(+30.0\)% | \(+35.0\)% | 5.0 pp |

For \(|\delta| < 0.10\): the rough approximation is fine (error under 1 pp). For \(|\delta| \geq 0.10\): use the exact formula. On exams, use the exact formula to be safe unless told otherwise.
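The two calculations are one line each, reproducing the chapter's \(\hat{\delta} = -0.178\) example:

```python
import math

def pct_rough(d):
    """Rough approximation: 100*delta percent."""
    return 100 * d

def pct_exact(d):
    """Exact percentage difference implied by a log-linear dummy coefficient."""
    return 100 * (math.exp(d) - 1)

print(f"rough: {pct_rough(-0.178):.1f}%")   # -17.8%
print(f"exact: {pct_exact(-0.178):.1f}%")   # -16.3%
```

The exact figure is always algebraically larger than the rough one (since \(e^{\delta} - 1 > \delta\) for \(\delta \neq 0\)), so the rough approximation overstates losses and understates gains.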

20.6 Practice

A researcher tests whether the wage equation differs for Southern vs. non-Southern workers. The pooled model gives \(SSE_R = 214{,}400.9\), the fully interacted model gives \(SSE_U = 213{,}774.0\), the number of restrictions is \(J = 5\), and \(N - K = 1190\). Conduct the Chow test at \(\alpha = 0.05\).

\[ F = \frac{(214{,}400.9 - 213{,}774.0) / 5}{213{,}774.0 / 1190} = \frac{626.9 / 5}{179.6} = \frac{125.4}{179.6} = 0.698 \]

The critical value \(F_{(0.95, 5, 1190)} \approx 2.22\). Since \(0.698 < 2.22\), we fail to reject \(H_0\). The wage equation is not significantly different in the South. A single pooled regression is adequate; we do not need separate models.
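The same calculation, using the chapter's numbers, takes a few lines (the critical value comes from `scipy.stats.f`):

```python
from scipy.stats import f

# Chow test with the chapter's numbers: pooled vs fully interacted wage model.
sse_r, sse_u = 214_400.9, 213_774.0
J, df = 5, 1190

F = ((sse_r - sse_u) / J) / (sse_u / df)
crit = f.ppf(0.95, J, df)   # 5% critical value for F(5, 1190)

print(f"F = {F:.3f}, critical value = {crit:.2f}")
print("reject H0" if F > crit else "fail to reject H0")
```

Since the statistic falls well below the critical value, the pooled model is not rejected.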
