3  Omitted Variable Bias

What Happens When You Leave Out a Relevant Variable

Endogeneity
OLS Assumptions
Cross-Section
Author

Jake Anderson

Published

March 3, 2026

Modified

March 4, 2026

NotePrerequisites

This chapter assumes familiarity with OLS estimation and heteroskedasticity. For the solution to endogeneity, see Instrumental Variables.

The assumptions of Simple1 Linear Regression under random sampling are the following:

Assumption Description
RS1 The model is linear, i.e. \(y_{i}=\beta_{1}+\beta_{2} x_{i}+e_{i}\)
RS2 The data pairs \(\left(y_{i}, x_{i}\right)\) are independent and identically distributed (iid).
RS3 Exogeneity: \(E\left(e_{i} \mid x_{i}\right)=0\) for \(i=1, \ldots, N\)
RS4 Homoskedasticity: \(\operatorname{var}\left(e_{i} \mid x_{i}\right)=\sigma^{2}\).
RS5 Rank Condition: \(x_{i}\) takes at least two different values.
RS6 Normality: \(e_{i} \sim N\left(0, \sigma^{2}\right)\)

In each section of this course, we have relaxed one of these assumptions and explored how it affects the OLS estimator, as well as solutions on how we can address these issues. In this section, we will cover each of the following types of endogenous regressors that can cause issues in regression analysis:

  1. Omitted variable bias: When a relevant variable is left out of the model
  2. Simultaneity: When the regressor and dependent variable are determined simultaneously
  3. Measurement error: When the regressor is measured with error
  4. Selection bias: When the sample is not randomly selected
  5. Endogenous treatment effects: When the treatment assignment is correlated with unobservables

3.1 Omitted Variable Bias

Let’s take the example of test scores. What is important for predicting test scores? Let’s simplify it into two components:

  1. How prepared were you by the last course you took?
  2. How much time do you invest in the current course?

Let’s let your grade from last term be \(X_1\) and your study time in the current course be \(X_2\). Then your model might look like:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon\]

where \(Y\) is your test score and \(\varepsilon\) is the error term. Imagine two people who each claim that one of the variables doesn’t matter. Person 1 says “the past is irrelavent, it’s only about what you do now!” and person two says “you can’t learn anything without a good foundation!” Person 3 thinks they both have something to do with the outcome. Then they would run three competing regressions:

  1. Person 1’s model: \(Y = \beta_0 + \beta_1 X_1 + e_1\)
  2. Person 2’s model: \(Y = \beta_0 + \beta_2 X_2 + e_2\)
  3. Person 3’s model: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon\)

Let’s interpret the coefficients in each model. \(\beta_1\) represents the effect of \(X_1\) on \(Y\) in Person 1’s model, and \(\beta_2\) represents the effect of \(X_2\) on \(Y\) in Person 2’s model. In plain language, we say the following:

  1. In model (1) \(\beta_1\) is the expected change in test scores (\(Y\)) for a one unit increase in \(X_1\) (last term’s grade).
  2. In model (2) \(\beta_1\) is the expected change in test scores (\(Y\)) for a one unit increase in \(X_2\) (study time).
  3. In model (3) \(\beta_1\) is the expected change in test scores (\(Y\)) for a one unit increase in \(X_1\) (last term’s grade), holding \(X_2\) (study time) constant.
  4. In model (3) \(\beta_2\) is the expected change in test scores (\(Y\)) for a one unit increase in \(X_2\) (study time), holding \(X_1\) (last term’s grade) constant.

Imagine you run OLS on models (1), (2), and (3). Do you think that \(\hat{\beta_1}\) and \(\hat{\beta_2}\) will be the same in all three models? Can you think of situations where they would be the same, or when they would not be the same?

Some students have said “the coefficients can’t be the same, because there are more variables in the model!” But imagine a game with rules as follows:

  1. Each player starts at one end of a long hallway
  2. Each room you enter gives you 1 point, one point per room
  3. The rooms only connect to the hallway, not to one another
  4. You can enter as many rooms as you want
  5. You can only enter each room once

Let \(x_i, i \in \{1, 2, ... , k\}\) be the number of rooms you enter. Then the following model governs how many points total (\(Y\)) you get:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + \varepsilon \]

Each \(\beta_i\) represents the expected change in points (\(Y\)) for a one unit increase in \(x_i\) (number of rooms entered), holding all other \(x_j\) constant, and they are all equal to 1 (except for \(\beta_0\), since you start the game with 0 points). No matter what rooms you enter, you will always get 1 point per room. So the omitted variable does not matter at all! If every person’s choice to go into a room or not is random, then it will not affect the coefficient estimate.

Question: Assume there are two rooms, and the choice to go into a room is perfectly random (e.g. imagine a player flips a coin to decide whether to go into a room or not let the probability of entering a room be \(p\)). Consider the following regressions:

  1. \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\)
  2. \(y = \beta_0 + \beta_1 x_1 + \varepsilon\)

What will be the approximate values of \(\hat{\beta_1}\) and \(\hat{\beta_2}\) in each model?

Answer

The model will output \(\beta_1 \approx 1\) and \(\beta_0 \approx p\) where \(p\) is the probability of entering a room. Because \(x_1\) and \(x_2\) are independent, omitting one doesn’t bias the coefficient on the other, but it does change the intercept. We sometimes call these types of variables “orthogonal” to the other variables in the model (\(x_1 \perp x_2\)).

It seems plausible that both a good foundation (last term’s grade) and study time are important for success in the current course. Then what happens when we run the first two models? The missing covariate will be ``hidden” in the error term. That is to say, that model (1) believes that there are two things that contribute to test scores: last term’s grade and random stuff. Similarly, model (2) believes that there are two things that contribute to test scores: study time and random stuff. Remember that for our least squares estimator to have nice properties, we require several assumptions, one of which is SR3: \(E[\varepsilon|X] = 0\). In words, this assumption means that for every value of \(X\), I think that the average value of the error term is zero. Another way of putting this is to say “besides the variables \(X_1, X_2, ... , X_n\) everything else that causes or is associated with \(Y\) is basically random”

3.1.1 Simulation: Seeing OVB in Action

Let’s simulate data to see how omitted variable bias works in practice. We’ll generate data for 1000 students where we know the true data generating process:

\[Y = 30 + 0.65 \cdot X_1 + 0.40 \cdot X_2 + \varepsilon\]

where:

  • \(\beta_0 = 30\) (baseline test score)
  • \(\beta_1 = 0.65\) (effect of prior preparation, i.e., last term’s grade)
  • \(\beta_2 = 0.40\) (effect of study time in hours per week)
  • \(X_1\) and \(X_2\) are positively correlated (\(\rho \approx 0.4\))

These values are roughly consistent with findings in education economics (e.g. Stinebrickner, 2008).

3.1.2 Running the Three Models

Now let’s run Person 1’s model (omitting prior preparation), Person 2’s model (omitting study time), and Person 3’s full model:

3.1.2.1 Model 1: \(Y \sim X_1\) (Prior Grade Only)

Person 1 says: “The past is irrelevant!”


Call:
lm(formula = test_score ~ prior_grade, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-31.380  -4.997   0.532   6.055  28.079 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.72153    1.37453   29.63   <2e-16 ***
prior_grade  0.64043    0.01936   33.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.987 on 998 degrees of freedom
Multiple R-squared:  0.5231,    Adjusted R-squared:  0.5226 
F-statistic:  1095 on 1 and 998 DF,  p-value: < 2.2e-16

3.1.2.2 Model 2: \(Y \sim X_2\) (Study Hours Only)

Person 2 says: “You can’t learn without a foundation!”


Call:
lm(formula = test_score ~ study_hours, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.627  -6.082   0.772   6.670  28.615 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 43.07007    1.55915   27.62   <2e-16 ***
study_hours  1.43838    0.05215   27.58   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.803 on 998 degrees of freedom
Multiple R-squared:  0.4325,    Adjusted R-squared:  0.432 
F-statistic: 760.7 on 1 and 998 DF,  p-value: < 2.2e-16

3.1.2.3 Model 3: \(Y \sim X_1 + X_2\) (Full Model)

Person 3 says: “Both matter!”


Call:
lm(formula = test_score ~ prior_grade + study_hours, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-30.1569  -5.4313   0.5202   6.2202  27.7070 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.41209    1.44799  27.218  < 2e-16 ***
prior_grade  0.54652    0.03879  14.090  < 2e-16 ***
study_hours  0.26737    0.09580   2.791  0.00536 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.957 on 997 degrees of freedom
Multiple R-squared:  0.5268,    Adjusted R-squared:  0.5258 
F-statistic: 554.9 on 2 and 997 DF,  p-value: < 2.2e-16

3.1.3 Comparing the Estimates

Let’s put the estimates side by side:

Show code
results <- data.frame(
  Coefficient = c("beta_0 (Intercept)", "beta_1 (Prior Grade)", "beta_2 (Study Hours)"),
  True_Value = c(true_beta_0, true_beta_1, true_beta_2),
  Model_1 = c(coef(model_1)[1], coef(model_1)[2], NA),
  Model_2 = c(coef(model_2)[1], NA, coef(model_2)[2]),
  Model_3 = c(coef(model_3)[1], coef(model_3)[2], coef(model_3)[3])
)

results %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  knitr::kable(col.names = c("Coefficient", "True Value", "Model 1 ($X_1$ only)",
                              "Model 2 ($X_2$ only)", "Model 3 (Full)"),
               caption = "Comparison of coefficient estimates across models")
Comparison of coefficient estimates across models
Coefficient True Value Model 1 (\(X_1\) only) Model 2 (\(X_2\) only) Model 3 (Full)
(Intercept) beta_0 (Intercept) 30.00 40.7215 43.0701 39.4121
prior_grade beta_1 (Prior Grade) 0.65 0.6404 NA 0.5465
beta_2 (Study Hours) 0.40 NA 1.4384 0.2674

Notice that:

  1. Model 1 overestimates \(\beta_1\) because it attributes some of the effect of study time to prior preparation
  2. Model 2 overestimates \(\beta_2\) because it attributes some of the effect of prior preparation to study time
  3. Model 3 recovers estimates close to the true values

3.1.4 Computing the Omitted Variable Bias

The OVB formula tells us exactly how much bias we should expect. When we omit a variable \(X_2\) from our regression of \(Y\) on \(X_1\):

\[\text{Bias}(\hat{\beta}_1) = \beta_2 \cdot \frac{\text{Cov}(X_1, X_2)}{\text{Var}(X_1)} = \beta_2 \cdot \delta_{21}\]

where \(\delta_{21}\) is the slope from regressing \(X_2\) on \(X_1\).

Model 1 (omitting study hours):

  • \(\delta_{21}\) (slope of \(X_2\) on \(X_1\)): 0.3512
  • Theoretical OVB = \(\beta_2 \cdot \delta_{21}\) = 0.4 \(\times\) 0.3512 = 0.1405
  • Actual bias (\(\hat{\beta}_1^{m1} - \hat{\beta}_1^{m3}\)): 0.0939

Model 2 (omitting prior grade):

  • \(\delta_{12}\) (slope of \(X_1\) on \(X_2\)): 2.1427
  • Theoretical OVB = \(\beta_1 \cdot \delta_{12}\) = 0.65 \(\times\) 2.1427 = 1.3927
  • Actual bias (\(\hat{\beta}_2^{m2} - \hat{\beta}_2^{m3}\)): 1.171
NoteWhy is the bias so much larger for Model 2?

Notice that Model 2’s bias (1.17) is much larger than Model 1’s bias (0.09). This isn’t a coincidence—it follows directly from the OVB formula.

The bias equals the true coefficient of the omitted variable times the relationship between the variables. Since \(\beta_1 = 0.65\) (prior grade) is larger than \(\beta_2 = 0.40\) (study hours), omitting prior grade causes a bigger distortion than omitting study hours.

Takeaway: When deciding which variables you can “safely” omit from a regression, it matters how important those variables are. Omitting a variable with a large true effect will bias your estimates much more than omitting a variable with a small effect—even if both are correlated with your included regressors.

3.1.5 Visualizing the Bias

Let’s visualize what’s happening. We’ll plot the data and show the true regression line versus the naive estimates:

Show code
mean_study <- mean(data$study_hours)

data$true_fitted_x1 <- true_beta_0 + true_beta_1 * data$prior_grade + true_beta_2 * mean_study
data$naive_fitted_x1 <- predict(model_1, data)
data$full_fitted_x1 <- coef(model_3)[1] + coef(model_3)[2] * data$prior_grade + coef(model_3)[3] * mean_study

ggplot(data, aes(x = prior_grade, y = test_score)) +
  geom_point(alpha = 0.3, color = "gray40") +
  geom_line(aes(y = true_fitted_x1, color = "True (DGP)"), linewidth = 1.2) +
  geom_line(aes(y = naive_fitted_x1, color = "Naive (Model 1)"), linewidth = 1.2, linetype = "dashed") +
  geom_line(aes(y = full_fitted_x1, color = "Full (Model 3)"), linewidth = 1.2, linetype = "dotted") +
  scale_color_manual(values = c("True (DGP)" = "darkgreen",
                                 "Naive (Model 1)" = "red",
                                 "Full (Model 3)" = "blue")) +
  labs(x = "Prior Grade (X₁)",
       y = "Test Score (Y)",
       title = "OVB: Effect of Prior Grade on Test Score",
       subtitle = paste0("Naive slope = ", round(coef(model_1)[2], 3),
                        " | True slope = ", true_beta_1,
                        " | Bias = ", round(coef(model_1)[2] - true_beta_1, 3)),
       color = "Regression Line") +
  theme_minimal() +
  theme(legend.position = "bottom")

Effect of prior grade: True vs. Naive regression lines

Effect of prior grade: True vs. Naive regression lines
Show code
mean_prior <- mean(data$prior_grade)

data$true_fitted_x2 <- true_beta_0 + true_beta_1 * mean_prior + true_beta_2 * data$study_hours
data$naive_fitted_x2 <- predict(model_2, data)
data$full_fitted_x2 <- coef(model_3)[1] + coef(model_3)[2] * mean_prior + coef(model_3)[3] * data$study_hours

ggplot(data, aes(x = study_hours, y = test_score)) +
  geom_point(alpha = 0.3, color = "gray40") +
  geom_line(aes(y = true_fitted_x2, color = "True (DGP)"), linewidth = 1.2) +
  geom_line(aes(y = naive_fitted_x2, color = "Naive (Model 2)"), linewidth = 1.2, linetype = "dashed") +
  geom_line(aes(y = full_fitted_x2, color = "Full (Model 3)"), linewidth = 1.2, linetype = "dotted") +
  scale_color_manual(values = c("True (DGP)" = "darkgreen",
                                 "Naive (Model 2)" = "red",
                                 "Full (Model 3)" = "blue")) +
  labs(x = "Study Hours per Week (X₂)",
       y = "Test Score (Y)",
       title = "OVB: Effect of Study Hours on Test Score",
       subtitle = paste0("Naive slope = ", round(coef(model_2)[2], 3),
                        " | True slope = ", true_beta_2,
                        " | Bias = ", round(coef(model_2)[2] - true_beta_2, 3)),
       color = "Regression Line") +
  theme_minimal() +
  theme(legend.position = "bottom")

Effect of study hours: True vs. Naive regression lines

Effect of study hours: True vs. Naive regression lines

3.1.6 Summary: The Direction of OVB

The sign of the omitted variable bias depends on two things:

  1. The sign of the omitted variable’s true coefficient (\(\beta\))
  2. The sign of the correlation between the included and omitted variables
Sign of \(\beta\) Sign of Corr(\(X_{included}\), \(X_{omitted}\)) Direction of Bias
+ + Positive (overestimate)
+ - Negative (underestimate)
- + Negative (underestimate)
- - Positive (overestimate)

In our example, both \(\beta_1\) and \(\beta_2\) are positive, and \(X_1\) and \(X_2\) are positively correlated. Therefore, omitting either variable leads to positive bias (overestimation) of the remaining coefficient.

NoteA Great Excerise

Solidify this concept for yourself is to think of a situation that is clear to you for each of the other three cases.

Show code
summary_df <- data.frame(
  Model = c("Model 1 (X₁ only)", "Model 2 (X₂ only)"),
  Omitted_Variable = c("Study Hours (X₂)", "Prior Grade (X₁)"),
  Included_Coef_Estimate = c(round(coef(model_1)[2], 4), round(coef(model_2)[2], 4)),
  True_Value = c(true_beta_1, true_beta_2),
  Bias = c(round(coef(model_1)[2] - true_beta_1, 4), round(coef(model_2)[2] - true_beta_2, 4)),
  Bias_Direction = c("Positive (overestimate)", "Positive (overestimate)")
)

knitr::kable(summary_df,
             col.names = c("Model", "Omitted Variable", "Estimate", "True Value", "Bias", "Direction"),
             caption = "Summary of OVB in each misspecified model")
Summary of OVB in each misspecified model
Model Omitted Variable Estimate True Value Bias Direction
prior_grade Model 1 (X₁ only) Study Hours (X₂) 0.6404 0.65 -0.0096 Positive (overestimate)
study_hours Model 2 (X₂ only) Prior Grade (X₁) 1.4384 0.40 1.0384 Positive (overestimate)

3.2 Simultaneity

When price and quantity are determined simultaneously in equilibrium, OLS on either the demand or supply equation is biased. This is covered in detail with a worked Bruin Boba example in Simultaneous Equations.

3.3 Slide Deck

TipWhat’s next?

When a regressor is measured with error instead of omitted entirely, see Measurement Error. For the instrumental variables solution to endogeneity, see Instrumental Variables. For panel data approaches, see Fixed Effects.


  1. A reminder that there is nothing “Simple” about “Simple Linear Regression”. It’s merely the terminology from the text. It’s also important to note that the order of these assumptions is slighly different from chapter 2: SR1 ↔︎ RS1, SR2 ↔︎ RS3: Both are exogeneity, but SR2 is stricter (conditions on all (x)); RS3 conditions only on (x_i). SR3 ↔︎ RS4, SR4 ↔︎ RS2, SR5 ↔︎ RS5, SR6 ↔︎ RS6.↩︎