Midterm 2 Answers
Worked Solutions with Explanations
A Note on Versions
There are two versions. This follows the “purple” version numbering, and I will put the blue version question numbers and answers in parentheses. CAE exams are blue exams (though they were printed on white paper).
Question 1
(Blue Version Question 4)
In a regression model with random regressors, which assumption ensures that OLS is unbiased and consistent?
- \(E(x)=0\)
- \(\operatorname{Cov}(x, e)=0\)
- \(\operatorname{Var}(x)=\sigma^{2}\)
- \(x\) and \(y\) are independent
(b) \(\operatorname{Cov}(x, e)=0\)
This is the professor’s answer. \(\operatorname{Cov}(x,e)=0\) ensures that OLS is consistent (\(\hat{\beta} \xrightarrow{p} \beta\)). The professor considers this sufficient for both unbiasedness and consistency. (Blue Version (c))
The professor’s answer is wrong on unbiasedness. This is not an edge case — it is a standard textbook distinction (Wooldridge Ch. 5, Greene Ch. 4):
- Unbiasedness requires \(E[u \mid X] = 0\) (mean independence, or fixed regressors)
- Consistency requires only \(\operatorname{Cov}(X, u) = 0\) (plus regularity conditions)
\(\operatorname{Cov}(X, u) = 0\) is strictly weaker than \(E[u \mid X] = 0\). The following counterexample shows the gap is real, not hypothetical.
Counterexample: Let
\[ \begin{aligned} X_i &\stackrel{iid}{\sim} \mathcal{N}(0,1) \\ y_i &= \beta X_i + u_i \\ u_i &= X_i^2 - 1 \end{aligned} \]
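Note that \(\operatorname{Cov}(u, X) = E[uX] - E[u]E[X] = E[X^3 - X] - E[X^2 - 1]E[X] = 0\), since \(E[X] = E[X^3] = 0\) and \(E[u] = E[X^2] - 1 = 0\). However, \(E[u \mid X] = X^2 - 1 \neq 0\), so mean independence fails.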
Bias
The OLS estimator is
\[\hat{\beta} = \frac{\sum_{i=1}^n X_i y_i}{\sum_{i=1}^n X_i^2} = \frac{\sum_{i=1}^n X_i (\beta X_i + X_i^2 - 1)}{\sum_{i=1}^n X_i^2} = \beta + \frac{\sum_{i=1}^n X_i^3}{\sum_{i=1}^n X_i^2} - \frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n X_i^2}\]
Taking expectations conditional on \(X\):
\[E[\hat{\beta} \mid X] = \beta + \frac{\sum_{i=1}^n X_i^3}{\sum_{i=1}^n X_i^2} - \frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n X_i^2}\]
This is not equal to \(\beta\) for generic \(X\). For example, if \(X = (0.50, -0.14, 0.65)\) (a random draw from \(\mathcal{N}(0,1)\)), then \(\sum X_i^3 \approx 0.397\), \(\sum X_i = 1.010\), \(\sum X_i^2 \approx 0.692\), and
\[E[\hat{\beta} \mid X] = \beta + \frac{0.397}{0.692} - \frac{1.010}{0.692} \approx \beta - 0.89 \neq \beta\]
Consistency
By the law of large numbers,
\[\hat{\beta} = \beta + \frac{\sum_{i=1}^n X_i^3}{\sum_{i=1}^n X_i^2} - \frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n X_i^2} \xrightarrow{p} \beta + \frac{E[X^3]}{E[X^2]} - \frac{E[X]}{E[X^2]} = \beta + \frac{0}{1} - \frac{0}{1} = \beta\]
Conclusion
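\(\operatorname{Cov}(u, X) = 0\) is sufficient for consistency but not for unbiasedness. The correct answer to the question as stated (unbiased and consistent) is (d) \(x\) and \(y\) are independent. If \(\beta\) is read as the population regression slope, independence forces \(\beta = 0\), so the error is just \(y\) net of its mean, which is independent of \(X\); hence \(E[u \mid X] = 0\) and both properties follow. None of the listed options cleanly states \(E[u \mid X] = 0\), and (d) delivers it only in this degenerate no-relationship case, but it is the only option that implies it. \(\square\)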
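The two calculations in the counterexample can be checked numerically. The following is a minimal base-R sketch, not part of the exam: the true slope `beta`, the seed, and the large-sample size are illustrative assumptions.

```r
# Conditional bias for the three-point sample quoted in the text.
x <- c(0.50, -0.14, 0.65)
# With u = x^2 - 1, beta_hat - beta = sum(x * u) / sum(x^2).
cond_bias <- sum(x * (x^2 - 1)) / sum(x^2)
print(round(cond_bias, 2))   # about -0.89

# Consistency: the same ratio shrinks toward zero in a large sample.
set.seed(1)                       # arbitrary seed (assumption)
beta  <- 2                        # hypothetical true slope
x_big <- rnorm(1e5)
u_big <- x_big^2 - 1              # Cov(x, u) = 0 but E[u | x] != 0
y_big <- beta * x_big + u_big
beta_hat <- sum(x_big * y_big) / sum(x_big^2)   # OLS through the origin
print(beta_hat - beta)
```

The first number is the bias conditional on that particular sample. Because \(\mathcal{N}(0,1)\) is symmetric, redrawing \(X\) many times would average the two ratio terms toward zero, so the sketch deliberately reports the conditional quantity used in the text rather than a Monte Carlo mean of \(\hat{\beta}\).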
Question 2
(Blue Version Question 22)
Below is output from an IV regression estimating the effect of education (EDUC) on log wages, using distance to college (DISTANCE) as an instrument:
Call:
ivreg(formula = log(wage) ~ educ + exper | exper + distance)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.5820 0.4250 8.427 <2e-16 ***
educ 0.1320 0.0520 2.538 0.0118 *
exper 0.0450 0.0085 5.294 2.1e-07 ***
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 1 497 8.234 0.00425 **
Wu-Hausman 1 496 0.892 0.34520
Sargan 0 NA NA NA
What is the primary concern with this IV estimation?
- The instrument likely fails the weak instruments test (\(F<10\))
- The Wu-Hausman test suggests EDUC is not endogenous
- The Sargan test indicates invalid instruments
- The coefficient on EXPER is too large
(a) The instrument likely fails the weak instruments test (\(F<10\))
The F-statistic for weak instruments is \(8.234 < 10\) (rule-of-thumb threshold), so DISTANCE may be a weak instrument.
- Wu-Hausman: \(p = 0.345\) (not significant), but this test has low power with weak instruments, so the result is unreliable.
- Sargan: NA because the model is exactly identified (1 instrument, 1 endogenous variable).
- EXPER coefficient: \(0.045\) is a reasonable return to experience.
Key Point: Always check the weak instruments F-statistic first. If \(F < 10\), the IV estimates and the other diagnostics should be treated with caution. (Blue Version (b))
Question 3
(Blue Version Question 23)
Why is the Sargan test “NA” in the output from Question 2?
- The sample size is too small
- The first-stage is too weak
- There is only one instrument and one endogenous variable (exactly identified)
- The instruments are perfectly correlated
(c) There is only one instrument and one endogenous variable (exactly identified)
The Sargan test checks the validity of overidentifying restrictions. It requires more instruments than endogenous variables.
- Endogenous variables: 1 (EDUC)
- Excluded instruments: 1 (DISTANCE)
- \(1 = 1 \implies\) exactly identified \(\implies\) Sargan is undefined
Key Point: The Sargan test needs at least one “extra” instrument beyond what is needed for identification. Degrees of freedom \(= L - K\), where \(L\) = number of instruments, \(K\) = number of endogenous variables. (Blue Version (d))
Question 4
(Blue Version Question 5)
Suppose we have a system of two structural equations, where \(y_1\) and \(y_2\) are endogenous and \(x_1\), \(x_2\) are exogenous:
\[y_1 = \alpha_1 y_2 + \beta_1 x_1 + e_1, \qquad y_2 = \alpha_2 y_1 + \beta_2 x_2 + e_2\]
Which equation represents a correct reduced-form equation for \(\widehat{y}_1\)?
- \(\widehat{y}_1 = \widehat{\pi}_1 x_1 + \widehat{\pi}_2 x_2 + v_1\)
- \(\widehat{y}_1 = \widehat{\pi}_1 \widehat{x}_1 + \widehat{\pi}_2 \widehat{x}_2 + v_1\)
- \(\widehat{y}_1 = \widehat{\theta}_1 \widehat{y}_2 + \widehat{\pi}_1 x_1 + \widehat{\pi}_1 x_2 + v_1\)
- \(\widehat{y}_1 = \widehat{\theta}_1 \widehat{y}_2 + \widehat{\pi}_1 \widehat{x}_1 + \widehat{\pi}_2 \widehat{x}_2 + v_1\)
(a) \(\widehat{y}_1 = \widehat{\pi}_1 x_1 + \widehat{\pi}_2 x_2 + v_1\)
A reduced-form equation expresses an endogenous variable as a function of only exogenous variables. Substituting Eq. 2 into Eq. 1 and solving:
\[y_1 = \underbrace{\frac{\beta_1}{1 - \alpha_1\alpha_2}}_{\pi_1} x_1 + \underbrace{\frac{\alpha_1 \beta_2}{1 - \alpha_1\alpha_2}}_{\pi_2} x_2 + v_1\]
- (b) Wrong: \(x\) variables are exogenous and should not have hats
- (c), (d) Wrong: include \(\widehat{y}_2\) on RHS — not a reduced form
Key Point: Reduced form = only exogenous variables on the RHS. (Blue Version (b))
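The reduced-form coefficients derived above can be checked by simulation. This is a sketch under assumed values: the structural parameters, the seed, and the sample size are hypothetical and chosen only so that \(1 - \alpha_1\alpha_2 \neq 0\).

```r
set.seed(42)
n  <- 50000
a1 <- 0.5; a2 <- -0.4      # hypothetical alpha_1, alpha_2
b1 <- 1.0; b2 <- 3.0       # hypothetical beta_1, beta_2
x1 <- rnorm(n); x2 <- rnorm(n)
e1 <- rnorm(n); e2 <- rnorm(n)

# Solve the two structural equations for y1 (the reduced form).
den <- 1 - a1 * a2
y1  <- (b1 * x1 + a1 * b2 * x2 + e1 + a1 * e2) / den
y2  <- a2 * y1 + b2 * x2 + e2     # second structural equation

pi_true <- c(x1 = b1 / den, x2 = a1 * b2 / den)
pi_hat  <- coef(lm(y1 ~ 0 + x1 + x2))   # no intercept, as in the system above
print(rbind(pi_true, pi_hat))
```

Regressing \(y_1\) on the exogenous variables alone recovers \(\pi_1\) and \(\pi_2\) up to sampling noise, which is why the reduced form can be estimated by OLS even though the structural equations cannot.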
Question 5
(Blue Version Question 20)
Below is R code computing the correlation between potential instruments and an endogenous variable (PRICE):
> cor(housing_data$property_tax, housing_data$price)
[1] 0.7823
> cor(housing_data$mortgage_rate, housing_data$price)
[1] 0.0234
Which variable would likely be a better instrument for PRICE based on the relevance condition?
- MORTGAGE_RATE because the correlation is close to zero
- PROPERTY_TAX because it has a strong correlation with PRICE
- Both are equally good
- Cannot determine without testing exogeneity
(b) PROPERTY_TAX because it has a strong correlation with PRICE
The relevance condition requires \(\operatorname{Cov}(z, x) \neq 0\). A stronger correlation means a stronger instrument.
- PROPERTY_TAX: \(r = 0.7823\) (very strong)
- MORTGAGE_RATE: \(r = 0.0234\) (essentially zero)
Common Error: (a) confuses relevance with exogeneity. A correlation close to zero means the instrument is irrelevant, not exogenous.
Key Point: The question asks about relevance only. A valid instrument needs both relevance and exogeneity, but here we evaluate relevance alone. (Blue Version (b))
Question 6
(Blue Version Question 21)
The instrumental variables (IV) estimator for the simple regression model is:
\[\widehat{\beta}_{2,IV} = \frac{\sum_i (z_i - \bar{z})(y_i - \bar{y})}{\sum_i (z_i - \bar{z})(x_i - \bar{x})}\]
What would cause this estimator to be undefined or unreliable?
- \(z\) is not correlated with \(x\) (weak instrument)
- \(z\) is correlated with \(e\)
- The sample size is too large
- Both (a) and (b)
(d) Both (a) and (b)
Both (a) and (b) cause problems, but they break the estimator in different ways:
- (a) Relevance failure: If \(\operatorname{Cov}(z,x) \approx 0\), the denominator \(\to 0\), making \(\widehat{\beta}_{2,IV}\) undefined (or explosive in finite samples). This is a weak/irrelevant instrument problem.
- (b) Exogeneity failure: If \(\operatorname{Cov}(z,e) \neq 0\), the estimator remains numerically well-defined but converges to the wrong value: \(\hat{\beta}_{2,IV} \xrightarrow{p} \beta_2 + \frac{\operatorname{Cov}(z,e)}{\operatorname{Cov}(z,x)}\). This makes it inconsistent.
- (c) A large sample size actually improves the estimator (consistency is asymptotic).
Since the question asks about “undefined or unreliable,” both (a) and (b) qualify — (a) makes it undefined, (b) makes it unreliable — so (d) is correct.
Key Point: A valid instrument must satisfy both relevance (\(\operatorname{Cov}(z,x) \neq 0\)) and exogeneity (\(\operatorname{Cov}(z,e) = 0\)). Violating relevance destroys the denominator; violating exogeneity contaminates the numerator. (Blue Version (d))
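To see the exogeneity failure numerically, here is a small base-R sketch. The helper `iv_slope`, the data-generating values, and the seed are illustrative assumptions, not part of the exam.

```r
# Sample version of the IV formula: Cov(z, y) / Cov(z, x).
iv_slope <- function(z, x, y) cov(z, y) / cov(z, x)

set.seed(7)
n     <- 100000
beta2 <- 1.5                       # hypothetical true slope
z <- rnorm(n); e <- rnorm(n)
x <- 0.8 * z + 0.6 * e + rnorm(n)  # x is endogenous: it depends on e

# Valid instrument: z is unrelated to the error.
y_valid <- 2 + beta2 * x + e
b_valid <- iv_slope(z, x, y_valid)

# Invalid instrument: the error now loads on z, so Cov(z, error) = 0.5.
e_bad <- e + 0.5 * z
y_bad <- 2 + beta2 * x + e_bad
b_bad <- iv_slope(z, x, y_bad)

print(c(valid = b_valid, invalid = b_bad,
        predicted_invalid = beta2 + 0.5 / 0.8))
```

With \(\operatorname{Cov}(z,x) = 0.8\) by construction, the invalid case should settle near \(1.5 + 0.5/0.8 = 2.125\), matching the probability limit given above.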
Question 7
(Blue Version Question 24)
Below is output from an IV regression with TWO instruments (TAX1 and TAX2) for one endogenous variable (PRICE):
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 2 345 156.89 < 2e-16 ***
Wu-Hausman 1 344 5.23 0.0227 *
Sargan 1 NA 8.45 0.0037 **
What should you conclude from the Sargan test result?
- The instruments pass the validity test
- At least one instrument appears to be invalid (fails exogeneity)
- Both instruments are weak
- The model is not overidentified
(b) At least one instrument appears to be invalid (fails exogeneity)
Sargan test: \(H_0\): All instruments are valid. \(H_1\): At least one is invalid.
- Sargan statistic \(= 8.45\), \(p = 0.0037 < 0.05\) \(\implies\) Reject \(H_0\)
- At least one instrument fails the exogeneity requirement
Why others are wrong:
- (c) Instruments are strong: \(F = 156.89 \gg 10\)
- (d) Model is overidentified: \(2 - 1 = 1\) overidentifying restriction
Key Point: Sargan \(p < 0.05\) means reject validity. Investigate which instrument is problematic. (Blue Version (b))
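For intuition on what the Sargan statistic measures, it can be built by hand as \(n R^2\) from a regression of the 2SLS residuals on all instruments. The sketch below uses simulated data in which one instrument is deliberately invalid; the variable names, coefficients, and seed are assumptions, and it does not reproduce the exam output.

```r
set.seed(123)
n  <- 2000
z1 <- rnorm(n); z2 <- rnorm(n); u <- rnorm(n)
x  <- z1 + z2 + 0.5 * u + rnorm(n)   # endogenous regressor
y  <- 1 + 2 * x + 1.0 * z2 + u       # z2 enters y directly, so it is NOT exogenous

# 2SLS by hand: first stage, then second stage on the fitted values.
x_hat <- fitted(lm(x ~ z1 + z2))
b     <- coef(lm(y ~ x_hat))
res   <- y - b[1] - b[2] * x         # 2SLS residuals use the actual x, not x_hat

# Sargan statistic: n * R^2 from regressing the residuals on all instruments.
r2     <- summary(lm(res ~ z1 + z2))$r.squared
sargan <- n * r2
df     <- 2 - 1                      # instruments minus endogenous regressors
p_val  <- pchisq(sargan, df = df, lower.tail = FALSE)
print(c(statistic = sargan, p_value = p_val))
```

A small p-value here says the instruments disagree with each other about the slope. It does not say which one is at fault.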
Question 8
(Blue Version Question 25)
Below are diagnostics from an IV regression with three instruments for one endogenous variable:
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 3 246 245.67 <2e-16 ***
Wu-Hausman 1 245 12.45 0.00048 ***
Sargan 2 NA 1.23 0.54120
Based on these results, what should you conclude?
- Use OLS instead of IV (Wu-Hausman not significant)
- Do not use IV; instruments do not pass the Sargan test
- Use IV; instruments are strong, endogeneity is present, and overidentifying restrictions are valid
- Need more instruments
(c) Use IV; instruments are strong, endogeneity is present, and overidentifying restrictions are valid
Interpret each diagnostic in order:
- Weak instruments: \(F = 245.67 \gg 10\) \(\implies\) instruments are strong
- Wu-Hausman: \(p = 0.00048 < 0.05\) \(\implies\) reject exogeneity of the endogenous regressor \(\implies\) IV is needed (rules out OLS)
- Sargan: \(p = 0.541 > 0.05\) \(\implies\) fail to reject \(H_0\) \(\implies\) no evidence against the overidentifying restrictions
Key Point: This is the ideal scenario for IV: strong instruments, confirmed endogeneity, and valid overidentifying restrictions. Proceed with confidence. (Blue Version (a))
Question 9
(Blue Version Question 7)
Consider a house price model: PRICE \(= \beta_1 + \beta_2\,\text{SQFT} + \beta_3\,\text{BATHS} + e\). Suppose SQFT is measured with error. If we have an instrument for SQFT, the IV estimator will:
- Have smaller standard errors than OLS
- Be identical to OLS if the measurement error is small
- Always be more efficient than OLS
- Have larger standard errors than OLS but be consistent
(d) Have larger standard errors than OLS but be consistent
When SQFT is measured with error, OLS suffers from attenuation bias (coefficient biased toward zero). OLS is inconsistent.
IV removes this bias using a valid instrument but at a cost:
- IV uses only the variation in \(x\) explained by \(z\), discarding some information
- Result: larger standard errors (less efficient) but consistent estimates
Why others are wrong:
- (a)/(c) IV is always less efficient (larger SEs) than OLS
- (b) OLS is inconsistent regardless of how small the measurement error is
Key Point: The IV trade-off is always consistency for efficiency. IV standard errors > OLS standard errors, but IV is consistent when OLS is not. (Blue Version (a))
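The attenuation bias and its IV fix can be illustrated with simulated data. This sketch drops BATHS for brevity and uses a second noisy measurement of size as the instrument; all names, units, and parameter values are hypothetical.

```r
set.seed(2024)
n      <- 100000
beta2  <- 100                          # hypothetical true effect of SQFT on PRICE
sqft   <- rnorm(n, mean = 20, sd = 5)  # true size (hypothetical units)
price  <- 50 + beta2 * sqft + rnorm(n, sd = 100)
sqft_m <- sqft + rnorm(n, sd = 5)      # mismeasured SQFT that is actually observed
z      <- sqft + rnorm(n, sd = 5)      # second, independent noisy measurement (instrument)

b_ols <- coef(lm(price ~ sqft_m))[["sqft_m"]]
b_iv  <- cov(z, price) / cov(z, sqft_m)

# Attenuation factor: Var(true) / (Var(true) + Var(noise)) = 25 / (25 + 25) = 0.5
print(c(ols = b_ols, iv = b_iv, ols_predicted = beta2 * 0.5))
```

OLS lands near half the true coefficient, while the IV ratio lands near the true value. The sketch shows only the consistency side of the trade-off, not the larger IV standard errors.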
Question 10
(Blue Version Question 6)
Below is output from a first-stage regression for a housing price model where LOT_SIZE is endogenous and LOCAL_TAX is the instrument:
Call: lm(lot_size ~ bedrooms + local_tax)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2450.33 325.67 7.524 1.2e-12 ***
bedrooms 245.89 45.23 5.437 8.9e-08 ***
local_tax -15.67 18.92 -0.828 0.409
---
Residual standard error: 1250 on 247 DF
F-statistic: 12.45 on 2 and 247 DF
Based on this output, what can you conclude about LOCAL_TAX as an instrument?
- It is likely a weak instrument because it is not significant (\(p=0.409\))
- It cannot be evaluated without the second-stage results
- It is valid because the coefficient is negative
- It is a strong instrument because the F-statistic is above 10
(a) It is likely a weak instrument because it is not significant (\(p=0.409\))
To evaluate instrument strength, look at the excluded instrument’s significance, not the overall F-statistic.
- LOCAL_TAX: \(t = -0.828\), \(p = 0.409\) — not significant
- Partial F on LOCAL_TAX: \((-0.828)^2 \approx 0.686 \ll 10\)
Note: the equivalence \(F = t^2\) holds here because there is a single excluded instrument. With multiple excluded instruments, you would need a joint F-test on all excluded instruments together.
Common Mistake (d): The overall F-statistic of \(12.45\) tests whether all regressors jointly predict LOT_SIZE. BEDROOMS is an included exogenous variable, not an instrument. The relevant test is LOCAL_TAX alone.
Key Point: Instrument strength = significance of the excluded instrument in the first stage, not the overall F-stat. (Blue Version (b))
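The \(F = t^2\) equivalence for a single excluded instrument can be confirmed directly in R. The data below are simulated under the assumption that the instrument has no effect in the first stage; they are not the exam data, and the coefficients are hypothetical.

```r
set.seed(99)
n <- 250
bedrooms  <- rpois(n, 3)
local_tax <- rnorm(n, mean = 10, sd = 4)
# Hypothetical first stage in which local_tax has no effect on lot_size.
lot_size  <- 2400 + 250 * bedrooms + 0 * local_tax + rnorm(n, sd = 1250)

unrestricted <- lm(lot_size ~ bedrooms + local_tax)
restricted   <- lm(lot_size ~ bedrooms)          # drops the excluded instrument

t_tax     <- summary(unrestricted)$coefficients["local_tax", "t value"]
partial_F <- anova(restricted, unrestricted)$F[2]
overall_F <- summary(unrestricted)$fstatistic[["value"]]

print(c(t_squared = t_tax^2, partial_F = partial_F, overall_F = overall_F))
```

The first two numbers agree exactly, while the overall F also reflects the explanatory power of `bedrooms`, which says nothing about instrument strength.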
Question 11
(Blue Version Question 10)
When you have more instruments than endogenous regressors (overidentification), you can:
- Choose the best instrument and discard the others
- Use all instruments and test their validity with a Sargan/Hansen test
- Only use 2SLS if you have exactly as many instruments as endogenous variables
- Average the results from using each instrument separately
(b) Use all instruments and test their validity with a Sargan/Hansen test
When overidentified (more instruments than endogenous variables):
- Use all instruments simultaneously in 2SLS — more efficient than any single instrument
- Test validity with the Sargan/Hansen J-test of overidentifying restrictions
Why others are wrong:
- (a) Discarding instruments wastes information
- (c) 2SLS is designed for overidentification
- (d) Averaging separate IV estimates is not standard or efficient
Key Point: Overidentification is desirable — it allows both efficiency gains and validity testing. (Blue Version (c))
Question 12
(Blue Version Question 13)
In R, to estimate a 2SLS model using the ivreg function from the AER package:
model <- ivreg(wage ~ educ + exper | exper + sibling_educ)
What is the endogenous variable and what is the instrument?
- Endogenous: wage, Instrument: sibling_educ
- Endogenous: exper, Instrument: sibling_educ
- Endogenous: educ, Instrument: sibling_educ
- Endogenous: sibling_educ, Instrument: educ
(c) Endogenous: educ, Instrument: sibling_educ
The ivreg formula syntax: y ~ x1 + x2 | z1 + z2
- Left of |: structural equation regressors
- Right of |: all exogenous variables (instruments + included exogenous)
exper appears on both sides \(\implies\) exogenous (instruments for itself). educ appears on left but not right \(\implies\) endogenous variable. sibling_educ appears on right but not left \(\implies\) excluded instrument.
Key Point: Variables on the left of | but absent from the right are endogenous. Variables on the right but absent from the left are excluded instruments. (Blue Version (d))
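The same rule can be applied mechanically. The sketch below uses only base R to split the formula at `|` and compare the two sides; it is an illustration of the reading rule, not how `ivreg` itself parses the formula.

```r
f <- wage ~ educ + exper | exper + sibling_educ

rhs        <- f[[3]]                 # the call `|`(educ + exper, exper + sibling_educ)
regressors <- all.vars(rhs[[2]])     # variables left of |
instrs     <- all.vars(rhs[[3]])     # variables right of |

endogenous <- setdiff(regressors, instrs)     # on the left only
excluded   <- setdiff(instrs, regressors)     # on the right only
included   <- intersect(regressors, instrs)   # on both sides
print(list(endogenous = endogenous, excluded = excluded,
           included_exogenous = included))
```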
Question 13
(Blue Version Question 14)
A researcher estimates a wage equation and finds that the 2SLS estimate of the returns to schooling is \(0.06\) (6%), while the OLS estimate is \(0.11\) (11%). If ability is an omitted variable that positively affects both schooling and wages, this result is:
- Surprising, because IV should give a larger estimate
- Expected, because OLS has upward bias from omitted ability
- Evidence that the instrument is invalid
- Evidence that schooling is not endogenous
(b) Expected, because OLS has upward bias from omitted ability
Classic omitted variable bias. Ability is positively correlated with both schooling and wages:
\[\text{bias} = \frac{\operatorname{Cov}(\text{schooling}, \text{ability})}{\operatorname{Var}(\text{schooling})} \cdot \gamma_{\text{ability}} > 0\]
\[\hat{\beta}_{\text{OLS}} = \beta_{\text{true}} + \text{positive bias} \implies \hat{\beta}_{\text{OLS}} > \beta_{\text{true}}\]
IV removes this bias: \(\hat{\beta}_{2SLS} = 0.06 < \hat{\beta}_{OLS} = 0.11\) is exactly what we expect.
Key Point: When OVB is positive, OLS overestimates the effect. IV gives a consistent estimate of the causal effect (assuming a valid instrument). (Blue Version (c))
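The bias formula can be checked against a simulated wage equation. In this sketch the true return (0.06), the ability effect, and the schooling process are all assumed values, picked so that the short-regression slope comes out near the 0.11 in the question.

```r
set.seed(11)
n <- 100000
ability   <- rnorm(n)
schooling <- 12 + 2 * ability + rnorm(n, sd = 2)   # ability raises schooling
beta_true <- 0.06                                   # hypothetical return to schooling
gamma     <- 0.20                                   # hypothetical wage effect of ability
log_wage  <- 1 + beta_true * schooling + gamma * ability + rnorm(n, sd = 0.3)

b_short <- coef(lm(log_wage ~ schooling))[["schooling"]]   # ability omitted
bias_formula <- cov(schooling, ability) / var(schooling) * gamma

print(c(ols_short = b_short, true_plus_bias = beta_true + bias_formula))
```

Here \(\operatorname{Cov}(\text{schooling}, \text{ability}) / \operatorname{Var}(\text{schooling}) = 2/8\), so the predicted bias is \(0.25 \times 0.20 = 0.05\).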
Question 14
(Blue Version Question 15)
In a simultaneous equations system, endogenous variables are:
- Determined outside the system
- Determined jointly within the system
- Always equal to the error terms
- Independent of each other
(b) Determined jointly within the system
By definition, endogenous variables are those whose values are determined jointly within the system. Example: in supply-demand, \(P\) and \(Q\) are both determined by the intersection.
- (a) Describes exogenous variables
- (c) Error terms are random disturbances, not endogenous variables
- (d) Endogenous variables are not independent: joint determination is what makes them endogenous
Key Point: Endogenous = jointly determined within the model. Exogenous = determined outside. (Blue Version (c))
Question 15
(Blue Version Question 16)
Consider a supply and demand model for corn:
\[\text{Demand: } Q = \alpha_1 + \alpha_2 P + \alpha_3\,\text{INCOME} + e_d\] \[\text{Supply: } Q = \beta_1 + \beta_2 P + \beta_3\,\text{RAINFALL} + e_s\]
In this system, which variables are endogenous?
- INCOME and RAINFALL
- \(P\) and \(Q\)
- \(e_d\) and \(e_s\)
- INCOME, RAINFALL, \(P\), and \(Q\)
(b) \(P\) and \(Q\)
\(P\) and \(Q\) are determined jointly by the intersection of supply and demand — they are endogenous.
- INCOME and RAINFALL are exogenous (determined outside the model; they shift the curves)
- \(e_d\) and \(e_s\) are random disturbances, not variables
Key Point: In supply-demand models, price and quantity are always the endogenous variables. Demand/supply shifters are exogenous.
The professor’s purple answer key lists (d), which includes INCOME and RAINFALL as endogenous. This is incorrect — INCOME and RAINFALL are exogenous by the structure of the model. They serve as the excluded instruments that identify each equation.
(Blue Version (a))
Question 16
(Blue Version Question 17)
Why does OLS fail when estimating a single equation from a simultaneous system?
- The sample size is too small
- The endogenous right-hand side variables are correlated with the error term
- The exogenous variables are correlated with each other
- The errors are heteroskedastic
(b) The endogenous right-hand side variables are correlated with the error term
In a simultaneous system, the endogenous RHS variable (e.g., \(P\) in the demand equation) is determined jointly with \(Q\). Because \(P\) depends on \(e_d\) through the equilibrium:
\[\operatorname{Cov}(P, e_d) \neq 0\]
This violates the key OLS assumption and causes simultaneity bias: OLS is both biased and inconsistent.
Why others are wrong:
- (a) OLS fails in simultaneous systems regardless of sample size: this is a structural problem, not a finite-sample one
- (c) Multicollinearity among exogenous variables inflates standard errors but does not cause bias or inconsistency
- (d) Heteroskedasticity affects efficiency and standard errors but does not cause OLS to be biased or inconsistent
Key Point: Simultaneity \(\implies\) endogenous RHS variable is correlated with the error \(\implies\) need IV/2SLS. (Blue Version (a))
Question 17
(Blue Version Question 2)
The reduced-form equation for price in a supply-demand system expresses:
- \(P\) as a function of \(Q\) only
- \(P\) as a function of all exogenous variables only
- \(P\) as a function of \(e_d\) and \(e_s\) only
- \(P\) as a function of both endogenous and exogenous variables
(b) \(P\) as a function of all exogenous variables only
A reduced-form equation is obtained by solving the simultaneous system so that each endogenous variable is expressed as a function of only exogenous variables (plus a composite error):
\[P = \pi_0 + \pi_1\,\text{INCOME} + \pi_2\,\text{RAINFALL} + v\]
- (a) Wrong: \(Q\) is endogenous
- (c) Wrong: reduced form depends on exogenous variables, not just errors
- (d) Wrong: the whole point is to eliminate endogenous variables from the RHS
Key Point: Reduced form = endogenous variable as a function of exogenous variables only. (Blue Version (c))
Question 18
(Blue Version Question 1)
If you regress quantity on price using market equilibrium data, you are likely estimating:
- The demand curve
- The supply curve
- Neither curve — just the equilibrium relationship
- Both curves simultaneously
(c) Neither curve — just the equilibrium relationship
Market equilibrium data consists of \((P, Q)\) pairs at the intersection of supply and demand. Both curves shift over time, so:
- OLS traces out shifting equilibrium points
- These points do not lie along any single curve
- The estimate is neither the demand nor supply elasticity — just a meaningless hybrid
Key Point: Without instruments to isolate shifts in one curve, OLS cannot identify either structural relationship. This is why we need 2SLS for simultaneous systems. (Blue Version (d))
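A simulated market makes the point concrete. The slopes, shifter coefficients, and seed below are assumptions chosen for illustration: demand slope \(-1\), supply slope \(+1\), and unit coefficients on the shifters.

```r
set.seed(5)
n <- 100000
income <- rnorm(n); rainfall <- rnorm(n)
e_d <- rnorm(n); e_s <- rnorm(n)
a2 <- -1.0   # hypothetical demand slope
b2 <-  1.0   # hypothetical supply slope

# Equilibrium price solves a2*P + income + e_d = b2*P + rainfall + e_s.
P <- (income + e_d - rainfall - e_s) / (b2 - a2)
Q <- a2 * P + income + e_d          # quantity read off the demand curve

ols_slope <- coef(lm(Q ~ P))[["P"]]
# 2SLS for demand: instrument P with the supply shifter, keep income as a control.
P_hat <- fitted(lm(P ~ income + rainfall))
demand_2sls <- coef(lm(Q ~ P_hat + income))[["P_hat"]]

print(c(ols = ols_slope, demand_2sls = demand_2sls,
        true_demand = a2, true_supply = b2))
```

With these particular values the OLS slope is close to zero, which is neither curve, while the 2SLS estimate recovers the demand slope.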
Question 19
(Blue Version Question 3)
Suppose we estimate: price \(= \beta_0 + \beta_1\,\text{sqft} + \beta_2\,\text{bdrms} + e\), where price is in $1000s. We run 5-fold CV:
fit = lm(price ~ sqft + bdrms, x=TRUE, y=TRUE, data=hprice1)
cv.lm(fit, k = 5)
Mean absolute error : 48.25429
Sample standard deviation : 7.6221
Mean squared error : 4343.904
Sample standard deviation : 1182.484
Root mean squared error : 65.40264
Sample standard deviation : 9.110364
Which is the correct interpretation of the RMSE?
- On average, the estimates of home prices are off by $65.40k
- On average, the estimates of home prices are off by \(65.0\%\)
- On average, the estimates of home prices are off by $65.40
- On average, the estimates of home prices are off by $654.0
(a) On average, the estimates of home prices are off by $65.40k
RMSE is measured in the same units as the dependent variable. Since price is in $1000s:
\[\text{RMSE} = 65.40 \implies \text{typical prediction error} \approx 65.40 \times \$1{,}000 = \$65{,}400 = \$65.40\text{k}\]
- (b) Wrong: RMSE is not a percentage; it is in the units of \(y\)
- (c) Wrong: ignores that price is in $1000s
- (d) Wrong: no reason to multiply by 10
Technical note: RMSE (\(\sqrt{\frac{1}{n}\sum e_i^2}\)) is not the same as MAE (\(\frac{1}{n}\sum |e_i|\)). RMSE penalizes large errors more heavily due to the squaring, so it is always \(\geq\) MAE. Here, MAE \(= 48.25\)k while RMSE \(= 65.40\)k. Strictly speaking, RMSE measures the “typical” prediction error in a root-mean-square sense, not the simple arithmetic average of absolute errors.
Key Point: RMSE inherits the units of the dependent variable. Always check what units \(y\) is measured in. (Blue Version (b))
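A tiny example shows the RMSE versus MAE distinction and the unit conversion. The error vector is made up for illustration and is not taken from `hprice1`.

```r
# Hypothetical prediction errors in $1000s.
err  <- c(3, -4, 0, 5)
mae  <- mean(abs(err))          # (3 + 4 + 0 + 5) / 4 = 3
rmse <- sqrt(mean(err^2))       # sqrt((9 + 16 + 0 + 25) / 4) = sqrt(12.5)
print(c(mae = mae, rmse = rmse))

# Converting the RMSE reported in the output from $1000s to dollars.
rmse_dollars <- 65.40264 * 1000
print(rmse_dollars)
```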
Question 20
(Blue Version Question 12)
Consider the agricultural market model:
\[\text{Demand: } Q = \alpha_1 + \alpha_2 P + \alpha_3\,\text{PERCAPINCOME} + e_d\] \[\text{Supply: } Q = \beta_1 + \beta_2 P + \beta_3\,\text{WEATHER} + e_s\]
Which equation (if any) is identified?
- Only the demand equation
- Only the supply equation
- Both equations
- Neither equation
(c) Both equations
Check the order condition: excluded exogenous variables \(\geq\) endogenous RHS variables.
Demand: Endogenous RHS: \(P\) (1). Excluded from demand: WEATHER (1). \(1 \geq 1\) \(\implies\) identified. WEATHER serves as instrument for \(P\).
Supply: Endogenous RHS: \(P\) (1). Excluded from supply: PERCAPINCOME (1). \(1 \geq 1\) \(\implies\) identified. PERCAPINCOME serves as instrument for \(P\).
Key Point: Each equation needs at least as many excluded exogenous variables as endogenous RHS variables (order condition for identification). (Blue Version (d))
Question 21
(Blue Version Question 11)
Below is output from estimating the housing supply equation with 2SLS:
2SLS estimates for 'supply'
Model Formula: quantity ~ price + labor_cost + trend
Instruments: ~income + labor_cost + trend
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.234 15.678 2.885 0.0067 **
price 0.542 0.156 3.474 0.0015 **
labor_cost -0.287 0.092 -3.120 0.0038 **
trend 1.234 0.345 3.577 0.0011 **
Which coefficient has the wrong expected sign based on economic theory?
- price (should be negative)
- labor_cost (should be positive)
- trend (should be negative)
- None of the above — all coefficients have expected signs
(d) None of the above — all coefficients have expected signs
This is a supply equation with quantity on the LHS:
- price (\(+0.542\)): higher price \(\implies\) more supply. Positive is correct.
- labor_cost (\(-0.287\)): higher input costs \(\implies\) less supply (supply shifts left). Negative is correct.
- trend (\(+1.234\)): positive time trend reflects development/technology over time. Positive is reasonable.
Note: Option (b) claims labor_cost “should be positive.” This would be true in a supply-price function (\(P\) on LHS), but with \(Q\) on the LHS, higher costs reduce quantity supplied.
Key Point: Always check whether quantity or price is on the LHS before evaluating expected signs. (Blue Version (d))
Question 22
(Blue Version Question 18)
Consider a wage-employment model:
\[\text{Labor Demand: } L = \beta_1 + \beta_2 W + \beta_3\,\text{OUTPUT} + e_d\] \[\text{Labor Supply: } L = \alpha_1 + \alpha_2 W + \alpha_3\,\text{UNEMP} + e_s\]
To estimate the labor demand equation using 2SLS, which variable would you use as an instrument for \(W\)?
- OUTPUT (it’s in the demand equation)
- UNEMP (it shifts supply but not demand)
- \(e_d\) (the error term)
- No instrument is needed
(b) UNEMP (it shifts supply but not demand)
A valid instrument for \(W\) in the demand equation must be:
- Relevant: correlated with \(W\)
- Exogenous: not in the demand equation, uncorrelated with \(e_d\)
Evaluating each option:
- (a) OUTPUT is already in the demand equation, so it cannot be an excluded instrument
- (b) UNEMP: in supply but not in demand. It shifts supply (affecting \(W\)) without directly entering demand
- (c) \(e_d\) is unobservable, so it cannot be used as an instrument
- (d) \(W\) is endogenous, so an instrument is needed
Key Point: Use variables from the other equation as instruments (exclusion restriction). (Blue Version (c))
Question 23
(Blue Version Question 19)
The R code below uses the systemfit package to estimate a simultaneous equations system:
library(systemfit)
demand_eq <- quantity ~ price + income
supply_eq <- quantity ~ price + cost
system_eqs <- list(demand_eq, supply_eq)
instruments <- ~ income + cost
result <- systemfit(system_eqs, method="2SLS",
inst=instruments, data=market_data)
What are the endogenous variables in this system?
- quantity and price
- income and cost
- quantity, price, income, and cost
- Only quantity
(a) quantity and price
From the systemfit code:
- Demand: quantity ~ price + income
- Supply: quantity ~ price + cost
- Instruments: ~ income + cost
quantity and price appear as dependent/RHS variables jointly determined by the system \(\implies\) endogenous.
income and cost appear in the instrument list \(\implies\) exogenous.
Key Point: Variables in the instrument list are exogenous. Variables that appear as both dependent and RHS variables across equations are endogenous. (Blue Version (b))
Question 24
(Blue Version Question 9)
Consider a supply-demand system for rental apartments. Below is the reduced-form regression for RENT:
Call: lm(rent ~ income + construction_cost)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 450.23 125.45 3.588 0.00048 ***
income 2.15 0.35 6.143 3.2e-09 ***
construction_cost 1.85 0.28 6.607 6.1e-10 ***
---
F-statistic: 89.34 on 2 and 297 DF, p-value: < 2.2e-16
The positive coefficient on INCOME in the reduced form for RENT indicates:
- Higher income shifts supply right
- The demand equation is not identified
- Higher income shifts demand right, increasing equilibrium rent
- INCOME is endogenous
(c) Higher income shifts demand right, increasing equilibrium rent
The reduced-form coefficient on INCOME is \(+2.15\) (\(p < 0.001\)):
- Higher INCOME shifts the demand curve right (people can afford more)
- With supply unchanged, increased demand drives up equilibrium RENT
- This is a reduced-form (equilibrium) effect
Why others are wrong:
- (a) Wrong: INCOME is a demand shifter, not supply
- (b) Wrong: demand is identified (CONSTRUCTION_COST is excluded from demand)
- (d) Wrong: INCOME is exogenous (used as instrument)
Key Point: Reduced-form coefficients capture total equilibrium effects of exogenous variables. (Blue Version (d))
Question 25
(Blue Version Question 8)
A researcher estimates both demand and supply using 2SLS and finds that the price elasticity of demand is \(-0.8\) and the price elasticity of supply is \(+1.2\). If a policy increases production costs (shifting supply left), what happens to equilibrium price and quantity?
- Price increases, quantity decreases
- Price decreases, quantity increases
- Both price and quantity increase
- Both price and quantity decrease
(a) Price increases, quantity decreases
When production costs increase, the supply curve shifts left:
- Price increases: reduced supply creates excess demand, bidding price up
- Quantity decreases: at the higher price, consumers demand less (moving along the demand curve)
The specific elasticities (\(-0.8\) for demand, \(+1.2\) for supply) determine the magnitudes of changes, but not the directions — those depend only on the signs of the slopes.
Key Point: Leftward supply shift + downward-sloping demand \(\implies\) price up, quantity down. This is standard comparative statics. (Blue Version (a))
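As a worked illustration of how the elasticities set the magnitudes, assume constant-elasticity (log-linear) curves and a hypothetical cost shifter \(c \geq 0\) that lowers log supply one-for-one at a given price. Neither assumption is part of the question; \(a\) and \(b\) are arbitrary intercepts.

\[ \begin{aligned} \text{Demand: } \ln Q &= a - 0.8 \ln P \\ \text{Supply: } \ln Q &= b + 1.2 \ln P - c \\ a - 0.8 \ln P^* &= b + 1.2 \ln P^* - c \implies \ln P^* = \frac{a - b + c}{2.0} \\ \frac{d \ln P^*}{dc} &= \frac{1}{1.2 + 0.8} = 0.5 > 0, \qquad \frac{d \ln Q^*}{dc} = -0.8 \times 0.5 = -0.4 < 0 \end{aligned} \]

Under these assumptions, a cost increase that cuts supply by about 10% at the old price (\(c = 0.10\)) raises equilibrium price by about 5% and lowers equilibrium quantity by about 4%. Different elasticities would change the 5% and 4%, but not the signs.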
Answer Key Table
| Purple Q# | Purple Answer | Blue Q# | Blue Answer |
|---|---|---|---|
| 1 | B* | 4 | C* |
| 2 | A | 22 | B |
| 3 | C | 23 | D |
| 4 | A | 5 | B |
| 5 | B | 20 | B |
| 6 | D | 21 | D |
| 7 | B | 24 | B |
| 8 | C | 25 | A |
| 9 | D | 7 | A |
| 10 | A | 6 | B |
| 11 | B | 10 | C |
| 12 | C | 13 | D |
| 13 | B | 14 | C |
| 14 | B | 15 | C |
| 15 | B** | 16 | A |
| 16 | B** | 17 | A |
| 17 | B | 2 | C |
| 18 | C | 1 | D |
| 19 | A | 3 | B |
| 20 | C | 12 | D |
| 21 | D | 11 | D |
| 22 | B | 18 | C |
| 23 | A | 19 | B |
| 24 | C | 9 | D |
| 25 | A | 8 | A |
* Question 1 is disputed. Professor’s answer is (b) on the purple version and (c) on the blue version. See the explanation under Question 1 for why \(\operatorname{Cov}(x,e)=0\) guarantees consistency but not unbiasedness.
** Questions 15 and 16: Professor’s purple answer key listed (d) for both. The correct answers are (b) for both — see explanations above.