Instrumental Variables and 2SLS
These problems accompany Instrumental Variables, Omitted Variable Bias, and Measurement Error. Read those chapters first for the theory behind these exercises.
Consider the wage equation in Example 10.5:
\[ \log(WAGE) = \beta_{1} + \beta_{2}EXPER + \beta_{3}EXPER^{2} + \beta_{4}EDUC + e \]
Two possible instruments for \(EDUC\) are \(NEARC4\) and \(NEARC2\), where these are dummy variables indicating whether the individual lived near a 4-year college or a 2-year college at age 10. Speculate as to why these might be potentially valid IVs.
We are concerned that some unobserved ability could determine both the level of education and the earnings of individuals. This unobserved variable would appear in the error term and could make education endogenous. The proposed instruments are plausible because family location at age 10 should not affect later earnings directly, but it could change a person's access to, and preference for, education. These instruments could then satisfy the exogeneity and relevance conditions for IVs.
Explain the steps (not the computer commands) required to carry out the regression-based Hausman test, assuming we use both IVs.
Step 1: Estimate the first-stage model \(EDUC = \gamma_{1} + \gamma_{2}EXPER + \gamma_{3}EXPER^{2} + \theta_{1}NEARC4 + \theta_{2}NEARC2 + v\) by OLS, where the first stage includes all exogenous regressors as well as the instruments, and save the estimated residuals \(\hat{v}\).
Step 2: Augment the main regression with these estimated first-stage residuals: \[ \log(WAGE) = \beta_{1} + \beta_{2}EXPER + \beta_{3}EXPER^{2} + \beta_{4}EDUC + \delta\hat{v} + e \]
Step 3: Conduct a standard \(t\)-test of significance on \(\delta\). If we reject \(\delta = 0\), we conclude that \(EDUC\) is endogenous.
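The three steps can be illustrated on simulated data. Below is a minimal numpy sketch; all names and the data-generating process are invented for the illustration (they are not part of the textbook exercise):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulated data with a known endogeneity problem: unobserved "ability"
# raises both education and the log-wage, so educ is endogenous.
ability = rng.normal(size=n)
z = rng.normal(size=(n, 2))                      # two exogenous instruments
educ = 12 + z[:, 0] + z[:, 1] + ability + rng.normal(size=n)
y = 1.0 + 0.1 * educ + ability + rng.normal(size=n)

def ols(X, y):
    """Return OLS coefficients, residuals, and standard errors."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    s2 = e @ e / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return b, e, se

# Step 1: first-stage regression of educ on an intercept and the
# instruments; keep the residuals vhat.
Z = np.column_stack([np.ones(n), z])
_, vhat, _ = ols(Z, educ)

# Step 2: augment the wage equation with the first-stage residuals.
X_aug = np.column_stack([np.ones(n), educ, vhat])
b, _, se = ols(X_aug, y)

# Step 3: t-test on delta, the coefficient of vhat.
t_delta = b[2] / se[2]
print(t_delta)   # a large |t| leads us to reject exogeneity of educ
```

Because ability enters both equations in this simulation, the test should detect endogeneity decisively in a sample of this size.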
Using a large data set, the \(p\)-value for the regression-based Hausman test for the model in Example 10.5, using only \(NEARC4\) as an IV is \(0.28\); using only \(NEARC2\) the \(p\)-value is \(0.0736\), and using both IVs the \(p\)-value is \(0.0873\) [with robust standard errors it is \(0.0854\)]. What should we conclude about the endogeneity of \(EDUC\) in this model?
Because these \(p\)-values are all greater than \(5\%\), we fail to reject the null hypothesis that \(EDUC\) is exogenous at the \(5\%\) significance level. Using \(NEARC2\), either alone or together with \(NEARC4\), the test would reject exogeneity at the \(10\%\) level, so there is weak evidence of endogeneity. Bear in mind that the power of this test is low when the instruments are only weakly correlated with \(EDUC\).
We compute the IV/2SLS residuals, using both \(NEARC4\) and \(NEARC2\) as IV. In the regression of these 2SLS residuals on all exogenous variables and the IV, with \(N = 3010\) observations, all regression \(p\)-values are greater than \(0.30\) and the \(R^{2} = 0.000415\). What can you conclude based on these results?
We want to conduct the Sargan test of overidentification, which is an \(N\times R^{2}\) test. With two instruments and only one endogenous variable, there is one surplus instrument, so we use the \(\chi_{1}^{2}\) distribution to conduct the test. The \(5\%\) critical value for this distribution is \(3.841\), while our statistic is \(N \times R^{2} = 3010 \times 0.000415 = 1.249\). Thus we fail to reject the null hypothesis that the surplus instrument is valid.
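The mechanics of this \(N\times R^{2}\) test are easy to reproduce. Below is a minimal numpy sketch on simulated data with one endogenous regressor and two valid instruments; all names and numbers are invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4000

# One endogenous regressor x and two VALID instruments z1, z2.
u = rng.normal(size=n)                    # error component shared by x and y
z = rng.normal(size=(n, 2))
x = z[:, 0] + z[:, 1] + u + rng.normal(size=n)
y = 2.0 + 0.5 * x + u + rng.normal(size=n)

# 2SLS: regress y on the first-stage fitted values of x.
Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])
xhat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
Xhat = np.column_stack([np.ones(n), xhat])
b_iv = np.linalg.lstsq(Xhat, y, rcond=None)[0]
e_iv = y - X @ b_iv                       # residuals use the ORIGINAL x

# Sargan: regress the 2SLS residuals on all exogenous variables and IVs,
# then compare N * R^2 with the chi-square(1) 5% critical value, 3.841.
g = np.linalg.lstsq(Z, e_iv, rcond=None)[0]
r2 = 1 - np.sum((e_iv - Z @ g) ** 2) / np.sum((e_iv - e_iv.mean()) ** 2)
NR2 = n * r2
print(NR2)
```

Since both instruments are valid by construction here, \(N \times R^{2}\) is a draw from (approximately) a \(\chi_{1}^{2}\) distribution, and the 2SLS slope estimate is consistent for the true value of 0.5.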
The main reason we seldom use OLS to estimate the coefficients of equations with endogenous variables is that other estimation methods are available that yield better fitting equations. Is this statement true or false, or are you uncertain? Explain the reasoning of your answer.
This statement is false. OLS always yields the best-fitting equation in the sense of the smallest sum of squared residuals; no alternative estimator can fit better. We avoid OLS because its estimates are biased and inconsistent when a regressor is endogenous. Estimators such as IV/2SLS remove this inconsistency, at the cost of a worse fit and larger standard errors, which is generally a worthwhile trade-off.
The \(F\)-test of the joint significance of \(NEARC4\) and \(NEARC2\) in the first-stage regression is \(7.89\). The \(95\%\) interval estimates for the coefficient of education using \(OLS\) is \(0.0678\) to \(0.082\), and using 2SLS it is \(0.054\) to \(0.260\). Explain why the width of the interval estimates is so different.
The \(F\)-statistic of \(7.89\) is below the rule-of-thumb threshold of \(10\), which suggests that \(NEARC4\) and \(NEARC2\) are weak instruments. Weak instruments inflate the standard errors of the 2SLS estimates, producing the much wider interval estimate.
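The link between a weak first stage and wide 2SLS intervals can be seen in a small simulation. The numpy sketch below uses an invented data-generating process with a deliberately tiny first-stage coefficient:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# One endogenous regressor, one WEAK instrument (tiny first-stage coefficient).
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.1 * z + u + rng.normal(size=n)
y = 1.0 + 0.5 * x + u + rng.normal(size=n)

ones = np.ones(n)
X = np.column_stack([ones, x])
Z = np.column_stack([ones, z])

# OLS slope and its standard error.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b_ols
se_ols = np.sqrt((e @ e / (n - 2)) * np.linalg.inv(X.T @ X)[1, 1])

# 2SLS slope and its standard error.
xhat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
Xhat = np.column_stack([ones, xhat])
b_iv = np.linalg.lstsq(Xhat, y, rcond=None)[0]
e_iv = y - X @ b_iv
se_iv = np.sqrt((e_iv @ e_iv / (n - 2)) * np.linalg.inv(Xhat.T @ Xhat)[1, 1])

print(se_iv / se_ols)   # many times larger: weak IV => wide interval
```

The 2SLS standard error scales inversely with the variation in the first-stage fitted values, which is tiny when the instrument is weak, so the IV interval is many times wider than the OLS one.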
Consider the data file mroz on working wives. Use the 428 observations on married women who participate in the labor force. In this exercise, we examine the effectiveness of a parent’s college education as an instrumental variable.
## Load in mroz dataset and take only the observations who are in the labor force.
mroz = read.csv("data/mroz.csv")
mroz_short <- mroz[mroz$lfp == 1, ]

Create two new variables. MOTHERCOLL is a dummy variable equaling one if MOTHEREDUC > 12, zero otherwise. Similarly, FATHERCOLL equals one if FATHEREDUC > 12 and zero otherwise. What percentage of parents have some college education in this sample?
mroz_short$mothercoll <- ifelse(mroz_short$mothereduc > 12, 1, 0)
mroz_short$fathercoll <- ifelse(mroz_short$fathereduc > 12, 1, 0)
summary(mroz_short[, c("mothercoll", "fathercoll")])

   mothercoll     fathercoll
Min. :0.000 Min. :0.000
1st Qu.:0.000 1st Qu.:0.000
Median :0.000 Median :0.000
Mean :0.121 Mean :0.117
3rd Qu.:0.000 3rd Qu.:0.000
Max. :1.000 Max. :1.000
In this sample, 12.1% of mothers and 11.7% of fathers have some college education.
Find the correlations between EDUC, MOTHERCOLL, and FATHERCOLL. Are the magnitudes of these correlations important? Can you make a logical argument why MOTHERCOLL and FATHERCOLL might be better instruments than MOTHEREDUC and FATHEREDUC?
cor(mroz_short[, c("educ", "mothercoll", "fathercoll")])

               educ mothercoll fathercoll
educ 1.000000 0.359470 0.398496
mothercoll 0.359470 1.000000 0.354571
fathercoll 0.398496 0.354571 1.000000

The correlations between \(EDUC\) and the instruments, \(0.359\) for MOTHERCOLL and \(0.398\) for FATHERCOLL, are large enough to suggest the instruments are relevant. One logical argument for preferring MOTHERCOLL and FATHERCOLL is that the exact years of parental schooling are more likely to be correlated with the unobserved household and ability factors in the error term, whereas a simple college indicator captures the influence on the daughter's education while plausibly being less related to those omitted factors.
Estimate the wage equation in Example 10.5 using MOTHERCOLL as the instrumental variable. What is the 95% interval estimate for the coefficient of EDUC?
## ivreg() comes from the AER package
library(AER)
iv_mroz <- ivreg(I(log(wage)) ~ educ + exper + I(exper^2)
                 | mothercoll + exper + I(exper^2),
                 data = mroz_short)
summary(iv_mroz)
Call:
ivreg(formula = I(log(wage)) ~ educ + exper + I(exper^2) | mothercoll +
exper + I(exper^2), data = mroz_short)
Residuals:
Min 1Q Median 3Q Max
-3.0872 -0.3244 0.0415 0.3663 2.3562
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.132756 0.496533 -0.27 0.7893
educ 0.076018 0.039408 1.93 0.0544 .
exper 0.043344 0.013414 3.23 0.0013 **
I(exper^2) -0.000871 0.000402 -2.17 0.0307 *
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 1 424 63.56 1.5e-14 ***
Wu-Hausman 1 423 0.74 0.39
Sargan 0 NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.67 on 424 degrees of freedom
Multiple R-Squared: 0.147, Adjusted R-squared: 0.141
Wald test: 8.2 on 3 and 424 DF, p-value: 0.0000257
confint(iv_mroz)

                   2.5 %        97.5 %
(Intercept) -1.10872794  0.8432156756
educ        -0.00144087  0.1534767838
exper        0.01697917  0.0697096456
I(exper^2)  -0.00166065 -0.0000816053

The 95% interval estimate for the coefficient of EDUC is \((-0.0014,\ 0.1535)\). Because it includes zero, the return to education is not significantly different from zero at the 5% level.
For the problem in part (c), estimate the first-stage equation. What is the value of the F-test statistic for the hypothesis that MOTHERCOLL has no effect on EDUC? Is MOTHERCOLL a strong instrument?
mroz_first_stage <- lm(educ ~ mothercoll + exper + I(exper^2), data = mroz_short)
## Method 1 - Manual
vcov_mroz_fs <- vcov(mroz_first_stage)
mothercoll_se <- sqrt(vcov_mroz_fs[2,2])
t_stat_fs <- coef(mroz_first_stage)[2]/mothercoll_se
F_stat_fs <- t_stat_fs^2
## Method 2 - linearHypothesis command
linearHypothesis(mroz_first_stage, c("mothercoll=0"))
Linear hypothesis test:
mothercoll = 0
Model 1: restricted model
Model 2: educ ~ mothercoll + exper + I(exper^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 425 2219
2 424 1930 1 289.3 63.56 1.46e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both methods give an \(F\)-statistic of \(63.56\), far above the rule-of-thumb threshold of \(10\), so MOTHERCOLL is a strong instrument.
Estimate the wage equation in Example 10.5 using MOTHERCOLL and FATHERCOLL as the instrumental variables. What is the 95% interval estimate for the coefficient of EDUC? Is it narrower or wider than the one in part (c)?
iv_mroz_alt <- ivreg(I(log(wage)) ~ educ + exper + I(exper^2)
| mothercoll + fathercoll + exper + I(exper^2),
data = mroz_short)
summary(iv_mroz_alt)
Call:
ivreg(formula = I(log(wage)) ~ educ + exper + I(exper^2) | mothercoll +
fathercoll + exper + I(exper^2), data = mroz_short)
Residuals:
Min 1Q Median 3Q Max
-3.0780 -0.3213 0.0342 0.3765 2.3618
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.279082 0.392221 -0.71 0.4771
educ 0.087848 0.030781 2.85 0.0045 **
exper 0.042676 0.013295 3.21 0.0014 **
I(exper^2) -0.000849 0.000398 -2.13 0.0334 *
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 2 423 56.96 <2e-16 ***
Wu-Hausman 1 423 0.52 0.47
Sargan 1 NA 0.24 0.63
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.668 on 424 degrees of freedom
Multiple R-Squared: 0.153, Adjusted R-squared: 0.147
Wald test: 9.72 on 3 and 424 DF, p-value: 3.22e-06
confint(iv_mroz_alt)

                   2.5 %        97.5 %
(Intercept) -1.05002217  0.4918584602
educ         0.02734574  0.1483495631
exper        0.01654380  0.0688084522
I(exper^2)  -0.00163002 -0.0000671754

The 95% interval estimate for the coefficient of EDUC is \((0.0273,\ 0.1483)\), which is narrower than the interval in part (c) because the second instrument reduces the standard error of the estimate.
For the problem in part (e), estimate the first-stage equation. Test the joint significance of MOTHERCOLL and FATHERCOLL. Do these instruments seem adequately strong?
mroz_first_stage_alt <- lm(educ ~ mothercoll + fathercoll + exper + I(exper^2),
data = mroz_short)
linearHypothesis(mroz_first_stage_alt, c("mothercoll=0", "fathercoll=0"))
Linear hypothesis test:
mothercoll = 0
fathercoll = 0
Model 1: restricted model
Model 2: educ ~ mothercoll + fathercoll + exper + I(exper^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 425 2219
2 423 1748 2 470.9 56.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-statistic from our test of joint significance for the two IV model is \(57\). This is far greater than the rule of thumb of 10 for a weak IV.
For the IV estimation in part (e), test the validity of the surplus instrument. What do you conclude?
sargan <- lm(resid(iv_mroz_alt) ~ exper + I(exper^2) + mothercoll + fathercoll,
data = mroz_short)
nrow(mroz_short) * summary(sargan)$r.squared

[1] 0.237585
Using the estimates from (e), we obtain the residuals \(\hat{e}_{IV}\). The Sargan test, \(NR^{2}\), from this regression is 0.2376. This test statistic has a \(\chi_{(1)}^{2}\) distribution under the null hypothesis that the surplus IV is valid. The \(5\%\) critical value is 3.841, so we fail to reject the null hypothesis.
The CAPM says that the risk premium on security \(j\) is related to the risk premium on the market portfolio. That is \[ r_{j} - r_{f} = \alpha_{j} + \beta_{j}(r_{m} - r_{f}) \] where \(r_{j}\) and \(r_{f}\) are the returns to security \(j\) and the risk-free rate, respectively, \(r_{m}\) is the return on the market portfolio, and \(\beta_{j}\) is the \(j\)th security’s “beta” value. We measure the market portfolio using the Standard & Poor’s value weighted index, and the risk-free rate by the 30-day LIBOR monthly rate of return. As noted in Exercise 10.14, if the market return is measured with error, then we face an errors-in-variables, or measurement error, problem.
Use the observations on Microsoft in the data file capm5 to estimate the CAPM model using OLS. How would you classify the Microsoft stock over this period? Risky or relatively safe, relative to the market portfolio?
capm5 = read.csv("data/capm5.csv")
lm.msft = lm(I(msft - riskfree) ~ I(mkt - riskfree), data = capm5)
summary(lm.msft)
Call:
lm(formula = I(msft - riskfree) ~ I(mkt - riskfree), data = capm5)
Residuals:
Min 1Q Median 3Q Max
-0.2742 -0.0474 -0.0082 0.0387 0.3580
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00325 0.00604 0.54 0.59
I(mkt - riskfree) 1.20184 0.12215 9.84 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0808 on 178 degrees of freedom
Multiple R-squared: 0.352, Adjusted R-squared: 0.349
F-statistic: 96.8 on 1 and 178 DF, p-value: <2e-16
The estimated Microsoft beta is \(1.2018\), with a \(95\%\) interval estimate of \((0.9608,\ 1.4429)\). While the interval includes 1, the majority of it lies above 1, indicating that the stock is relatively risky compared to the market portfolio.
It has been suggested that it is possible to construct an IV by ranking the values of the explanatory variable and using the rank as the IV, that is, we sort \((r_{m} - r_{f})\) from smallest to largest, and assign the values \(RANK = 1,2,\dots,180\). Does this variable potentially satisfy the conditions IV1-IV3? Create \(RANK\) and obtain the first-stage regression results. Is the coefficient on \(RANK\) very significant? What is the \(R^{2}\) of the first-stage regression? Can \(RANK\) be regarded as a strong IV?
sorted.mkt = sort(capm5$mkt - capm5$riskfree,decreasing = FALSE, index.return = TRUE)
capm5[sorted.mkt$ix,"rank"] = c(1:180)
rank.stage1 <- lm(I(mkt - riskfree) ~ rank, data = capm5)
summary(rank.stage1)
Call:
lm(formula = I(mkt - riskfree) ~ rank, data = capm5)
Residuals:
Min 1Q Median 3Q Max
-0.11050 -0.00631 0.00150 0.00943 0.02951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.079031 0.002195 -36.0 <2e-16 ***
rank 0.000907 0.000021 43.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0147 on 178 degrees of freedom
Multiple R-squared: 0.913, Adjusted R-squared: 0.912
F-statistic: 1.86e+03 on 1 and 178 DF, p-value: <2e-16
This suggested variable potentially satisfies the conditions because of where we believe the endogeneity originates. From the discussion of equation (10.4), the endogeneity in this model is due strictly to measurement error in the market return, not to any inherent simultaneity between the market return and Microsoft's return, so the ranking of the returns can serve as an instrument. In the first-stage regression the coefficient on \(RANK\) is very significant, with an \(F\)-statistic of \(\hat{t}^{2} = 43.1^{2} \approx 1857.61\) and an \(R^{2}\) of \(0.913\). This indicates that \(RANK\) is a strong instrument.
Compute the first-stage residuals, \(\hat{v}\), and add them to the CAPM model. Estimate the resulting augmented equation by OLS and test the significance of \(\hat{v}\) at the \(1\%\) level of significance. Can we conclude that the market return is exogenous?
capm5$stage1.resid = resid(rank.stage1)
haus.msft = lm(I(msft - riskfree) ~ I(mkt - riskfree) + stage1.resid, data = capm5)
summary(haus.msft)
Call:
lm(formula = I(msft - riskfree) ~ I(mkt - riskfree) + stage1.resid,
data = capm5)
Residuals:
Min 1Q Median 3Q Max
-0.2714 -0.0421 -0.0091 0.0342 0.3489
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00302 0.00598 0.50 0.615
I(mkt - riskfree) 1.27832 0.12675 10.09 <2e-16 ***
stage1.resid -0.87460 0.42863 -2.04 0.043 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0801 on 177 degrees of freedom
Multiple R-squared: 0.367, Adjusted R-squared: 0.36
F-statistic: 51.3 on 2 and 177 DF, p-value: <2e-16
This is equivalent to a Hausman test of endogeneity. The \(p\)-value for the coefficient on \(\hat{v}\) is \(0.043\). Thus, we cannot reject the null hypothesis that the market return is exogenous at the \(1\%\) level.
Use \(RANK\) as an IV and estimate the CAPM model by IV/2SLS. Compare this IV estimate to the OLS estimate in part (a). Does the IV estimate agree with your expectations?
iv.msft = ivreg(I(msft - riskfree) ~ I(mkt - riskfree)|rank, data = capm5)
summary(iv.msft)
Call:
ivreg(formula = I(msft - riskfree) ~ I(mkt - riskfree) | rank,
data = capm5)
Residuals:
Min 1Q Median 3Q Max
-0.27163 -0.04968 -0.00969 0.03768 0.35558
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00302 0.00604 0.50 0.62
I(mkt - riskfree) 1.27832 0.12801 9.99 <2e-16 ***
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 1 178 1857.59 <2e-16 ***
Wu-Hausman 1 177 4.16 0.043 *
Sargan 0 NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0809 on 178 degrees of freedom
Multiple R-Squared: 0.351, Adjusted R-squared: 0.347
Wald test: 99.7 on 1 and 178 DF, p-value: <2e-16
The IV estimate is slightly larger. This is expected because measurement error biases the OLS estimate toward zero (attenuation bias). The \(95\%\) interval estimate is \((1.0257,\ 1.5309)\). All values in the interval are greater than 1, so we reject the null hypothesis that Microsoft's beta equals 1 and conclude that Microsoft is relatively more risky than the market.
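The attenuation claim can be checked by simulation. The numpy sketch below uses an independently mismeasured second reading of the market return as the instrument, a standard errors-in-variables remedy that plays the same role as \(RANK\) in this exercise; all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5000

# True market excess return, observed with measurement error.
m_true = rng.normal(size=n)
m_obs = m_true + rng.normal(size=n)        # noisy measurement, var(err) = 1
y = 1.2 * m_true + rng.normal(scale=0.5, size=n)   # true beta = 1.2

# OLS on the mismeasured regressor suffers attenuation bias:
# plim b_ols = 1.2 * var(m_true) / (var(m_true) + var(err)) = 0.6 here.
X = np.column_stack([np.ones(n), m_obs])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# A second, independently mismeasured reading is a valid instrument.
m_obs2 = m_true + rng.normal(size=n)
Z = np.column_stack([np.ones(n), m_obs2])
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)   # exactly identified IV

print(b_ols[1], b_iv[1])   # attenuated (~0.6) vs consistent (~1.2)
```

The second measurement is correlated with the true return but not with the first measurement's error, which is exactly what the instrument needs to undo the attenuation.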
Create a new variable \(POS = 1\) if the market return \((r_{m} - r_{f})\) is positive, and zero otherwise. Obtain the first-stage regression results using both \(RANK\) and \(POS\) as instrumental variables. Test the joint significance of the IV. Can we conclude that we have adequately strong IV? What is the \(R^{2}\) of the first-stage regression?
capm5$pos = ((capm5$mkt - capm5$riskfree) > 0)
com.stage1 <- lm(I(mkt - riskfree) ~ rank + pos, data = capm5)
summary(com.stage1)
Call:
lm(formula = I(mkt - riskfree) ~ rank + pos, data = capm5)
Residuals:
Min 1Q Median 3Q Max
-0.10918 -0.00673 0.00286 0.00894 0.02665
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.080422 0.002262 -35.5 <2e-16 ***
rank 0.000982 0.000040 24.6 <2e-16 ***
posTRUE -0.009276 0.004215 -2.2 0.029 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0145 on 177 degrees of freedom
Multiple R-squared: 0.915, Adjusted R-squared: 0.914
F-statistic: 951 on 2 and 177 DF, p-value: <2e-16
We have an \(F\)-statistic of 951 and an \(R^2\) of 0.915. This indicates that these operate jointly as strong IVs.
Carry out the Hausman test for endogeneity using the residuals from the first-stage equation in (e). Can we conclude that the market return is exogenous at the \(1\%\) level of significance?
capm5$com.resid = resid(com.stage1)
hausE.msft = lm(I(msft - riskfree) ~ I(mkt - riskfree) + com.resid, data = capm5)
summary(hausE.msft)
Call:
lm(formula = I(msft - riskfree) ~ I(mkt - riskfree) + com.resid,
data = capm5)
Residuals:
Min 1Q Median 3Q Max
-0.2713 -0.0426 -0.0081 0.0334 0.3487
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00300 0.00597 0.50 0.616
I(mkt - riskfree) 1.28312 0.12634 10.16 <2e-16 ***
com.resid -0.95492 0.43306 -2.21 0.029 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08 on 177 degrees of freedom
Multiple R-squared: 0.37, Adjusted R-squared: 0.362
F-statistic: 51.9 on 2 and 177 DF, p-value: <2e-16
The \(p\)-value on the estimated residuals from the first-stage regression is \(0.0287\). Thus at the \(1\%\) level we cannot reject the null hypothesis that the market return is exogenous.
Obtain the IV/2SLS estimates of the CAPM model using \(RANK\) and \(POS\) as instrumental variables. Compare this IV estimate to the OLS estimate in part(a). Does the IV estimate agree with your expectations?
iv2.msft = ivreg(I(msft - riskfree) ~ I(mkt - riskfree)|rank + pos, data = capm5)
summary(iv2.msft)
Call:
ivreg(formula = I(msft - riskfree) ~ I(mkt - riskfree) | rank +
pos, data = capm5)
Residuals:
Min 1Q Median 3Q Max
-0.27168 -0.04960 -0.00983 0.03762 0.35543
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00300 0.00604 0.5 0.62
I(mkt - riskfree) 1.28312 0.12787 10.0 <2e-16 ***
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 2 177 951.26 <2e-16 ***
Wu-Hausman 1 177 4.86 0.029 *
Sargan 1 NA 0.56 0.455
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0809 on 178 degrees of freedom
Multiple R-Squared: 0.351, Adjusted R-squared: 0.347
Wald test: 101 on 1 and 178 DF, p-value: <2e-16
The coefficient estimate is larger than the OLS estimate from part (a). This is expected as the measurement error bias would bias the OLS estimates downwards.
Obtain the IV/2SLS residuals from part (g) and use them (not an automatic command) to carry out a Sargan test for the validity of the surplus IV at the \(5\%\) level of significance.
capm5$sargan = resid(iv2.msft)
sargan.lm = lm(sargan ~ rank + pos, data = capm5)
summary(sargan.lm)
Call:
lm(formula = sargan ~ rank + pos, data = capm5)
Residuals:
Min 1Q Median 3Q Max
-0.2691 -0.0470 -0.0080 0.0377 0.3567
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.002222 0.012633 -0.18 0.86
rank 0.000137 0.000223 0.61 0.54
posTRUE -0.017450 0.023541 -0.74 0.46
Residual standard error: 0.081 on 177 degrees of freedom
Multiple R-squared: 0.0031, Adjusted R-squared: -0.00816
F-statistic: 0.275 on 2 and 177 DF, p-value: 0.76
Our critical value of \(3.841\) comes from the \(\chi_{(1)}^{2}\) distribution. Our \(N \times R^{2}\) statistic of \(180 \times 0.0031 = 0.5585\) is much smaller than the critical value, so we cannot reject the validity of the surplus IV.
This question is an extension of exercise 10.22. Consider the data file mroz on working wives and the model \(\log(WAGE) = \beta_{1} + \beta_{2}EDUC + \beta_{3}EXPER + e\). Use the 428 observations on married women who participate in the labor force. Let the instrumental variable be \(MOTHEREDUC\).
Write down in algebraic form the three moment conditions, like (10.16), that would lead to the IV/2SLS estimates of the model above.
\[\begin{align} \mathbb{E}[(\log(WAGE) - \beta_{1} - \beta_{2}EDUC - \beta_{3}EXPER)MOTHEREDUC] &= 0\\ \mathbb{E}[(\log(WAGE) - \beta_{1} - \beta_{2}EDUC - \beta_{3}EXPER)EXPER] &= 0\\ \mathbb{E}[(\log(WAGE) - \beta_{1} - \beta_{2}EDUC - \beta_{3}EXPER)] &= 0 \end{align}\]
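Because the model is exactly identified (one instrument for one endogenous regressor), the sample analogues of these conditions, \(Z'(y - X\beta) = 0\), can be solved directly for the IV estimates. A numpy sketch on simulated data; the names mimic the exercise but every number is invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Simulated stand-ins for the mroz variables (numbers are invented).
u = rng.normal(size=n)                               # unobserved ability
exper = rng.normal(12, 4, size=n)                    # exogenous regressor
mothereduc = rng.normal(10, 2, size=n)               # instrument
educ = 4 + 0.6 * mothereduc + u + rng.normal(size=n) # endogenous regressor
lwage = 0.5 + 0.08 * educ + 0.01 * exper + u + rng.normal(size=n)

X = np.column_stack([np.ones(n), educ, exper])       # regressors
Z = np.column_stack([np.ones(n), mothereduc, exper]) # instrument set

# Exactly identified: solve the three sample moment conditions Z'(y - Xb) = 0.
b_iv = np.linalg.solve(Z.T @ X, Z.T @ lwage)

# At the solution the sample moments are zero up to rounding error.
moments = Z.T @ (lwage - X @ b_iv)
print(moments)
```

The three sample moments are exactly zero at the IV estimates, which is what part (b) verifies with the actual data.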
Calculate the IV/2SLS estimates and residuals, \(\hat{e}_{IV}\). What is the sum of the IV residuals? What is \(\sum MOTHEREDUC_{i} \times \hat{e}_{IV,i}\)? What is \(\sum EXPER_{i} \times \hat{e}_{IV,i}\)? Relate these results to the moment conditions in (a).
## Read csv file, remove non-participants, add log wage variable
mroz = read.csv("data/mroz.csv")
mroz = mroz[mroz$lfp == 1,]
mroz$lwage = log(mroz$wage)
## Construct IV model
mroz.iv = ivreg(lwage ~ educ + exper | mothereduc + exper, data = mroz)
summary(mroz.iv)
Call:
ivreg(formula = lwage ~ educ + exper | mothereduc + exper, data = mroz)
Residuals:
Min 1Q Median 3Q Max
-3.0382 -0.3280 0.0234 0.3904 2.2348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.30228 0.47689 0.63 0.52652
educ 0.05424 0.03718 1.46 0.14533
exper 0.01544 0.00409 3.77 0.00019 ***
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 1 425 75.2 <2e-16 ***
Wu-Hausman 1 424 2.7 0.1
Sargan 0 NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.681 on 425 degrees of freedom
Multiple R-Squared: 0.118, Adjusted R-squared: 0.114
Wald test: 7.97 on 2 and 425 DF, p-value: 0.000399
## Pull residuals and compute the sample moment sums
mroz$iv.resid = resid(mroz.iv)
sum(mroz$iv.resid * mroz$mothereduc)
sum(mroz$iv.resid * mroz$exper)
sum(mroz$iv.resid)

We have the following: \[\begin{align*} \sum_{i = 1}^{N}\hat{e}_{IV,i} \times MOTHEREDUC_{i} &= 8.792966\times 10^{-13}\\ \sum_{i = 1}^{N}\hat{e}_{IV,i} \times EXPER_{i} &= 1.273204\times 10^{-12}\\ \sum_{i = 1}^{N}\hat{e}_{IV,i} &= 7.704948\times 10^{-14}\\ \end{align*}\]
All three sums are zero up to floating-point rounding, exactly as the sample analogues of the moment conditions in part (a) require.
What is \(\sum EDUC_{i} \times \hat{e}_{IV,i}\)? What is the sum of squared IV residuals? How do these two results compare with the corresponding OLS results in Exercise 10.22(b)?
## Construct OLS model
mroz.ols = lm(lwage ~ educ + exper, data = mroz)
## Pull Residuals and calculate some values
mroz$ols.resid = resid(mroz.ols)
sum(mroz$ols.resid * mroz$educ)

[1] 2.84217e-14

sum(mroz$iv.resid * mroz$educ)

[1] 123.18

sum(mroz$ols.resid^2)

[1] 190.195

sum(mroz$iv.resid^2)

[1] 197
We have \(\sum EDUC_{i} \times \hat{e}_{IV,i} = 123.18\), whereas the OLS regression has \(\sum EDUC_{i} \times \hat{e}_{OLS,i} = 0\). The OLS first-order conditions force the residuals to be orthogonal to \(EDUC\), while the IV moment conditions impose orthogonality with \(MOTHEREDUC\) instead, so the IV residuals need not be orthogonal to \(EDUC\).
We have \(\sum \hat{e}_{IV}^{2} = 197.00\), compared with \(\sum \hat{e}_{OLS}^{2} = 190.19\) for OLS. This is expected: OLS minimizes the sum of squared residuals by construction, so any other estimator, including IV, must yield a sum of squared residuals at least as large.
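That OLS minimizes the sum of squared residuals, so IV can never fit better, is easy to verify numerically. A short numpy sketch on simulated data (invented data-generating process):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 1000

u = rng.normal(size=n)
z = rng.normal(size=n)
x = z + u + rng.normal(size=n)               # endogenous regressor
y = 1.0 + 0.5 * x + u + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)     # exactly identified IV

sse_ols = np.sum((y - X @ b_ols) ** 2)
sse_iv = np.sum((y - X @ b_iv) ** 2)
print(sse_ols <= sse_iv)   # True: no estimator beats OLS on this criterion
```

Since the IV coefficients differ from the OLS ones whenever the regressor is endogenous, the IV sum of squared residuals is strictly larger here.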
Calculate the IV/2SLS fitted values \(FLWAGE = \hat{\beta}_{1} + \hat{\beta}_{2}EDUC + \hat{\beta}_{3}EXPER\). What is the sample average of the fitted values? What is the sample average of \(\log(WAGE)\), \(\overline{\log(WAGE)}\)?
flwage = fitted(mroz.iv)
mean(flwage)

[1] 1.19017

mean(mroz$lwage)

[1] 1.19017
The sample average of the fitted values and the sample average of \(\log(WAGE)\) are identical, both equal to \(1.19017\). This follows from the intercept moment condition in part (a), which forces the IV residuals to sum to zero.
Find each of the following
\[\begin{align*} SST &= \sum\left[\log(WAGE_{i}) - \overline{\log(WAGE_{i})}\right]^{2}\\ SSE_{IV} &= \sum\hat{e}_{IV}^{2}\\ SSR_{IV} &= \sum\left[FLWAGE - \overline{\log(WAGE)}\right]^{2} \end{align*}\]
Compute \(SSR_{IV} + SSE_{IV}\), \(R_{IV,1}^{2} = SSR_{IV}/SST\), and \(R_{IV,2}^{2} = 1 - SSE_{IV}/SST\). How do these values compare to those in Exercise 10.22(d)?
SST = sum((mroz$lwage - mean(mroz$lwage))^2)
iv.SSE = sum(mroz$iv.resid^2)
iv.SSR = sum((flwage - mean(mroz$lwage))^2)
iv.SST = iv.SSE + iv.SSR
iv.R1 = iv.SSR/SST
iv.R2 = 1 - iv.SSE/SST

We compute the following values
\[\begin{align*} SST &= 223.327441\\ SSE_{IV} &= 197.000165\\ SSR_{IV} &= 12.963921\\ SSR_{IV} + SSE_{IV} &= 209.964086\\ R_{IV,1}^{2} &= 0.058049\\ R_{IV,2}^{2} &= 0.117886 \end{align*}\]
In the IV setting, the classic \(SSE + SSR = SST\) formula does not hold. As such, the usual \(R^{2}\) calculations seen in the OLS examples do not hold.
Does your IV/2SLS software report an \(R^{2}\) value? Is it either of the ones in (e)? Explain why the usual concept of \(R^{2}\) fails to hold for IV/2SLS estimation.
The ivreg() command reports an \(R^{2}\) that corresponds to the second definition, \(R_{IV,2}^{2} = 1 - SSE_{IV}/SST\). The usual concept of \(R^{2}\) fails because the standard sum-of-squares decomposition no longer holds. Deriving the decomposition gives \[
SST = SSE + SSR + 2\sum \hat{e}_{i}\hat{y}_{i}
\] Under OLS the normal equations force the residuals to be orthogonal to the regressors, and hence to the fitted values, so the final sum is zero. The IV residuals are instead orthogonal to the instruments; when a regressor is endogenous the cross term does not vanish, so \(SST \neq SSE + SSR\) and the two \(R^{2}\) definitions disagree.
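A short simulation confirms both the decomposition identity and the nonzero cross term for IV residuals; the data-generating process below is invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

u = rng.normal(size=n)
z = rng.normal(size=n)
x = 1.0 + z + u + rng.normal(size=n)         # endogenous regressor
y = 2.0 + 0.5 * x + u + rng.normal(size=n)

Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)     # exactly identified IV

e = y - X @ b_iv                             # IV residuals
yhat = X @ b_iv                              # IV fitted values
SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum(e ** 2)
SSR = np.sum((yhat - y.mean()) ** 2)
cross = 2 * np.sum(e * yhat)

# The identity SST = SSE + SSR + cross holds, but the cross term is not
# zero for IV residuals, so SST != SSE + SSR and the R^2 definitions differ.
print(SST - (SSE + SSR), cross)
```

Because the IV residuals are correlated with the endogenous regressor, and hence with the fitted values, the cross term is large and positive in this design.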
Thank you to Coleman Cornell for generously sharing his materials with me.