Chapter 15: Discussion Problems

Panel Data

Author: Jake Anderson
Published: March 3, 2026
Modified: March 4, 2026

Note: Prerequisites

These problems accompany the Panel Data chapter and its sub-pages on Fixed Effects, Random Effects, Cluster-Robust SEs, and Hausman-Taylor. Read those first for the theory behind these exercises.

Discussion Problems

Question 15.21

This exercise uses data from the STAR experiment introduced to illustrate fixed and random effects for grouped data. It replicates Exercise 15.20 with teachers (\(TCHID\)) chosen as the cross section of interest. In the STAR experiment, children were randomly assigned within schools into three types of classes: small classes with 13–17 students, regular-sized classes with 22–25 students, and regular-sized classes with a full-time teacher aide to assist the teacher. Student scores on achievement tests were recorded, as well as some information about the students, teachers, and schools. Data for the kindergarten classes are contained in the data file star.

Part A

Estimate a regression equation (with no fixed or random effects) where \(READSCORE\) is related to \(SMALL\), \(AIDE\), \(TCHEXPER\), \(TCHMASTERS\), \(BOY\), \(WHITE\_ASIAN\), and \(FREELUNCH\). Discuss the results. Do students perform better in reading when they are in small classes? Does a teacher’s aide improve scores? Do the students of more experienced teachers score higher on reading tests? Does gender or race make a difference?

Solution

Show code
library(plm)        # panel data estimators and vcovHC()
library(lmtest)     # coeftest()
library(car)        # linearHypothesis()
library(stargazer)  # regression tables
star <- read.csv("data/star.csv")
pooled.lm <- plm(readscore ~ small + aide + tchexper + tchmasters + boy +
            white_asian + freelunch,
            data = star,
            model = "pooling",
            index = c("tchid", "schid"))
summary(pooled.lm)
Pooling Model

Call:
plm(formula = readscore ~ small + aide + tchexper + tchmasters + 
    boy + white_asian + freelunch, data = star, model = "pooling", 
    index = c("tchid", "schid"))

Unbalanced Panel: n = 324, T = 4-27, N = 5766

Residuals:
    Min.  1st Qu.   Median  3rd Qu.     Max. 
-107.683  -20.143   -3.946   14.405  185.627 

Coefficients:
             Estimate Std. Error t-value Pr(>|t|)    
(Intercept) 437.95594    1.34995 324.425  < 2e-16 ***
small         5.74890    0.98994   5.807 6.69e-09 ***
aide          0.80935    0.95281   0.849   0.3957    
tchexper      0.52159    0.07131   7.314 2.94e-13 ***
tchmasters   -1.58888    0.86145  -1.844   0.0652 .  
boy          -6.15330    0.79596  -7.731 1.25e-14 ***
white_asian   4.08296    0.95823   4.261 2.07e-05 ***
freelunch   -14.76706    0.89007 -16.591  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    5810000
Residual Sum of Squares: 5244000
R-Squared:      0.09739
Adj. R-Squared: 0.09629
F-statistic: 88.7501 on 7 and 5758 DF, p-value: < 2.2e-16

The OLS model estimates that students in small classrooms (13–17 students) have an average reading test score that is 5.75 points higher than students in other classrooms, and this estimate is statistically significant. While the coefficient estimate on aide is positive, it is not statistically significant. On average, one additional year of teaching experience raises student reading scores by 0.52 points; this estimate is also statistically significant. The coefficients on gender and race are statistically significant as well: boys score an estimated 6.15 points lower than girls on average, and white/Asian students score an estimated 4.08 points higher than students of other races.
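As a quick check, the conventional 95% interval for small can be reproduced in base R from the coefficient and standard error reported above (this uses the large-sample normal critical value, as stargazer does in part (b)):

```r
# 95% CI for small from the reported coefficient and standard error
b  <- 5.74890
se <- 0.98994
ci <- b + c(-1, 1) * qnorm(0.975) * se
round(ci, 3)  # 3.809 7.689
```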

Part B

Repeat the estimation in (a) using cluster-robust standard errors, with the cluster defined by individual teachers, \(TCHID\). Are the robust standard errors larger or smaller? Compare the 95% interval estimate for the coefficient of \(SMALL\) using conventional and robust standard errors.

Solution

Show code
pooled.cluster <- coeftest(pooled.lm, vcov = vcovHC(pooled.lm, cluster = "group"))
stargazer(pooled.lm, pooled.cluster, type = "text", ci = TRUE, model.names = FALSE,
         dep.var.labels = c("readscore", "readscore"),
         column.labels = c("OLS se", "Clustered se"))

========================================================
                         Dependent variable:            
             -------------------------------------------
                    readscore             readscore     
                      OLS se             Clustered se   
                       (1)                   (2)        
--------------------------------------------------------
small                5.749***              5.749**      
                  (3.809, 7.689)       (1.306, 10.192)  
                                                        
aide                  0.809                 0.809       
                 (-1.058, 2.677)       (-3.595, 5.214)  
                                                        
tchexper             0.522***              0.522***     
                  (0.382, 0.661)        (0.158, 0.885)  
                                                        
tchmasters           -1.589*                -1.589      
                 (-3.277, 0.100)       (-5.417, 2.239)  
                                                        
boy                 -6.153***             -6.153***     
                 (-7.713, -4.593)      (-7.781, -4.525) 
                                                        
white_asian          4.083***              4.083**      
                  (2.205, 5.961)        (0.363, 7.803)  
                                                        
freelunch           -14.767***            -14.767***    
                (-16.512, -13.023)    (-16.890, -12.644)
                                                        
Constant            437.956***            437.956***    
                (435.310, 440.602)    (433.075, 442.837)
                                                        
--------------------------------------------------------
Observations          5,766                             
R2                    0.097                             
Adjusted R2           0.096                             
F Statistic  88.750*** (df = 7; 5758)                   
========================================================
Note:                        *p<0.1; **p<0.05; ***p<0.01

Estimated coefficients and confidence intervals are given for the model, with conventional OLS standard errors in column (1) and cluster-robust standard errors in column (2). In general, the confidence intervals widen once clustering is accounted for. In particular, the estimated CI for small widens from \([3.809, 7.689]\) to \([1.306, 10.192]\), indicating considerably more uncertainty about the coefficient estimate.
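The implied cluster-robust standard error can be backed out of the reported interval; a quick base R sketch using the bounds quoted above:

```r
# Back out the cluster-robust SE for small from its reported 95% CI
ci <- c(1.306, 10.192)
se_robust <- diff(ci) / (2 * qnorm(0.975))
round(se_robust, 3)  # 2.267, versus the conventional 0.990
```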

Part C

Reestimate the model in part (a) with teacher random effects and using both conventional and cluster-robust standard errors. Compare these results with those from parts (a) and (b).

Solution

Show code
re.plm <- plm(readscore ~ small + aide + tchexper + tchmasters + boy +
            white_asian + freelunch,
            data = star,
            model = "random",
            index = c("tchid", "schid"),
            random.method = "walhus")
re.cluster <- coeftest(re.plm, vcov = vcovHC(re.plm, cluster = "group"))
stargazer(pooled.cluster, re.plm, re.cluster, type = "text", model.names = FALSE,
         column.labels = c("Pooled w/ clustered", "Random Effects", "RE w/ clustered"))

===============================================================
                            Dependent variable:                
             --------------------------------------------------
                                   readscore                   
             Pooled w/ clustered Random Effects RE w/ clustered
                     (1)              (2)             (3)      
---------------------------------------------------------------
small              5.749**          5.625**         5.625**    
                   (2.267)          (2.215)         (2.263)    
                                                               
aide                0.809            0.833           0.833     
                   (2.247)          (2.322)         (2.234)    
                                                               
tchexper          0.522***          0.442***        0.442**    
                   (0.185)          (0.162)         (0.179)    
                                                               
tchmasters         -1.589            -1.705         -1.705     
                   (1.953)          (1.974)         (1.911)    
                                                               
boy               -6.153***        -5.133***       -5.133***   
                   (0.831)          (0.704)         (0.762)    
                                                               
white_asian        4.083**          6.138***       6.138***    
                   (1.898)          (1.275)         (1.417)    
                                                               
freelunch        -14.767***        -14.653***     -14.653***   
                   (1.083)          (0.843)         (0.858)    
                                                               
Constant         437.956***        437.017***     437.017***   
                   (2.490)          (2.436)         (2.484)    
                                                               
---------------------------------------------------------------
Observations                         5,766                     
R2                                   0.339                     
Adjusted R2                          0.338                     
F Statistic                        462.086***                  
===============================================================
Note:                               *p<0.1; **p<0.05; ***p<0.01

The random effects results are very similar to the pooled results in parts (a) and (b) once clustered standard errors are applied.

Part D

Are there any variables in the equation that might be correlated with the teacher effects? Recall that teachers were randomly assigned within schools, but not across schools. Create teacher-level averages of the variables \(BOY\), \(WHITE\_ASIAN\), and \(FREELUNCH\) and carry out the Mundlak test for correlation between them and the unobserved heterogeneity.

Solution

Teacher effects may be correlated with schoolwide demographics; for example, teachers in one school might be better on average than those in another because of the resources the schools offer. Creating teacher-level averages of the demographic variables lets us check for this possible correlation with a Mundlak test:

Show code
# Get average of variables by teacher
teacher.mean <- aggregate(cbind(boy, white_asian, freelunch) ~ tchid, star, mean)
# Rename columns to indicate they are averages
names(teacher.mean) <- paste(names(teacher.mean), c("", "m", "m", "m"), sep = "")
# Merge averages with original panel data
star.m <- merge(star, teacher.mean, by = "tchid")
# Estimate random effects with averaged variables
plm.random.m <- plm(readscore ~ small + aide + tchexper + tchmasters + boy +
                    white_asian + freelunch + boym + white_asianm + freelunchm,
                    data = star.m,
                    index = c("tchid", "schid"),
                    model = "random",
                    random.method = "walhus")
# Test that average variables are jointly different from 0
linearHypothesis(plm.random.m, c("boym = 0", "white_asianm = 0", "freelunchm = 0"))

Linear hypothesis test:
boym = 0
white_asianm = 0
freelunchm = 0

Model 1: restricted model
Model 2: readscore ~ small + aide + tchexper + tchmasters + boy + white_asian + 
    freelunch + boym + white_asianm + freelunchm

  Res.Df Df Chisq Pr(>Chisq)  
1   5758                      
2   5755  3 7.209     0.0655 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Mundlak test checks the joint significance of the parameters on the average demographic characteristics. The test statistic is \(7.209\), which is less than the critical value \(\chi^{2}_{(0.95,3)} = 7.815\). We cannot reject the null hypothesis that the unobserved effects are uncorrelated with the regressors, so random effects is a suitable model.
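The critical value and p-value quoted here can be verified directly from the chi-square distribution:

```r
# Critical value and p-value for the Mundlak test statistic of 7.209
crit <- qchisq(0.95, df = 3)
pval <- 1 - pchisq(7.209, df = 3)
round(c(critical = crit, p.value = pval), 4)  # 7.8147 0.0655
```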

Part E

Suppose that we treat \(FREELUNCH\) as endogenous. Use the Hausman–Taylor estimator for this model. Compare the results to the OLS estimates in (a) and the random effects estimates in part (d). Do you find any substantial differences?

Solution

Show code
ht <- plm(readscore ~ small + aide + tchexper + tchmasters + boy +
            white_asian + freelunch,
            data = star,
            model = "random",
            index = c("tchid", "schid"),
            random.method = "ht")
stargazer(pooled.cluster, re.plm, ht, type = "text", model.names = FALSE,
         column.labels = c("Pooled w/ clustered", "Random Effects", "Hausman-Taylor"))

==============================================================
                            Dependent variable:               
             -------------------------------------------------
                                           readscore          
             Pooled w/ clustered Random Effects Hausman-Taylor
                     (1)              (2)            (3)      
--------------------------------------------------------------
small              5.749**          5.625**        5.730**    
                   (2.267)          (2.215)        (2.281)    
                                                              
aide                0.809            0.833          0.851     
                   (2.247)          (2.322)        (2.195)    
                                                              
tchexper          0.522***          0.442***       0.512***   
                   (0.185)          (0.162)        (0.164)    
                                                              
tchmasters         -1.589            -1.705         -1.803    
                   (1.953)          (1.974)        (1.977)    
                                                              
boy               -6.153***        -5.133***      -5.159***   
                   (0.831)          (0.704)        (0.704)    
                                                              
white_asian        4.083**          6.138***       6.045***   
                   (1.898)          (1.275)        (1.274)    
                                                              
freelunch        -14.767***        -14.653***     -14.603***  
                   (1.083)          (0.843)        (0.843)    
                                                              
Constant         437.956***        437.017***     436.193***  
                   (2.490)          (2.436)        (2.386)    
                                                              
--------------------------------------------------------------
Observations                         5,766          5,766     
R2                                   0.339          0.074     
Adjusted R2                          0.338          0.073     
F Statistic                        462.086***     462.012***  
==============================================================
Note:                              *p<0.1; **p<0.05; ***p<0.01

The Hausman-Taylor estimates are similar to the previous estimates from the pooled and random effects models.

Question 15.25

Consider the production relationship on Chinese firms used in several chapter examples. We now add another input, \(MATERIALS\). Use the data set from the data file chemical3 for this exercise. (The data file chemical includes many more firms.)

\[ \ln\left(SALES_{it}\right)= \beta_1 + \beta_2 \ln\left(CAPITAL_{it}\right) + \beta_3 \ln\left(LABOR_{it}\right) + \beta_4 \ln\left(MATERIALS_{it}\right) + u_i +e_{it}\]

Part A

Estimate this model using OLS. Compute conventional, heteroskedasticity-robust, and cluster-robust standard errors. Using each type of standard error, construct a 95% interval estimate for the elasticity of \(SALES\) with respect to \(MATERIALS\). What do you observe about these intervals?

Solution

Show code
chemical3 <- read.csv("data/chemical3.csv")

# Estimate model with OLS
pf.ols <- plm(lsales ~ lcapital + llabor + lmaterials,
              data = chemical3,
              model = "pooling",
              index = c("firm", "year"))

# Adjust errors for HC errors
pf.hc <- coeftest(pf.ols, vcov = vcovHC(pf.ols, method = "white1", type = "HC1"))

# Adjust errors for Clustered
pf.cl <- coeftest(pf.ols, vcov = vcovHC(pf.ols, type = "HC1", cluster = "group"))

# Combine
data.frame("ols" = confint(pf.ols)["lmaterials", ],
           "hc" = confint(pf.hc)["lmaterials", ],
           "cluster" = confint(pf.cl)["lmaterials", ])
            ols       hc  cluster
2.5 %  0.729452 0.721738 0.713922
97.5 % 0.754395 0.762109 0.769925

The confidence intervals become wider as we allow for heteroskedasticity (OLS to HC1 errors) and clustering (HC1 to Clustered).
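The widening can be quantified from the bounds above; the interval widths grow at each step:

```r
# Interval widths for the lmaterials elasticity under each SE type
lo <- c(ols = 0.729452, hc = 0.721738, cluster = 0.713922)
hi <- c(ols = 0.754395, hc = 0.762109, cluster = 0.769925)
round(hi - lo, 4)  # 0.0249 0.0404 0.0560
```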

Part B

Using each type of standard error in part (a), test at the 5% level the null hypothesis of constant returns to scale, \(\beta_2+\beta_3+\beta_4=1\) versus the alternative \(\beta_2+\beta_3+\beta_4 \neq 1\). Are the results consistent?

Solution

Show code
# Compute the t test with every version of standard errors
f.ols <- linearHypothesis(pf.ols, "lcapital + llabor + lmaterials = 1")
f.hc <- linearHypothesis(pf.ols, "lcapital + llabor + lmaterials = 1",
                         vcov = vcovHC(pf.ols, method = "white1", type = "HC1"))
f.cl <- linearHypothesis(pf.ols, "lcapital + llabor + lmaterials = 1",
                         vcov = vcovHC(pf.ols, type = "HC1", cluster = "group"))

# Report the test statistics
data.frame(rbind("ols" = f.ols[2, 3:4],
           "hc" = f.hc[2, 3:4],
           "cluster" = f.cl[2, 3:4]))
          Chisq  Pr..Chisq.
ols     65.2858 6.47859e-16
hc      58.0285 2.58351e-14
cluster 28.9595 7.39058e-08

In each case, we reject the null hypothesis of constant returns to scale at the \(1\%\) level or better.

Part C

Use the OLS residuals from (a) and carry out the \(N \times R^2\) test from Chapter 9 to test for AR(1) serial correlation in the errors using the 2005 and 2006 data. Is there evidence of serial correlation? What factors might be causing it?

Solution

Show code
# Estimate production function using only 2005 and 2006 data
pf.plm0506 <- plm(lsales ~ lcapital + llabor + lmaterials,
                  data = chemical3[chemical3$year == 2005 | chemical3$year == 2006, ],
                  model = "pooling",
                  index = c("firm", "year"))
# Run Breusch-Godfrey test for serial correlation
pbgtest(pf.plm0506)

    Breusch-Godfrey/Wooldridge test for serial correlation in panel models

data:  lsales ~ lcapital + llabor + lmaterials
chisq = 249.8, df = 2, p-value <2e-16
alternative hypothesis: serial correlation in idiosyncratic errors

We reject the null hypothesis of no serial correlation. Because the pooled model ignores unobserved firm-level heterogeneity, that heterogeneity persists across both years and shows up as serial correlation in the errors.
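The \(N \times R^2\) version of the test mentioned in the question regresses the OLS residuals on their lag and compares \(N \times R^2\) with a \(\chi^2_{(1)}\) critical value. A self-contained sketch on simulated AR(1) errors (toy data, not the chemical3 residuals) shows the mechanics:

```r
# N x R^2 (LM) test for AR(1) errors, illustrated on simulated data
set.seed(42)
n <- 200
e <- as.numeric(arima.sim(list(ar = 0.5), n))  # serially correlated errors
aux <- lm(e[-1] ~ e[-n])                       # regress e_t on e_{t-1}
lm_stat <- (n - 1) * summary(aux)$r.squared    # the N x R^2 statistic
lm_stat > qchisq(0.95, df = 1)                 # TRUE: reject no serial correlation
```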

Part D

Estimate the model using random effects. How do these estimates compare to the OLS estimates? Test the null hypothesis \(\beta_2+\beta_3+\beta_4=1\) versus the alternative \(\beta_2+\beta_3+\beta_4\neq 1\). What do you conclude? Is there evidence of unobserved heterogeneity? Carry out the LM test for the presence of random effects at the 5% level of significance.

Solution

Show code
# Estimate production function with random effects
pf.random <- plm(lsales ~ lcapital + llabor + lmaterials,
                 data = chemical3,
                 model = "random",
                 index = c("firm", "year"))
# Print out summary of OLS and random effects
stargazer(pf.ols, pf.random, type = "text")

=======================================================
                        Dependent variable:            
             ------------------------------------------
                               lsales                  
                         (1)                   (2)     
-------------------------------------------------------
lcapital               0.104***             0.102***   
                       (0.007)               (0.008)   
                                                       
llabor                 0.105***             0.130***   
                       (0.010)               (0.012)   
                                                       
lmaterials             0.742***             0.700***   
                       (0.006)               (0.008)   
                                                       
Constant               1.641***             1.948***   
                       (0.049)               (0.064)   
                                                       
-------------------------------------------------------
Observations            3,000                 3,000    
R2                      0.920                 0.859    
Adjusted R2             0.920                 0.859    
F Statistic  11,515.500*** (df = 3; 2996) 18,227.600***
=======================================================
Note:                       *p<0.1; **p<0.05; ***p<0.01
Show code
# Test if there are constant returns to scale
linearHypothesis(pf.random, c("lcapital + llabor + lmaterials = 1"))

Linear hypothesis test:
lcapital  + llabor  + lmaterials = 1

Model 1: restricted model
Model 2: lsales ~ lcapital + llabor + lmaterials

  Res.Df Df Chisq Pr(>Chisq)    
1   2997                        
2   2996  1 68.59     <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show code
# Test for random effects
plmtest(pf.random, type = "bp")

    Lagrange Multiplier Test - (Breusch-Pagan)

data:  lsales ~ lcapital + llabor + lmaterials
chisq = 802, df = 1, p-value <2e-16
alternative hypothesis: significant effects

Compared to the OLS estimates, the random effects model estimates a larger return to labor but a smaller return to materials. The test for constant returns to scale still rejects the null, with a p-value very close to zero. Finally, the LM test rejects the null of no random effects, indicating the presence of unobserved firm-level heterogeneity.

Part E

Estimate the model using fixed effects. How do the estimates compare to those in (d)? Use the Hausman test for the significance of the difference in the coefficients. Is there evidence that the unobserved heterogeneity is correlated with one or more of the explanatory variables? Explain.

Solution

Show code
# Estimate fixed effects model
pf.fixed <- plm(lsales ~ lcapital + llabor + lmaterials,
                data = chemical3,
                model = "within",
                index = c("firm", "year"))
# Display fixed and random effects estimates
stargazer(pf.fixed, pf.random, type = "text")

====================================================
                       Dependent variable:          
             ---------------------------------------
                             lsales                 
                        (1)                 (2)     
----------------------------------------------------
lcapital             0.052***            0.102***   
                      (0.013)             (0.008)   
                                                    
llabor               0.106***            0.130***   
                      (0.021)             (0.012)   
                                                    
lmaterials           0.597***            0.700***   
                      (0.012)             (0.008)   
                                                    
Constant                                 1.948***   
                                          (0.064)   
                                                    
----------------------------------------------------
Observations           3,000               3,000    
R2                     0.589               0.859    
Adjusted R2            0.383               0.859    
F Statistic  955.840*** (df = 3; 1997) 18,227.600***
====================================================
Note:                    *p<0.1; **p<0.05; ***p<0.01

The fixed effects estimates look very different from the random effects estimates. This suggests that some regressors are correlated with the unobserved effects. To test for this, we conduct a Hausman test:

Show code
# Hausman test for correlation between regressors and random effect
phtest(pf.fixed, pf.random)

    Hausman Test

data:  lsales ~ lcapital + llabor + lmaterials
chisq = 158.4, df = 3, p-value <2e-16
alternative hypothesis: one model is inconsistent

The Hausman test statistic is \(158.4\), which far exceeds the critical value \(\chi^{2}_{(0.95,3)} = 7.815\), so we conclude that the unobserved effects are correlated with some regressors and fixed effects is the appropriate model.
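The contrast behind the Hausman test can be illustrated for a single coefficient; using the rounded lmaterials estimates and standard errors from the table above (so the figure is only approximate):

```r
# Single-coefficient Hausman contrast: (b_FE - b_RE)^2 / (se_FE^2 - se_RE^2)
t2 <- (0.597 - 0.700)^2 / (0.012^2 - 0.008^2)
round(t2, 1)  # 132.6, far beyond qchisq(0.95, 1) = 3.84
```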

Part F

Obtain the fixed effects residuals, \(\tilde{e}_{it}\). Using OLS with cluster-robust standard errors, estimate the regression \(\tilde{e}_{it} = \rho\tilde{e}_{i, t-1} + r_{it}\), where \(r_{it}\) is a random error. As noted in Exercise 15.10, if the idiosyncratic errors \(e_{it}\) are uncorrelated we expect \(\rho = -1/2\). Rejecting this hypothesis implies that the idiosyncratic errors \(e_{it}\) are serially correlated. Using the 5% level of significance, what do you conclude?

Solution

Show code
# Add residuals to dataframe
chemical3$fixedresid <- resid(pf.fixed)
# Estimate regression of residuals
resid.lm <- plm(fixedresid ~ lag(fixedresid) - 1,
                data = chemical3,
                model = "pooling",
                index = c("firm", "year"))
# Test if errors are uncorrelated
linearHypothesis(resid.lm,
                 c("lag(fixedresid) = -0.5"),
                 vcov = vcovHC(resid.lm, cluster = "group"))

Linear hypothesis test:
lag(fixedresid) = - 0.5

Model 1: restricted model
Model 2: fixedresid ~ lag(fixedresid) - 1

Note: Coefficient covariance matrix supplied.

  Res.Df Df Chisq Pr(>Chisq)    
1   2000                        
2   1999  1 34.72   3.81e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given a test statistic of \(34.72\), we reject the null hypothesis that \(\rho = -1/2\) and conclude that the idiosyncratic errors are serially correlated.
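The null value \(\rho = -1/2\) follows from the within transformation. For serially uncorrelated \(e_{it}\) with variance \(\sigma_e^2\),

\[ \tilde{e}_{it} = e_{it} - \bar{e}_i, \qquad \operatorname{Var}\left(\tilde{e}_{it}\right) = \sigma_e^2\,\frac{T-1}{T}, \qquad \operatorname{Cov}\left(\tilde{e}_{it}, \tilde{e}_{is}\right) = -\frac{\sigma_e^2}{T} \quad (t \neq s),\]

so \(\operatorname{Corr}\left(\tilde{e}_{it}, \tilde{e}_{is}\right) = -1/(T-1)\), which equals \(-1/2\) for the \(T = 3\) years in chemical3.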

Part G

Estimate the model by fixed effects using cluster-robust standard errors. How different are these standard errors from the conventional ones in part (e)?

Solution

Show code
pf.fixed.robust <- coeftest(pf.fixed,
                            vcov = vcovHC(pf.fixed, cluster = "group"))
stargazer(pf.fixed, pf.fixed.robust, type = "text")

==================================================
                      Dependent variable:         
             -------------------------------------
                      lsales                      
                       panel           coefficient
                      linear              test    
                        (1)                (2)    
--------------------------------------------------
lcapital             0.052***           0.052***  
                      (0.013)            (0.016)  
                                                  
llabor               0.106***           0.106***  
                      (0.021)            (0.026)  
                                                  
lmaterials           0.597***           0.597***  
                      (0.012)            (0.029)  
                                                  
--------------------------------------------------
Observations           3,000                      
R2                     0.589                      
Adjusted R2            0.383                      
F Statistic  955.840*** (df = 3; 1997)            
==================================================
Note:                  *p<0.1; **p<0.05; ***p<0.01

The standard errors are similar for capital and labor, but the estimated standard error for materials has more than doubled.
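The ratios of cluster-robust to conventional standard errors, computed from the rounded values in the table above, make the comparison explicit:

```r
# SE ratios: cluster-robust / conventional (rounded values from the table)
round(c(lcapital = 0.016 / 0.013, llabor = 0.026 / 0.021,
        lmaterials = 0.029 / 0.012), 2)  # 1.23 1.24 2.42
```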

Question 15.28

The data file collegecost contains data on cost per student and related factors at four-year colleges in the U.S., covering the period 1987 to 2011. In this exercise, we explore a minimalist model predicting cost per student. Specify the model to be:

\[ \ln\left(TC_{it} \right) = \beta_1 + \beta_2 FTESTU_{it} + \beta_3 FTGRAD_{it} + \beta_4 TT_{it} + \beta_5 GA_{it} + \beta_6 CF_{it} + \sum_{t=2}^{8} \gamma_t D_t + u_i + e_{it}\]

where \(TC\) is the total cost per student, \(FTESTU\) is the number of full-time equivalent students, \(FTGRAD\) is the number of full-time graduate students, \(TT\) is the number of tenure track faculty per 100 students, \(GA\) is the number of graduate assistants per 100 students, and \(CF\) is the number of contract faculty, who are hired on a year-to-year basis. The \(D_t\) are indicator variables for the years 1989, 1991, 1999, 2005, 2008, 2010, and 2011. The base year is 1987. Use data only on public universities for this question.

Part A

Create first differences of the variables. Using the 2011 data, estimate by OLS the first-difference model

\[\Delta \ln\left(TC_{it}\right) = \beta_2 \Delta FTESTU_{it} + \beta_3 \Delta FTGRAD_{it} + \beta_4 \Delta TT_{it} + \beta_5 \Delta GA_{it} + \beta_6 \Delta CF_{it} + \Delta e_{it}\]

Solution

Show code
collegecost <- read.csv("data/collegecost.csv")
# Use data for public universities only
collegecost <- collegecost[collegecost$private == 0, ]
# Subset for 2010 and 2011 data
collegecost.1011 <- collegecost[collegecost$year == 2011 |
                                collegecost$year == 2010, ]
# Convert to panel data
college.pd <- pdata.frame(collegecost.1011, c("unitid", "year"))
# Estimate first differences model
fd.plm <- plm(diff(log(college.pd$tc)) ~ diff(ftestu) + diff(ftgrad)
              + diff(tt) + diff(ga) + diff(cf) - 1,
              model = "pooling",
              data = college.pd)
summary(fd.plm)
Pooling Model

Call:
plm(formula = diff(log(college.pd$tc)) ~ diff(ftestu) + diff(ftgrad) + 
    diff(tt) + diff(ga) + diff(cf) - 1, data = college.pd, model = "pooling")

Balanced Panel: n = 141, T = 1, N = 141

Residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.09221 -0.01342  0.00673  0.00746  0.02357  0.11712 

Coefficients:
              Estimate Std. Error t-value Pr(>|t|)    
diff(ftestu) -0.023483   0.004334  -5.419 2.65e-07 ***
diff(ftgrad)  0.012421   0.020447   0.607   0.5446    
diff(tt)      0.008086   0.007093   1.140   0.2563    
diff(ga)     -0.000548   0.001436  -0.382   0.7034    
diff(cf)      0.012385   0.005207   2.379   0.0188 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    0.2164
Residual Sum of Squares: 0.1576
R-Squared:      0.3133
Adj. R-Squared: 0.2931
F-statistic: 10.1561 on 5 and 136 DF, p-value: 2.777e-08

The coefficient estimates on full-time graduate students, tenure track faculty per 100 students, and graduate assistants per 100 students are insignificant. An increase of 100 full-time equivalent students is estimated to reduce total cost per student by about \(2.3\%\), while adding contract faculty increases it by about \(1.2\%\).

Part B

Repeat the estimation in (a) adding an intercept term. What is the interpretation of the constant?

Solution

Show code
# Estimate first differences model with intercept
fd.plm.int <- plm(diff(log(college.pd$tc)) ~ diff(ftestu) + diff(ftgrad)
                  + diff(tt) + diff(ga) + diff(cf),
                  model = "pooling",
                  data = college.pd)
summary(fd.plm.int)
Pooling Model

Call:
plm(formula = diff(log(college.pd$tc)) ~ diff(ftestu) + diff(ftgrad) + 
    diff(tt) + diff(ga) + diff(cf), data = college.pd, model = "pooling")

Balanced Panel: n = 141, T = 1, N = 141

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-0.102024 -0.014865 -0.001281  0.018912  0.099859 

Coefficients:
                Estimate  Std. Error t-value Pr(>|t|)    
(Intercept)   0.02135988  0.00450891   4.737 5.42e-06 ***
diff(ftestu) -0.02460713  0.00403496  -6.098 1.06e-08 ***
diff(ftgrad)  0.00624508  0.01904837   0.328   0.7435    
diff(tt)      0.04167462  0.00968173   4.304 3.19e-05 ***
diff(ga)     -0.00008158  0.00133857  -0.061   0.9515    
diff(cf)      0.01112005  0.00484652   2.294   0.0233 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    0.2164
Residual Sum of Squares: 0.1351
R-Squared:      0.3756
Adj. R-Squared: 0.3525
F-statistic: 16.2444 on 5 and 135 DF, p-value: 1.592e-12

The intercept estimate indicates the average year-to-year increase in ATC, holding the other variables fixed: between 2010 and 2011, the average college's ATC rose by about \(2.1\%\). Among the slope estimates, only the coefficient on tenure-track faculty has changed appreciably. It is now statistically significant and implies that adding one tenure-track faculty member per 100 students increases ATC by about \(4.2\%\).
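This interpretation follows because an intercept in the first-difference regression corresponds to a common linear time trend \(\delta t\) in the levels model (a sketch in generic notation, with \(x_{it}\) collecting the five regressors):

\[
\log(TC_{it}) = \alpha_i + \delta t + x_{it}'\beta + e_{it}
\;\Rightarrow\;
\Delta\log(TC_{it}) = \delta + \Delta x_{it}'\beta + \Delta e_{it}
\]

so \(\hat{\delta} \approx 0.021\) is the estimated average growth in ATC between the two years.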

Part C

Repeat the estimation in (a) adding an intercept plus the 2011 observations on the variables \(FTESTU\), \(FTGRAD\), \(TT\), \(GA\), and \(CF\). If the assumption of strict exogeneity holds none of the coefficients on these variables should be significant, and they should be jointly insignificant as well. What do you conclude? Why is this assumption important for the estimation of panel data regression models?

Solution

Show code
# Estimate first differences model
fd.plm.levels <- plm(diff(log(tc)) ~ diff(ftestu) + diff(ftgrad)
                      + diff(tt) + diff(ga) + diff(cf) + 1
                      + ftestu + ftgrad + tt + ga + cf,
                      model = "pooling",
                      data = college.pd)
# F test of joint significance of the levels variables
pFtest(fd.plm.levels, fd.plm.int)

    F test for individual effects

data:  diff(log(tc)) ~ diff(ftestu) + diff(ftgrad) + diff(tt) + diff(ga) +  ...
F = 1.03, df1 = 5, df2 = 130, p-value = 0.403
alternative hypothesis: significant effects

The F-statistic for joint significance of the added level variables is \(1.03\), which is below the \(5\%\) critical value \(F_{(5,130)} = 2.28\). We fail to reject the null hypothesis that these coefficients are jointly zero, which is consistent with strict exogeneity. The strict exogeneity assumption is required in panel data models to prevent bias similar to that discussed in Chapters 10 and 11; it is stronger than contemporaneous exogeneity because it requires the errors to be uncorrelated with the regressors in every time period, not just the current one.
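The logic of this test can be illustrated on simulated data (a hypothetical sketch, not the college cost data): when the regressor is strictly exogenous, its level should add nothing to the first-difference regression. Base R is used here so the example does not depend on `plm`.

```r
# Hypothetical two-period panel (not the textbook data)
set.seed(42)
n  <- 500
a  <- rnorm(n)                     # time-invariant unit effect
x1 <- rnorm(n) + a                 # period-1 regressor, correlated with a
x2 <- rnorm(n) + a                 # period-2 regressor
y1 <- 1 + 0.5 * x1 + a + rnorm(n)  # errors are strictly exogenous
y2 <- 1 + 0.5 * x2 + a + rnorm(n)

dy <- y2 - y1                      # first-differencing removes a
dx <- x2 - x1

fd        <- lm(dy ~ dx)           # baseline FD regression
fd.levels <- lm(dy ~ dx + x2)      # add the period-2 level, as in part (c)
anova(fd, fd.levels)               # F test: x2 should be insignificant
```

With strictly exogenous errors the F test on `x2` rejects only at its nominal size; if instead `y1` fed back into `x2`, the level term would absorb that correlation and the test would tend to reject.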

Part D

Create the one period future, or forward, value for each variable, \(x_{t+1}\). That is, for example, in year \(t\) create a new variable \(FTESTU_{i,t+1}\). Using data from 2008 and 2010, estimate the panel data regression model by fixed effects, including the forward values of \(FTESTU\), \(FTGRAD\), \(TT\), \(GA\), and \(CF\). If the assumption of strict exogeneity holds none of the coefficients on these variables should be significant, and they should be jointly insignificant as well. What do you conclude?

Solution

Show code
# Keep 2008, 2010, and 2011 (the 2011 rows supply the lead values for the
# 2010 observations)
collegecost.0810 <- collegecost[collegecost$year %in% c(2008, 2010, 2011), ]
# Recode years as consecutive periods to account for the gap between 2008
# and 2010
collegecost.0810$period <- 1 * (collegecost.0810$year == 2008) +
                          2 * (collegecost.0810$year == 2010) +
                          3 * (collegecost.0810$year == 2011)

# Convert to panel data
college.pd0810 <- pdata.frame(collegecost.0810, c("unitid", "period"))

# Estimate fixed effects (within) panel model with leading variables
fe.lead <- plm(log(tc) ~ ftestu + ftgrad + tt + ga + cf + factor(year)
               + lead(ftestu) + lead(ftgrad) + lead(tt) + lead(ga) + lead(cf),
               data = college.pd0810,
               model = "within")

linearHypothesis(fe.lead, c("lead(ftestu) = 0",
                            "lead(ftgrad) = 0",
                            "lead(tt) = 0",
                            "lead(ga) = 0",
                            "lead(cf) = 0"), test = "F")

Linear hypothesis test:
lead(ftestu) = 0
lead(ftgrad) = 0
lead(tt) = 0
lead(ga) = 0
lead(cf) = 0

Model 1: restricted model
Model 2: log(tc) ~ ftestu + ftgrad + tt + ga + cf + factor(year) + lead(ftestu) + 
    lead(ftgrad) + lead(tt) + lead(ga) + lead(cf)

  Res.Df Df     F Pr(>F)
1    134                
2    129  5 0.661  0.653

The coefficients on the lead values are not jointly significant (\(F = 0.661\), \(p = 0.653\)), so we fail to reject the null hypothesis that they are jointly zero; the data are again consistent with the strict exogeneity assumption.
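For reference, the one-period lead used above can be constructed by hand: within each unit, shift the series up by one position and pad the last period with `NA` (a toy illustration, not the college cost data; `plm`'s `lead()` does the same while respecting the panel index):

```r
# Toy panel: two units observed in three periods
df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                 t  = c(1, 2, 3, 1, 2, 3),
                 x  = c(10, 11, 12, 20, 21, 22))
df <- df[order(df$id, df$t), ]   # rows must be sorted by unit, then time
# Within each id, x.lead at time t holds x at time t + 1 (NA in the last period)
df$x.lead <- ave(df$x, df$id, FUN = function(v) c(v[-1], NA))
df$x.lead  # 11 12 NA 21 22 NA
```

Note that the last observation of each unit has no lead, which is why the 2011 rows drop out of the estimation sample once the lead terms enter the model.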

Discussion Slides

Download Ch. 15 discussion slides (PDF)

Acknowledgements

Thank you to Coleman Cornell for generously sharing his materials with me.