Study Tips

I have received many questions about how to study for this course. I will share some tips on how I study, and what I have seen be useful for the students I have taught and tutored in the past.

General to Specific: A Study Progression

A useful principle in test preparation is to move from general to specific as the exam approaches. Early on, your goal is to build strong mental models. You should not be worrying about whether you can solve a specific exam problem, but rather whether you understand the big picture. What is a regression actually doing? Why do we care about standard errors? You want skills that are as general as possible, built on two things: intuition and technique.

As you move into the middle of the course, you should start sharpening your skills through drills. Now that you have a foundation, you can practice specific problem types and identify gaps in your understanding. Try to find the “edge” of your understanding, and push the frontier. These should be problems where you feel good about almost everything, except one part. Then try to understand how that part fits into the overall picture.

In the final stretch before an exam, shift toward very specific testing sessions. Work through past exams or book problems under realistic conditions, pay attention to time pressure, and get comfortable with the format and style of questions you will actually face. You’re not really “learning” here; you’re testing, to figure out where your gaps are and what you need to go back and review, and to build confidence in your strong areas.

Four Types of Study Workouts

Split your study time into the following “workouts”:

Technique: The easiest, most low-level concepts possible
- Notation, concepts, definitions, terms, assumptions
- Examples:
  - “Why do we have an \(i\) sometimes in the formula and sometimes not?”
  - “Why do we have \(b_0\) and \(\beta_0\)?”
Intuition: Graphs and examples to visualize and explain concepts
- Figures of different regressions, graphs with different parameters
- Read articles about the topic, or watch videos explaining it with pictures and as many relatable examples as possible.
- Real-world examples:
  - “What if I regressed wages on test scores?”
  - “What if I regressed socks on shoes?”
Drills: Practice problems with a specific focus
- Computing an integral
- Solving a system of equations
- Solving one part of a regression problem
Highly Specific / Testing: Example exam questions to check your understanding and feel how intense different problems might be.

Topic-Specific Study Resources

Each chapter on this site has “Think:” exercises and worked examples. The discussion problem pages have full solutions to practice problems from the textbook. Here is where to find them:

Topic	Chapter Page	Discussion Problems
Heteroskedasticity	Ch. 8	Discussion Ch. 8
Time Series	Ch. 9	Discussion Ch. 9
Endogeneity & IV	OVB, Measurement Error, IV	Discussion Ch. 10
Simultaneous Equations	Ch. 11	Discussion Ch. 11
Panel Data	Ch. 15	Discussion Ch. 15
Qualitative/Limited DV	Ch. 16	Discussion Ch. 16

Predict Exam Questions

Below is a demonstration of how to predict exam questions by reading the textbook carefully. I use Chapter 10 (endogeneity and IV) as the example, but you should apply this same process to every chapter.

I will start on page 481 of HGL and as I go through the first couple of pages, I will try to think of exam questions that could be asked.

“In this chapter, we relax the exogeneity assumption. When an explanatory variable is random, the properties of the least squares estimator depend on the characteristics of the independent variable \(x\). The assumption of strict exogeneity is SR2 in the simple regression model, \(E(e_i|\mathbf{x}) = 0\), and it is MR2 in the multiple regression model, \(E(e_i|\mathbf{X}) = 0\). The mathematical form of this assumption is simple but the full meaning is complex. In Section 2.10.2, we gave common simple regression model examples when this assumption might fail. In these cases, with an explanatory variable that is endogenous, the usual least squares estimator does not have its desirable properties; it is not an unbiased estimator of the population parameters \(\beta_1, \beta_2, \ldots\); it is not a consistent estimator of \(\beta_1, \beta_2, \ldots\); tests and interval estimators do not have the anticipated properties, and even having large data samples will not cure the problems.”

The most direct way an exam can test this material is through rote memorization:

Rote Questions

What is the definition of strict exogeneity?
What is the definition of endogeneity?
Which of SR1-6 refers to strict exogeneity?
What is the definition of a consistent estimator?
What is the definition of an unbiased estimator?

At the bottom of page 483, there are two claims in the last paragraph. A great exam question could just ask you to prove these claims. \[\mathbb{E}(e_i | \mathbf{x}) = 0 \implies \operatorname{Cov}(\mathbf{X}, \mathbf{e}) = 0 \tag{33.1}\]

\[\mathbb{E}(e_i | \mathbf{x}) = 0 \implies \mathbb{E}(e_i) = 0 \tag{33.2}\]

Questions about claims

Prove claim (33.1).
Prove claim (33.2).
Which is a stronger assumption?
1. \(\mathbb{E}(e_i | \mathbf{x}) = 0\)
2. \(\mathbb{E}(e_i) = 0\)
3. They are equivalent.
Which is a stronger assumption?
1. \(\mathbb{E}(e_i | \mathbf{x}) = 0\)
2. \(\operatorname{Cov}(\mathbf{X}, \mathbf{e}) = 0\)
3. They are equivalent.
What is the proper ordering of assumptions, from strongest to weakest?
1. Zero covariance, mean independence, strict exogeneity
2. Strict exogeneity, mean independence, zero covariance
3. Mean independence, zero covariance, strict exogeneity
4. Zero covariance, strict exogeneity, mean independence
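
One possible route to claims (33.1) and (33.2) is the law of iterated expectations. This is a sketch rather than the textbook's own proof, and the element-wise notation \(x_j\) (a single regressor value) is an assumption introduced here for illustration.

\[\mathbb{E}(e_i) = \mathbb{E}\big[\mathbb{E}(e_i | \mathbf{x})\big] = \mathbb{E}(0) = 0\]

\[\operatorname{Cov}(x_j, e_i) = \mathbb{E}(x_j e_i) - \mathbb{E}(x_j)\,\mathbb{E}(e_i) = \mathbb{E}\big[x_j\,\mathbb{E}(e_i | \mathbf{x})\big] - \mathbb{E}(x_j)\cdot 0 = 0\]

The implications do not run in reverse. For example, if \(x\) is standard normal and \(e = x^2 - 1\), then \(\mathbb{E}(e) = 0\) and \(\operatorname{Cov}(x, e) = \mathbb{E}(x^3) - \mathbb{E}(x) = 0\), yet \(\mathbb{E}(e | x) = x^2 - 1 \neq 0\).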

On page 485, Figure 10.1 (b) shows a fitted regression line. Because there is an endogeneity problem, the fitted line does not recover the true relationship (illustrated by the solid black line). We could ask a question that refers to this type of figure without showing it.

Questions about a figure

(i) When there is omitted variable bias, the line does not fit the data well.
(ii) When there is endogeneity, the best fit line underestimates the true relationship.

Question: Which is correct?

Statement (i) is true, statement (ii) is false.
Statement (i) is false, statement (ii) is true.
Statement (i) is true, statement (ii) is true.
Statement (i) is false, statement (ii) is false.

On page 486, there is a formula for \(\beta_2\): \[\beta_2 = \frac{\operatorname{cov}(x_i, y_i)}{\operatorname{var}(x_i)} - \frac{\operatorname{cov}(x_i, e_i)}{\operatorname{var}(x_i)}\]

Questions about the \(\beta_2\) formula

What does the second term \(\frac{\operatorname{cov}(x_i, e_i)}{\operatorname{var}(x_i)}\) represent? When does it equal zero?
Which term is affected by the strict exogeneity assumption?
Which of the following is the correct interpretation of this formula?
1. The OLS estimator equals the true \(\beta_2\) plus a bias term that depends on the covariance between \(x_i\) and \(e_i\).
2. The OLS estimator equals the true \(\beta_2\) minus a bias term that depends on the covariance between \(x_i\) and \(e_i\).
3. The OLS estimator always equals the true \(\beta_2\).
4. The bias term depends on the variance of \(e_i\), not the covariance between \(x_i\) and \(e_i\).
Suppose \(\operatorname{cov}(x_i, e_i) > 0\). Which direction is the OLS estimator biased?
1. Upward
2. Downward
3. It is unbiased
4. Cannot be determined without more information
True or false: even with a very large sample size, the bias term \(\frac{\operatorname{cov}(x_i, e_i)}{\operatorname{var}(x_i)}\) does not go away.
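
The \(\beta_2\) formula above can be checked numerically. The following is a minimal simulation sketch, not something taken from HGL: the data-generating process (a shared shock `u` entering both `x` and `e`), the parameter values, and the helper names `cov` and `simulate_bias` are all assumptions made for illustration. It uses only the Python standard library.

```python
import random


def cov(a, b):
    """Sample covariance with the n-1 divisor; cov(a, a) is the sample variance."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def simulate_bias(n=20000, beta1=1.0, beta2=2.0, seed=0):
    """Simulate y = beta1 + beta2*x + e with cov(x, e) > 0 (assumed design)."""
    rng = random.Random(seed)
    x, e = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)            # shared shock: this is what makes x endogenous
        x.append(rng.gauss(0, 1) + u)
        e.append(rng.gauss(0, 1) + u)
    y = [beta1 + beta2 * xi + ei for xi, ei in zip(x, e)]
    b2 = cov(x, y) / cov(x, x)         # OLS slope: cov(x, y) / var(x)
    bias_term = cov(x, e) / cov(x, x)  # the second term in the formula
    return b2, bias_term


b2, bias_term = simulate_bias()
print(f"OLS slope: {b2:.3f}   bias term: {bias_term:.3f}   true beta2: 2.0")
```

Because `y` is built as `beta1 + beta2 * x + e`, the sample slope equals `beta2` plus the sample bias term. In this assumed design \(\operatorname{cov}(x_i, e_i) > 0\), so the slope should land above the true value of 2, and raising `n` does not shrink the gap.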

Example Questions for Instrumental Variables, Hausman Test, and Sargan Test

IV Basics

Which of the following are the two conditions an instrument \(z\) must satisfy?
1. Relevance: \(\operatorname{cov}(z_i, x_i) \neq 0\), and Exogeneity: \(\operatorname{cov}(z_i, e_i) = 0\)
2. Relevance: \(\operatorname{cov}(z_i, y_i) \neq 0\), and Exogeneity: \(\operatorname{cov}(z_i, e_i) = 0\)
3. Relevance: \(\operatorname{cov}(z_i, x_i) \neq 0\), and Exogeneity: \(\operatorname{cov}(z_i, x_i) = 0\)
4. Relevance: \(\operatorname{cov}(z_i, e_i) \neq 0\), and Exogeneity: \(\operatorname{cov}(z_i, y_i) = 0\)
Which of the IV conditions can be tested empirically, and which cannot?
1. Relevance can be tested; exogeneity cannot.
2. Exogeneity can be tested; relevance cannot.
3. Both can be tested.
4. Neither can be tested.
Suppose you propose proximity to college as an instrument for education in a wage regression. Is the relevance condition likely satisfied?
1. Yes, because people who grow up near colleges are more likely to attend.
2. No, because proximity to college has no effect on education.
3. It depends on whether proximity affects wages directly.
4. Yes, because proximity to college is uncorrelated with ability.
A valid instrument affects \(y\) only through its effect on \(x\). This is known as:
1. The relevance condition
2. The exclusion restriction
3. The rank condition
4. The order condition
What happens to the IV estimator if the instrument is weak (i.e., \(\operatorname{cov}(z_i, x_i) \approx 0\))?
1. It is unbiased but inefficient.
2. It is consistent but has very large variance.
3. It is biased and inconsistent.
4. It performs the same as OLS.
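
To see the two IV conditions at work, here is a small simulation sketch using only the Python standard library. Everything in it is an illustrative assumption rather than textbook material: the data-generating process, the parameter values, and the names `cov` and `simulate_iv`. The instrument `z` is built to be relevant (it enters `x`) and exogenous (it is independent of `e`), and the simple IV estimator is computed as \(\operatorname{cov}(z_i, y_i) / \operatorname{cov}(z_i, x_i)\).

```python
import random


def cov(a, b):
    """Sample covariance with the n-1 divisor."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def simulate_iv(n=20000, beta1=1.0, beta2=2.0, seed=1):
    """Compare the OLS and IV slopes when x is endogenous (assumed design)."""
    rng = random.Random(seed)
    z, x, e = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)           # instrument: moves x, independent of e
        u = rng.gauss(0, 1)            # shared shock: makes x endogenous
        z.append(zi)
        x.append(zi + u + rng.gauss(0, 1))
        e.append(u + rng.gauss(0, 1))
    y = [beta1 + beta2 * xi + ei for xi, ei in zip(x, e)]
    b_ols = cov(x, y) / cov(x, x)      # OLS slope
    b_iv = cov(z, y) / cov(z, x)       # simple IV slope
    return b_ols, b_iv


b_ols, b_iv = simulate_iv()
print(f"OLS slope: {b_ols:.3f}   IV slope: {b_iv:.3f}   true beta2: 2.0")
```

In this setup the OLS slope should sit noticeably above 2 while the IV slope should be close to it. Shrinking the coefficient on `zi` in the line that builds `x` is one way to experiment with a weak instrument, since the denominator `cov(z, x)` then moves toward zero.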

The Hausman Test

What is the null hypothesis of the Hausman test?
1. The instrument is exogenous.
2. OLS and IV estimates are not significantly different, implying \(x\) is exogenous.
3. OLS and IV estimates are not significantly different, implying \(x\) is endogenous.
4. The model is correctly specified.
If you reject the null hypothesis of the Hausman test, what do you conclude?
1. The instrument is invalid.
2. \(x\) is endogenous and IV is preferred over OLS.
3. OLS is unbiased.
4. The model is misspecified.
True or false: if you fail to reject the null of the Hausman test, that proves OLS is unbiased.
1. True
2. False
3. It depends on the sample size
The Hausman test can be used to determine whether your instrument is exogenous.
1. True
2. False
3. It depends
Even if the Hausman test fails to reject, you might still prefer IV if:
1. You have a strong prior that \(x\) is endogenous based on economic theory.
2. Your sample size is very large.
3. OLS standard errors are smaller than IV standard errors.
4. The instrument passes the Sargan test.
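
One common way to carry out the Hausman test is regression-based: save the first-stage residuals, add them to the original regression, and look at the t-statistic on their coefficient. The sketch below illustrates that idea under stated assumptions. The simulated data, the helper `ols` (a bare-bones least squares routine with homoskedastic standard errors), and the function `hausman_t` are hypothetical names written for this example, not the course's or the textbook's code.

```python
import random


def ols(y, columns):
    """OLS of y on an intercept plus the given columns.

    Returns (coefficients, homoskedastic standard errors, residuals).
    """
    n = len(y)
    X = [[1.0] + [c[i] for c in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    # Invert X'X by Gauss-Jordan elimination on the augmented matrix [X'X | I].
    aug = [xtx[a] + [1.0 if a == b else 0.0 for b in range(k)] for a in range(k)]
    for p in range(k):
        piv = aug[p][p]
        aug[p] = [v / piv for v in aug[p]]
        for r in range(k):
            if r != p:
                f = aug[r][p]
                aug[r] = [vr - f * vp for vr, vp in zip(aug[r], aug[p])]
    inv = [row[k:] for row in aug]
    b = [sum(inv[a][j] * xty[j] for j in range(k)) for a in range(k)]
    resid = [yi - sum(bj * xj for bj, xj in zip(b, r)) for r, yi in zip(X, y)]
    s2 = sum(r * r for r in resid) / (n - k)
    se = [(s2 * inv[a][a]) ** 0.5 for a in range(k)]
    return b, se, resid


def hausman_t(n=5000, seed=2, endogenous=True):
    """t-statistic on the first-stage residual in the augmented regression."""
    rng = random.Random(seed)
    z, x, e = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)
        shared = rng.gauss(0, 1) if endogenous else 0.0  # source of endogeneity
        z.append(zi)
        x.append(zi + shared + rng.gauss(0, 1))
        e.append(shared + rng.gauss(0, 1))
    y = [1.0 + 2.0 * xi + ei for xi, ei in zip(x, e)]
    _, _, v_hat = ols(x, [z])          # first stage: x on z, keep residuals
    b, se, _ = ols(y, [x, v_hat])      # augmented regression: y on x and v_hat
    return b[2] / se[2]


t_endog = hausman_t(endogenous=True)
t_exog = hausman_t(endogenous=False)
print(f"t on v_hat, endogenous x: {t_endog:.2f}")
print(f"t on v_hat, exogenous x:  {t_exog:.2f}")
```

When `x` is built to be endogenous, the t-statistic should be far from zero, which corresponds to rejecting the null. When `x` is exogenous it should behave like an ordinary t-statistic under the null. Note that the test takes the instrument's exogeneity as given; nothing in this calculation checks it.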

The Sargan Test

What is the Sargan test used for, and when can it be applied?
1. Testing whether the endogenous variable is truly endogenous.
2. Testing overidentifying restrictions, that is, whether the instruments are exogenous; it applies when you have more instruments than endogenous variables.
3. Testing whether the instruments are relevant.
4. Testing whether OLS or IV is more efficient.
What is the null hypothesis of the Sargan test?
1. All instruments are relevant.
2. All instruments are exogenous.
3. The endogenous variable is truly endogenous.
4. OLS and IV estimates are equal.
If you reject the null of the Sargan test, what does that imply?
1. All instruments are valid.
2. At least one instrument is likely endogenous.
3. The model is correctly specified.
4. The endogenous variable is exogenous.
If you have exactly one instrument for one endogenous variable, you can use the Sargan test to check instrument validity.
1. True
2. False
3. It depends on the sample size
Suppose you have two instruments and one endogenous variable. You run the Sargan test and fail to reject. Does this guarantee both instruments are valid?
1. Yes, failing to reject confirms both instruments are exogenous.
2. No, failing to reject only means the data are consistent with both being valid, not that they are guaranteed to be.
3. It depends on whether the instruments are correlated with each other.
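
The Sargan statistic is often computed as \(N R^2\) from a regression of the IV residuals on all the instruments, compared against a chi-square distribution with degrees of freedom equal to the number of surplus instruments. The sketch below follows that recipe with two instruments and one endogenous variable. The simulated data, the helper `ols`, and the function `sargan_stat` are hypothetical names written for this illustration, and the design in which `z2` enters the error directly is an assumption chosen to make one instrument invalid.

```python
import random


def ols(y, columns):
    """OLS of y on an intercept plus the given columns: (coefficients, residuals)."""
    n = len(y)
    X = [[1.0] + [c[i] for c in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    # Invert X'X by Gauss-Jordan elimination on the augmented matrix [X'X | I].
    aug = [xtx[a] + [1.0 if a == b else 0.0 for b in range(k)] for a in range(k)]
    for p in range(k):
        piv = aug[p][p]
        aug[p] = [v / piv for v in aug[p]]
        for r in range(k):
            if r != p:
                f = aug[r][p]
                aug[r] = [vr - f * vp for vr, vp in zip(aug[r], aug[p])]
    inv = [row[k:] for row in aug]
    b = [sum(inv[a][j] * xty[j] for j in range(k)) for a in range(k)]
    resid = [yi - sum(bj * xj for bj, xj in zip(b, r)) for r, yi in zip(X, y)]
    return b, resid


def sargan_stat(n=5000, seed=3, z2_valid=True):
    """N * R^2 from regressing 2SLS residuals on both instruments."""
    rng = random.Random(seed)
    z1, z2, x, e = [], [], [], []
    for _ in range(n):
        a, b, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        z1.append(a)
        z2.append(b)
        x.append(a + b + u + rng.gauss(0, 1))
        # If z2 is invalid, it enters the error term directly.
        e.append(u + rng.gauss(0, 1) + (0.0 if z2_valid else b))
    y = [1.0 + 2.0 * xi + ei for xi, ei in zip(x, e)]
    _, v_hat = ols(x, [z1, z2])                       # first stage
    x_hat = [xi - vi for xi, vi in zip(x, v_hat)]     # fitted values of x
    b_2sls, _ = ols(y, [x_hat])                       # second stage
    # IV residuals use the actual x, not the fitted values.
    e_hat = [yi - b_2sls[0] - b_2sls[1] * xi for xi, yi in zip(x, y)]
    _, r = ols(e_hat, [z1, z2])                       # residuals on instruments
    mean_e = sum(e_hat) / n
    sst = sum((ei - mean_e) ** 2 for ei in e_hat)
    ssr = sum(ri * ri for ri in r)
    return n * (1.0 - ssr / sst)


s_valid = sargan_stat(z2_valid=True)
s_invalid = sargan_stat(z2_valid=False)
print(f"Sargan statistic, both instruments valid: {s_valid:.2f}")
print(f"Sargan statistic, z2 invalid:             {s_invalid:.2f}")
```

With one surplus instrument the reference distribution is chi-square with 1 degree of freedom, whose 5% critical value is about 3.84. The invalid-instrument run should produce a statistic far above that. A small statistic in the valid run is consistent with exogeneity but does not prove it.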

Types of Exam Questions

Not all multiple choice questions are the same. When you sit down to write practice questions for yourself, or when you are reading through a question on an exam, it helps to recognize what kind of knowledge is actually being tested. Here are the main types you will see.

Rote / Definition. These are the most straightforward. You either know the definition or you don’t. “What is the strict exogeneity assumption?” “Which of SR1-6 says the errors have mean zero?” There is no trick here. The best preparation is to write out definitions in your own words until they feel natural and you can represent them clearly, or explain them to a lay person.

Implication. These ask: given that some condition holds, what follows? Such a question does not necessarily ask you to define strict exogeneity; it asks what happens to the bias term if (or if and only if) strict exogeneity holds. To get comfortable with these, practice tracing the logical chain from assumption to conclusion. If A, then B. If B, then C. What does C look like?

Identification. A researcher proposes proximity to college as an instrument for education. Which IV condition is hardest to verify? These questions ask you to take a real-world scenario and map it onto the framework you have learned. You need to understand the concept well enough to recognize it in a new context, not just recite it.

True/False with nuance. These are designed to catch students who have a surface-level understanding. The wrong answers are usually almost right: they contain a grain of truth but get one detail wrong. When you see these, slow down. Ask yourself: “Is this always true, or only sometimes? Does it depend on something?”

Ordering / ranking. These ask you to place things in order: assumptions from strongest to weakest, estimators from most to least efficient, conditions from most to least testable. They require you to understand the relationships between concepts. A good way to study for these is to draw out a diagram or hierarchy of the concepts you have learned.

Interpreting code and output. You will often be shown a chunk of R output (a regression table, a test statistic, a p-value) and asked what it means. These questions are not asking you to run the code yourself; they are asking whether you can read what the computer is telling you. Can you look at a coefficient and say whether it is statistically significant? Can you look at an F-statistic from a first-stage regression and say whether the instrument is strong? The best way to prepare is to get comfortable reading summary() output and knowing exactly what each number represents and where it comes from. Examine each object in the code and understand why it is there, what its parameters are, and so on.

Scenario / applied. You run a Hausman test and reject the null. What do you conclude, and what estimator do you use? These are the hardest because they require you to synthesize multiple concepts at once, under time pressure. The best way to prepare is to work through past exam problems out loud, narrating your reasoning as you go. Again, try to explain it clearly to someone else in the course, and try to think of exceptions to the rule.