Final Exam Questions
Practice Problems for the Final Exam
Ch 8–9: Heteroskedasticity & Time Series
Question 1
Consider the infinite lag representation \(y_t = \alpha + \sum_{s=0}^{\infty} \beta_s x_{t-s} + e_t\) for the ARDL model:
\[y_t = \delta + \theta_1 y_{t-1} + \theta_3 y_{t-3} + \delta_1 x_{t-1} + v_t\]
Find an expression for \(\alpha\).
- \(0\)
- \(\delta_1\)
- \(\dfrac{\delta}{1 - \theta_1}\)
- \(\dfrac{\delta}{1 - \theta_1 - \theta_3}\)
- None of the above
Correct Answer: (d)
Write the ARDL model using the lag operator \(L\):
\[(1 - \theta_1 L - \theta_3 L^3) y_t = \delta + \delta_1 L x_t + v_t\]
The infinite lag representation is obtained by inverting the lag polynomial:
\[y_t = (1 - \theta_1 L - \theta_3 L^3)^{-1} (\delta + \delta_1 L x_t + v_t)\]
A constant is unchanged by lagging (\(L\delta = \delta\)), so the constant term is found by evaluating the lag polynomial at \(L = 1\). Equating the constants of the two representations:
\[\delta = (1 - \theta_1 - \theta_3)\alpha \implies \alpha = \frac{\delta}{1 - \theta_1 - \theta_3}\]
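As a numerical check on this result, the sketch below iterates the ARDL recursion with \(x\) and \(v\) held at zero and compares the limit with \(\delta / (1 - \theta_1 - \theta_3)\). It also builds the lag weights \(\beta_s\) by matching powers of \(L\). The parameter values are hypothetical, chosen only so that the model is stable; they do not come from the exam.

```python
# Hypothetical parameter values (not from the exam), chosen so the model is stable.
delta, theta1, theta3, delta1 = 2.0, 0.5, 0.2, 1.5

# Closed-form constant of the infinite-lag representation.
alpha = delta / (1 - theta1 - theta3)

# Iterate y_t = delta + theta1*y_{t-1} + theta3*y_{t-3} with x and v set to zero.
y = [0.0, 0.0, 0.0]
for _ in range(500):
    y.append(delta + theta1 * y[-1] + theta3 * y[-3])

# Lag weights from matching powers of L:
# beta_0 = 0, beta_1 = delta1, beta_s = theta1*beta_{s-1} + theta3*beta_{s-3}.
beta = [0.0, delta1]
for s in range(2, 500):
    b = theta1 * beta[s - 1] + (theta3 * beta[s - 3] if s >= 3 else 0.0)
    beta.append(b)

total_multiplier = sum(beta)  # approaches delta1 / (1 - theta1 - theta3)
print(round(alpha, 4), round(y[-1], 4), round(total_multiplier, 4))
```

The same "evaluate at \(L = 1\)" logic gives the sum of the lag weights, which is why `total_multiplier` settles at \(\delta_1 / (1 - \theta_1 - \theta_3)\).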
Reference: Textbook §9.1 (ARDL models); Prof. Notes Ch. 9, “Infinite Distributed Lag Representation.” Adapted from Spring 2024 Q5.
Question 2
Given a time series plot, ACF, and PACF where:
- The time series oscillates around a constant mean with no apparent trend
- The ACF decays to zero quickly (within 2–3 lags)
- The PACF shows significant spikes at lags 1, 2, and 3, then cuts off
Which statements are correct?
- (i) The ACF shows persistence, suggesting non-stationarity
- (ii) The PACF spikes at lags 1, 2, and 3 indicate an AR(3) component
- (iii) The time series plot suggests stationarity (no trend or seasonality)
- (iv) The series appears stationary as the ACF decays to zero quickly
- and (ii) only
- and (iii) only
- and (iv) only
- and (iv) only
- (ii), (iii), and (iv) only
Correct Answer: (e)
- (i) is wrong: Quick ACF decay means no persistence — the series is stationary
- (ii) is correct: PACF cuts off after lag 3 — indicates an AR(3) process
- (iii) is correct: No trend or seasonality in the time series plot — stationary
- (iv) is correct: Fast ACF decay is a hallmark of stationarity
Reading ACF/PACF plots:
- ACF decays slowly \(\implies\) non-stationary or near-unit-root
- PACF cuts off at lag \(p\) \(\implies\) AR(\(p\)) model
- ACF cuts off at lag \(q\) \(\implies\) MA(\(q\)) model
Reference: Textbook §9.3 (ACF/PACF interpretation); Prof. Notes Ch. 9, “§3 Identifying Time Series Models.” Adapted from Spring 2024 Q6.
Question 3
Given the following sample autocorrelations with \(T = 680\) observations:
| Lag | \(r_k\) |
|---|---|
| 1 | 0.32 |
| 2 | \(-0.91\) |
| 3 | 0.08 |
| 4 | \(-0.01\) |
Which lags are statistically significant at the 5% level? (Use \(z_{0.975} = 1.96\).)
- Lag 1 only
- Lags 1 and 2 only
- Lags 1 and 3 only
- Lags 1, 2, and 3 only
- Lags 1, 2, 3, and 4
Correct Answer: (d)
Under \(H_0: \rho_k = 0\), the test statistic \(r_k \times \sqrt{T}\) is approximately standard normal, so lag \(k\) is significant at the 5% level when \(|r_k \sqrt{T}| > 1.96\):
| Lag | \(r_k\) | \(r_k \times \sqrt{680} \approx r_k \times 26.08\) | Significant? |
|---|---|---|---|
| 1 | 0.32 | 8.35 | Yes |
| 2 | \(-0.91\) | \(-23.73\) | Yes |
| 3 | 0.08 | 2.09 | Yes (\(> 1.96\)) |
| 4 | \(-0.01\) | \(-0.26\) | No |
Lags 1, 2, and 3 are significant; lag 4 is not. Note that lag 3 barely exceeds the critical value.
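The table can be reproduced with a few lines of code. This is a minimal sketch that uses only the numbers given in the question; the variable names are illustrative.

```python
import math

T = 680
r = {1: 0.32, 2: -0.91, 3: 0.08, 4: -0.01}  # sample autocorrelations from the table
z_crit = 1.96

# Under H0: rho_k = 0, sqrt(T) * r_k is approximately standard normal.
z = {k: math.sqrt(T) * rk for k, rk in r.items()}
significant = [k for k, zk in z.items() if abs(zk) > z_crit]

for k in r:
    print(k, round(z[k], 2), k in significant)
```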
Reference: Textbook §9.3 (testing autocorrelation significance); Prof. Notes Ch. 9, “§3 Testing Individual Autocorrelations.” From Spring 2024 Q7.
Ch 10–11: IV/Endogeneity & Simultaneous Equations
Question 4
Consider the system of simultaneous equations for the demand and supply of a panini:
\[Q_d = \alpha_0 + \alpha_1 P + \alpha_2 Y + \alpha_3 Z + \alpha_4 F + u_d\] \[Q_s = \beta_0 + \beta_1 P + \beta_2 W + \beta_3 S + u_s\]
where \(P\) = price, \(Y\) = spending limit, \(W\) = cost of production, \(Z\) = price of a poke bowl, \(S\) = salmon fishing season, \(F\) = final exam time.
Given the reduced-form equation for price:
\[P = \gamma_0 + \gamma_1 Y + \gamma_2 W + \gamma_3 Z + \gamma_4 S + \gamma_5 F + v_1\]
What is the functional form of \(\gamma_5\)?
- \(\dfrac{-\alpha_4}{\alpha_1 - \beta_1}\)
- \(\dfrac{-\alpha_3}{\alpha_1 - \beta_1}\)
- \(\dfrac{\beta_2}{\alpha_1 - \beta_1}\)
- \(\dfrac{\beta_0}{\alpha_1 - \beta_1}\)
- None of the given answers are correct
Correct Answer: (a)
Set \(Q_d = Q_s\) and solve for \(P\):
\[\alpha_0 + \alpha_1 P + \alpha_2 Y + \alpha_3 Z + \alpha_4 F + u_d = \beta_0 + \beta_1 P + \beta_2 W + \beta_3 S + u_s\]
\[(\alpha_1 - \beta_1) P = (\beta_0 - \alpha_0) + \beta_2 W + \beta_3 S - \alpha_2 Y - \alpha_3 Z - \alpha_4 F + (u_s - u_d)\]
The coefficient on \(F\) in the reduced form for \(P\) is:
\[\gamma_5 = \frac{-\alpha_4}{\alpha_1 - \beta_1}\]
Each reduced-form coefficient is a ratio of structural parameters.
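One way to sanity-check the formula is to solve for the equilibrium price numerically and see how it moves when \(F\) rises by one unit. The structural parameter values and the exogenous inputs below are hypothetical, picked only for illustration.

```python
# Hypothetical structural parameters (illustrative only, not from the exam).
a0, a1, a2, a3, a4 = 10.0, -2.0, 0.5, 0.8, 1.2   # demand: alpha_0..alpha_4
b0, b1, b2, b3 = 1.0, 1.5, -0.7, 0.4             # supply: beta_0..beta_3

def equilibrium_price(Y, W, Z, S, F):
    """Solve Q_d = Q_s for P with both error terms set to zero."""
    rhs = (b0 - a0) + b2 * W + b3 * S - a2 * Y - a3 * Z - a4 * F
    return rhs / (a1 - b1)

# Raising F by one unit, holding everything else fixed, changes P by gamma_5.
base = equilibrium_price(Y=20, W=5, Z=3, S=1, F=0)
bumped = equilibrium_price(Y=20, W=5, Z=3, S=1, F=1)
gamma5_numeric = bumped - base
gamma5_formula = -a4 / (a1 - b1)
print(gamma5_numeric, gamma5_formula)
```

Because the reduced form is linear in \(F\), the one-unit change in the solved price equals \(\gamma_5\) for any choice of the other inputs.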
Reference: Textbook §11.2 (“The Reduced-Form Equations”); Prof. Notes Ch. 11, “§3 The Reduced-Form Equations.” Adapted from Spring 2024 Q11.
Question 5
Suppose the regression of Price on no independent variables (intercept-only model) has \(SSE = 1800\). The first-stage regression (Price on all exogenous variables and instruments) has \(SSE = 1750\). The sample size is \(N = 400\), and there are \(J = 5\) instruments and \(K = 6\) total parameters in the first-stage regression.
What can be said about the instruments?
- The F-statistic is 2.25 and the instruments are strong
- The F-statistic is 2.25 and the instruments are weak
- The F-statistic is 12.52 and the instruments are strong
- The F-statistic is 12.52 and the instruments are weak
- We need more information to solve the question
Correct Answer: (b)
Using the F-statistic formula:
\[F = \frac{(SSE_r - SSE_u)/J}{SSE_u/(N - K)} = \frac{(1800 - 1750)/5}{1750/(400 - 6)} = \frac{50/5}{1750/394} = \frac{10}{4.44} = 2.25\]
Since \(F = 2.25 < 10\), the instruments are jointly weak.
The \(F > 10\) rule of thumb (Staiger & Stock):
- \(F > 10\): instruments are strong enough for reliable IV estimation
- \(F < 10\): weak instruments — IV estimates are biased toward OLS, confidence intervals are unreliable
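The arithmetic can be checked with a short function. This is a sketch of the F formula used above, with the question's numbers plugged in; the function name is illustrative.

```python
def first_stage_f(sse_r, sse_u, J, N, K):
    """F-statistic for the joint significance of J instruments in the first stage."""
    return ((sse_r - sse_u) / J) / (sse_u / (N - K))

F = first_stage_f(sse_r=1800, sse_u=1750, J=5, N=400, K=6)
weak = F < 10  # rule-of-thumb threshold used in this question
print(round(F, 2), "weak" if weak else "strong")
```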
Reference: Textbook §10.4 (“Weak Instruments,” Staiger & Stock rule); Prof. Notes Ch. 10, “§4.1 The Strength of Instruments.” From Spring 2024 Q12.
Question 6
A researcher estimates a model by IV/2SLS using \(L = 3\) instruments for \(B = 1\) endogenous variable. The diagnostic output is:
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 3 420 42.100 <2e-16 ***
Wu-Hausman 1 420 5.831 0.0162 *
Sargan 2 NA 1.204 0.5478
At the 5% significance level, which set of conclusions is correct?
- Instruments are weak; endogeneity is present; surplus instruments are valid
- Instruments are strong; no evidence of endogeneity; surplus instruments are invalid
- Instruments are strong; endogeneity is present; surplus instruments appear valid
- Instruments are strong; endogeneity is present; surplus instruments are invalid
Correct Answer: (c)
Interpret each test:
- Weak instruments (\(F = 42.1 \gg 10\)): Strong instruments. \(\checkmark\)
- Wu-Hausman (\(p = 0.0162 < 0.05\)): Reject \(H_0\) (the regressor is exogenous, so OLS is consistent) \(\implies\) evidence of endogeneity. Use IV. \(\checkmark\)
- Sargan (\(p = 0.5478 > 0.05\), \(df = L - B = 3 - 1 = 2\)): Fail to reject \(H_0\) (the surplus instruments are valid). \(\checkmark\)
The 3-test decision tree for IV:
- Are instruments strong? (\(F > 10\))
- Is endogeneity present? (Hausman \(p < 0.05\) \(\implies\) use IV)
- Are surplus instruments valid? (Sargan \(p > 0.05\) \(\implies\) valid)
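The three checks can be written as a small helper. This is only a sketch of the decision rules listed above: the function name and dictionary keys are made up for illustration, and the inputs are the values printed in the diagnostic output.

```python
def iv_diagnostics(weak_f, hausman_p, sargan_p, alpha=0.05, f_threshold=10):
    """Summarize the three IV checks using the rules of thumb listed above."""
    return {
        "instruments_strong": weak_f > f_threshold,
        "endogeneity_detected": hausman_p < alpha,
        "surplus_instruments_not_rejected": sargan_p > alpha,
    }

result = iv_diagnostics(weak_f=42.100, hausman_p=0.0162, sargan_p=0.5478)
print(result)
```

All three entries come out true for this output, which matches answer (c).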
Reference: Textbook §10.4–10.5 (instrument strength, Hausman, Sargan); Prof. Notes Ch. 10, “§4 Specification Tests.” Original question combining all three IV diagnostics.
Question 7
Consider a three-equation simultaneous system (\(M = 3\)):
\[Y_1 = \alpha_0 + \alpha_1 Y_2 + \alpha_2 Y_3 + \alpha_3 X_1 + u_1\] \[Y_2 = \beta_0 + \beta_1 Y_1 + \beta_2 X_1 + \beta_3 X_2 + u_2\] \[Y_3 = \gamma_0 + \gamma_1 Y_1 + \gamma_2 X_2 + \gamma_3 X_3 + u_3\]
where \(Y_1, Y_2, Y_3\) are endogenous and \(X_1, X_2, X_3\) are exogenous. Using the order condition (at least \(M - 1 = 2\) variables, endogenous or exogenous, must be absent from an equation), which equations are identified?
- Only equation 1
- Only equations 2 and 3
- All three equations
- None of the equations
Correct Answer: (c)
Check the order condition for each equation. The system contains six variables (\(Y_1, Y_2, Y_3, X_1, X_2, X_3\)), and each equation must omit at least \(M - 1 = 2\) of them:
- Eq. 1: Includes \(Y_1, Y_2, Y_3, X_1\). Omits \(X_2, X_3\) \(\implies\) 2 omitted \(\geq 2\). \(\checkmark\)
- Eq. 2: Includes \(Y_2, Y_1, X_1, X_2\). Omits \(Y_3, X_3\) \(\implies\) 2 omitted \(\geq 2\). \(\checkmark\)
- Eq. 3: Includes \(Y_3, Y_1, X_2, X_3\). Omits \(Y_2, X_1\) \(\implies\) 2 omitted \(\geq 2\). \(\checkmark\)
All three equations satisfy the order condition with exactly \(M - 1 = 2\) omitted variables. The equivalent count, comparing excluded exogenous variables with right-hand-side endogenous variables, agrees: Eq. 1 excludes 2 exogenous variables and has 2 right-hand-side endogenous variables, while Eqs. 2 and 3 each exclude 1 and have 1. Remember that the order condition is necessary for identification, not sufficient.
Reference: Textbook §11.4 (“Order Condition”); Prof. Notes Ch. 11, “§4.2 A Necessary Condition for Identification.” Adapted from Spring 2021 Q25–26.
Ch 15: Panel Data
Question 8
A researcher estimates a panel data model and runs an F-test for individual effects. The output is:
F test for individual effects
data: lsales ~ lcapital + llabor
F = 14.386, df1 = 999, df2 = 1998,
p-value < 2.2e-16
alternative hypothesis: significant effects
At the 5% significance level, what is the conclusion?
- Since the p-value is small, we reject \(H_0\) of no fixed effects \(\implies\) individual effects exist
- Since the p-value is small, we fail to reject \(H_0\) of no fixed effects
- Since the p-value is small, we reject \(H_0\) of zero variance of individual-specific errors
- Since the p-value is small, we fail to reject \(H_0\) of zero variance of individual-specific errors
Correct Answer: (a)
The F-test (pFtest) compares pooled OLS vs. fixed effects:
- \(H_0\): All individual effects are zero (pooled OLS is adequate)
- \(H_1\): At least some individual effects are non-zero (need FE or RE)
Since \(p < 2.2 \times 10^{-16} < 0.05\), we reject \(H_0\) \(\implies\) individual effects exist, so pooled OLS is not appropriate.
This is Step 1 of the panel model selection workflow:
- F-test: Pooled OLS vs. FE \(\implies\) Do individual effects exist?
- LM test: Pooled OLS vs. RE \(\implies\) Is the variance of the individual effects greater than zero?
- Hausman test: FE vs. RE \(\implies\) Are the individual effects correlated with the regressors?
Reference: Textbook §15.4 (“Testing for Fixed Effects”); Prof. Notes Ch. 15, “§4 Testing for Individual Effects.” Adapted from Fall 2022 Q15.
Question 9
A researcher runs two additional tests on the same panel data:
Lagrange Multiplier Test - (Honda)
data: lsales ~ lcapital + llabor
normal = 44.064, p-value < 2.2e-16
alternative hypothesis: significant effects
Hausman Test
data: lsales ~ lcapital + llabor
chisq = 98.817, df = 2, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent
Given these results (and the F-test from Q8), which model should we use?
- Pooled OLS
- Fixed Effects
- Random Effects
- Hausman-Taylor Estimator
Correct Answer: (b)
Walk through the decision tree:
- F-test (\(p < 2.2 \times 10^{-16}\)): Reject \(H_0\) \(\implies\) individual effects exist. Rule out pooled OLS.
- LM test (\(p < 2.2 \times 10^{-16}\)): Reject \(H_0\): \(\sigma_u^2 = 0\) \(\implies\) random effects are present.
- Hausman test (\(p < 2.2 \times 10^{-16}\)): Reject \(H_0\): \(\text{Cov}(u_i, x_{it}) = 0\) \(\implies\) individual effects are correlated with regressors \(\implies\) RE is inconsistent.
The F and LM tests both confirm individual effects, and the Hausman test indicates that those effects are correlated with the regressors, so random effects is inconsistent and fixed effects is the appropriate model.
Rejecting the Hausman null points to FE (or to Hausman-Taylor if you need coefficients on time-invariant covariates).
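The decision tree can be sketched as a function of the three p-values. This is a simplified illustration, not a complete model-selection procedure: the function name is hypothetical, and it treats pooled OLS as adequate only when neither the F-test nor the LM test rejects.

```python
def choose_panel_model(f_p, lm_p, hausman_p, alpha=0.05):
    """Follow the F -> LM -> Hausman sequence described above (simplified)."""
    if f_p >= alpha and lm_p >= alpha:
        return "Pooled OLS"          # no evidence of individual effects
    if hausman_p < alpha:
        return "Fixed Effects"       # RE inconsistent: effects correlated with regressors
    return "Random Effects"          # RE not rejected, and more efficient than FE

# The output reports p-values as "< 2.2e-16"; 2.2e-16 is used here as an upper bound.
print(choose_panel_model(f_p=2.2e-16, lm_p=2.2e-16, hausman_p=2.2e-16))
```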
Reference: Textbook §15.4–15.5 (F, LM, Hausman tests); Prof. Notes Ch. 15, “§4–5 Model Selection Tests.” Adapted from Fall 2022 Q15–17.
Question 10
In a panel data model, the variance of the individual heterogeneity is \(\sigma_u^2 = 0.8\) and the variance of the idiosyncratic error is \(\sigma_e^2 = 0.05\).
What is the intra-class correlation \(\rho\), i.e., the correlation between the composite error of two observations from the same individual in different time periods?
\[\rho = \text{Corr}(w_{it}, w_{is}) \quad \text{where } w_{it} = u_i + e_{it}\]
- 0.059
- 0.941
- 0.484
- None of the above
Correct Answer: (b)
The intra-class correlation formula:
\[\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} = \frac{0.8}{0.8 + 0.05} = \frac{0.8}{0.85} = 0.941\]
Interpretation: 94.1% of the total error variance is due to individual heterogeneity, and only 5.9% is idiosyncratic. Observations within the same individual are highly correlated.
- High \(\rho\) \(\implies\) strong individual effects \(\implies\) pooled OLS is badly inefficient
- \(\rho\) close to 1 \(\implies\) most variation is between individuals, not within
- \(\rho = 0\) \(\implies\) no individual effects, pooled OLS is fine
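A quick computation of \(\rho\) from the two variance components, as a minimal sketch (the function name is illustrative):

```python
def intraclass_correlation(sigma_u2, sigma_e2):
    """Share of composite-error variance due to the individual effect u_i."""
    return sigma_u2 / (sigma_u2 + sigma_e2)

rho = intraclass_correlation(sigma_u2=0.8, sigma_e2=0.05)
print(round(rho, 3), round(1 - rho, 3))
```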
Reference: Textbook §15.2 (“The Error Components Model,” intra-class correlation); Prof. Notes Ch. 15, “§2 Error Components.” From Spring 2024 Q19 / Fall 2022 Q19.
Question 11
Under what conditions would it be appropriate to use a Hausman-Taylor estimator?
- When the independent variables are strictly exogenous and there are no individual effects
- When we need to estimate the effects of both time-changing and time-invariant variables in a panel data setting, especially when some of the time-changing variables are endogenous
- When the data is purely cross-sectional with no time component
- When all variables are time-varying and there is no correlation with individual effects
- When using pooled OLS to estimate a panel data model with no fixed effects
Correct Answer: (b)
The Hausman-Taylor estimator addresses a specific dilemma:
- Fixed effects removes all time-invariant variables (including ones we care about, like gender or race)
- Random effects keeps time-invariant variables but is inconsistent if \(\text{Cov}(u_i, x_{it}) \neq 0\)
Hausman-Taylor combines both approaches: it uses the time-varying exogenous variables as instruments for the endogenous time-invariant variables.
Use Hausman-Taylor when:
- The Hausman test rejects RE (endogeneity present)
- You want to estimate coefficients on time-invariant variables
- Some time-varying regressors are exogenous (to serve as instruments)
Reference: Textbook §15.6 (“Hausman-Taylor Estimator”); Prof. Notes Ch. 15, “§6 The Hausman-Taylor Estimator.” From Spring 2024 Q23.
Question 12
Which of the following statements about fixed effects estimation is false?
- The within estimator and the LSDV (Least Squares Dummy Variable) estimator produce identical slope coefficients
- A disadvantage of LSDV is that estimating many dummy variable coefficients uses up degrees of freedom
- Fixed effects models can estimate the effect of time-invariant variables like gender or race
- The within transformation subtracts each group’s mean from each observation, eliminating the individual effect \(u_i\)
Correct Answer: (c)
- (a) is true: Within estimator and LSDV are algebraically equivalent for slope coefficients. LSDV additionally produces estimates of the individual intercepts.
- (b) is true: With \(N\) individuals, LSDV adds \(N - 1\) dummy variables \(\implies\) large loss of degrees of freedom.
- (c) is FALSE: The within transformation subtracts group means, which eliminates all time-invariant variables along with \(u_i\). This is the fundamental limitation of FE.
- (d) is true: \(y_{it} - \bar{y}_i = \beta(x_{it} - \bar{x}_i) + (e_{it} - \bar{e}_i)\) — the individual effect \(u_i\) drops out.
If you need coefficients on time-invariant variables, use RE (if exogenous) or Hausman-Taylor (if endogenous).
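To see why statement (c) fails, it helps to apply the within transformation to a tiny example. The panel below is made up for illustration (two individuals, three periods, a hypothetical `female` indicator); it only demonstrates the demeaning step, not a full FE regression.

```python
# Hypothetical toy panel: 2 individuals, 3 periods each (values are made up).
panel = {
    "A": {"x": [1.0, 2.0, 3.0], "female": [1, 1, 1]},
    "B": {"x": [4.0, 6.0, 8.0], "female": [0, 0, 0]},
}

def demean(values):
    """Within transformation: subtract the individual's own time average."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

within = {i: {name: demean(col) for name, col in cols.items()}
          for i, cols in panel.items()}

print(within["A"]["x"])       # time-varying regressor keeps its within variation
print(within["A"]["female"])  # time-invariant regressor becomes all zeros
```

After demeaning, the time-invariant column has no variation left, so its coefficient cannot be estimated.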
Reference: Textbook §15.3–15.4 (within estimator, LSDV); Prof. Notes Ch. 15, “§3 The Fixed Effects Estimator.” Original question synthesizing panel FE concepts.
Ch 16: Qualitative & Limited Dependent Variables
Question 13
We model college attendance (\(\text{psechoice\_b} = 1\) if attends) using parcoll (parent graduated college) and faminc (family income in $1000s). From \(N = 749\) observations:
LPM: \(\hat{P} = 0.546 + 0.256 \cdot parcoll + 0.00134 \cdot faminc\)
Logit: \(\hat{z} = -0.070 + 1.515 \cdot parcoll + 0.0124 \cdot faminc\)
What is the predicted probability for a student whose family earns $100,000 and no parent graduated from college (\(parcoll = 0\))?
- LPM: 67.0%; Logit: 76.3%
- LPM: 68.0%; Logit: 76.3%
- LPM: 80.2%; Logit: 82.4%
- LPM: 67.0%; Logit: 12.4%
Correct Answer: (b)
LPM (probability is the linear prediction directly):
\[\hat{P} = 0.546 + 0.256(0) + 0.00134(100) = 0.546 + 0.134 = 0.680 = 68.0\%\]
Logit (apply the logistic CDF \(\Lambda(z) = \frac{1}{1 + e^{-z}}\)):
\[z = -0.070 + 1.515(0) + 0.0124(100) = -0.070 + 1.24 = 1.17\] \[\hat{P} = \frac{1}{1 + e^{-1.17}} = \frac{1}{1 + 0.310} = 0.763 = 76.3\%\]
In the LPM, the coefficient is the marginal effect. In the logit, you must transform through \(\Lambda(\cdot)\) to get a probability.
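Both predictions can be reproduced directly from the estimated coefficients. A minimal sketch, with illustrative variable names:

```python
import math

def logistic(z):
    """Logistic CDF, Lambda(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

parcoll, faminc = 0, 100  # faminc is measured in $1000s

p_lpm = 0.546 + 0.256 * parcoll + 0.00134 * faminc
z = -0.070 + 1.515 * parcoll + 0.0124 * faminc
p_logit = logistic(z)

print(round(p_lpm, 3), round(p_logit, 3))
```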
Reference: Textbook §16.1–16.3 (LPM, logit); Prof. Notes Ch. 16, “§1–3 Binary Choice Models.” Adapted from Fall 2022 Q2 and Q4.
Question 14
We model whether someone buys an item (\(Y = 1\)) based on advertising exposure (in minutes). The logit model estimates are:
\[\log\left(\frac{P}{1-P}\right) = -0.8 + 0.15 \times Advertising\]
What is the marginal effect of advertising on the probability of buying at \(Advertising = 30\) minutes?
- 0.0035
- 0.0033
- 0.0048
- 0.0042
- None of the given answers are correct
Hint: You can compute this two ways: (1) \(\Lambda'(z) \cdot \beta\) or (2) \(P(31) - P(30)\).
Correct Answer: (a) or (b) — both accepted
Method 1: Analytical marginal effect \(= \Lambda'(z) \cdot \beta\)
\(z = -0.8 + 0.15(30) = 3.7\)
\[ME = \frac{e^{-3.7}}{(1 + e^{-3.7})^2} \times 0.15 = \frac{0.02472}{(1.02472)^2} \times 0.15 \approx 0.00353\]
Method 2: Discrete difference \(P(31) - P(30)\)
\(P(30) = \frac{1}{1 + e^{-3.7}} = 0.97587\); \(\quad P(31) = \frac{1}{1 + e^{-3.85}} = 0.97916\)
\[P(31) - P(30) = 0.97916 - 0.97587 = 0.00329 \approx 0.0033\]
Both methods are valid. The analytical derivative gives \(\approx 0.0035\); the discrete change gives \(\approx 0.0033\). The marginal effect is small because \(P\) is already close to 1 (the logistic curve is flat at extreme values).
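The two methods can be compared side by side in code. This sketch uses the identity \(\Lambda'(z) = \Lambda(z)\,(1 - \Lambda(z))\), which is algebraically the same as the expression in Method 1; the function names are illustrative.

```python
import math

def logistic(z):
    """Logistic CDF, Lambda(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = -0.8, 0.15

def prob(adv):
    """Predicted purchase probability at a given advertising exposure (minutes)."""
    return logistic(b0 + b1 * adv)

p30 = prob(30)
me_derivative = p30 * (1 - p30) * b1   # Lambda'(z) = Lambda(z) * (1 - Lambda(z))
me_discrete = prob(31) - p30           # effect of one additional minute

print(round(me_derivative, 4), round(me_discrete, 4))
```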
Reference: Textbook §16.3.1 (“Marginal Effects in the Logit Model”); Prof. Notes Ch. 16, “§3 Marginal Effects.” From Spring 2024 Q29.
Question 15
In the logistic regression model, the log odds of buying an item are given by:
\[\log\left(\frac{P}{1-P}\right) = 0.8 + 0.15 \times Advertising\]
What does a log odds value of 2 mean in terms of the probability of buying the item?
- The probability of buying the item is 0.88
- The probability of buying the item is 0.90
- The probability of buying the item is 0.95
- The probability of buying the item is 0.97
- We cannot find the solution from the information given
Correct Answer: (a)
If \(\log\left(\frac{P}{1-P}\right) = 2\), then:
\[\frac{P}{1-P} = e^2 \approx 7.389\]
Solve for \(P\):
\[P = e^2 \cdot (1 - P) \implies P(1 + e^2) = e^2 \implies P = \frac{e^2}{1 + e^2} = \frac{7.389}{8.389} \approx 0.881\]
This is the same as applying \(\Lambda(2) = \frac{1}{1 + e^{-2}} = \frac{e^2}{1 + e^2} = 0.88\).
Reference: Textbook §16.2 (“The Logistic Distribution,” logit link function); Prof. Notes Ch. 16, “§2 The Logistic Model.” From Spring 2024 Q30.
Question 16
For each scenario, which model is most appropriate?
Scenario A: UCLA surveys student satisfaction on a scale of 1–5 (Very Dissatisfied to Very Satisfied) and wants to model it based on facility usage and year of study.
Scenario B: A researcher studies the duration of unemployment (weeks). Some individuals find jobs during the study; others are still unemployed when it ends.
- Ordered Logit; LPM
- Multinomial Logit; Ordered Logit
- Tobit; LPM
- Ordered Logit; Tobit
- None of the given combinations are sufficient
Correct Answer: (d)
Scenario A — Ordered Logit:
- DV has a natural ranking (1 < 2 < 3 < 4 < 5)
- Not multinomial: the categories have a meaningful order
- Not LPM: the outcome is not binary
Scenario B — Tobit:
- The DV (weeks unemployed) is continuous but censored — for those still unemployed, we observe a lower bound, not the true duration
- Not OLS: censoring causes OLS estimates to be biased
- Not logit: the outcome is not binary or categorical
Match the model to the data structure: ordered categories \(\implies\) ordered logit; censored continuous data \(\implies\) tobit.
Reference: Textbook §16.4 (Ordered Logit), §16.6 (Tobit); Prof. Notes Ch. 16, “§4–6 Extensions of Binary Choice.” Adapted from Spring 2024 Q33–34.
Question 17
Suppose we are given the following confusion matrix from a Logit model with \(N = 10{,}000\):
| | Predicted: Buy (1) | Predicted: Not Buy (0) |
|---|---|---|
| Actual: Buy (1) | 4000 | 1000 |
| Actual: Not Buy (0) | 500 | 4500 |
If the accuracy of the Probit model is 0.80 and the accuracy of the LPM is 0.75, what is the accuracy of the Logit model and which model should be chosen?
- 0.50; Logit
- 0.75; Probit
- 0.80; Probit or Logit
- 0.85; Logit
- 0.90; Logit
Correct Answer: (d)
Accuracy = (correct predictions) / (total predictions):
\[\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} = \frac{4000 + 4500}{4000 + 500 + 1000 + 4500} = \frac{8500}{10000} = 0.85\]
Comparing all three models:
- LPM: 0.75
- Probit: 0.80
- Logit: 0.85 \(\leftarrow\) highest accuracy
Choose the Logit model based on accuracy.
Confusion matrix components:
- \(TP = 4000\) (correctly predicted buy), \(TN = 4500\) (correctly predicted not buy)
- \(FP = 500\) (predicted buy but didn’t), \(FN = 1000\) (predicted not buy but did)
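The accuracy calculation and the model comparison can be sketched as follows. The cell counts come from the confusion matrix and the Probit and LPM accuracies from the question; the dictionary layout is just one convenient way to organize them.

```python
# Cell counts read from the confusion matrix above.
TP, FN = 4000, 1000   # actual buyers: predicted buy / predicted not buy
FP, TN = 500, 4500    # actual non-buyers: predicted buy / predicted not buy

accuracy = {
    "LPM": 0.75,                                   # given in the question
    "Probit": 0.80,                                # given in the question
    "Logit": (TP + TN) / (TP + FP + FN + TN),      # computed from the matrix
}
best = max(accuracy, key=accuracy.get)
print(accuracy["Logit"], best)
```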
Reference: Prof. Notes Ch. 17, “Model Evaluation: Confusion Matrix and Accuracy.” Original question combining confusion matrix with model comparison.
Ch 17: Regularization & Machine Learning
Question 18
Which of the following best defines bias in the context of machine learning models?
- The error introduced in a model due to excessive sensitivity to small fluctuations in the training data
- The variability of model predictions for a given data point or value, indicating the spread of model predictions
- The error that occurs when a model is too simple to capture the underlying patterns in the data
- The ability of a model to perform well on unseen data by balancing complexity and simplicity
- The tendency of a model to memorize the training data rather than generalize from it
Correct Answer: (c)
The three components of expected prediction error:
\[E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]
- Bias (c): Error from simplifying assumptions. A model that is too simple misses real patterns \(\implies\) underfitting.
- Variance (a)/(b): Error from sensitivity to training data. A model that is too complex fits noise \(\implies\) overfitting.
- Irreducible error: Random noise in the data that no model can eliminate.
- (d) describes the goal of balancing bias and variance (generalization).
- (e) describes overfitting (high variance, low bias).
The bias-variance tradeoff: increasing model complexity decreases bias but increases variance. The optimal model minimizes total error.
Reference: Textbook §17.1 (“Bias-Variance Tradeoff”); Prof. Notes Ch. 17, “§1 The Bias-Variance Decomposition.” From Spring 2024 Q36.
Question 19
Which of the following statements correctly describe LASSO regression?
- (i) It reduces variance at the expense of higher bias
- (ii) It reduces bias at the expense of higher variance
- (iii) It uses a penalty term of the form \(\lambda \sum |\beta_j|\)
- (iv) It uses a penalty term of the form \(\lambda \sum \beta_j^2\)
- (i) and (iii) only
- and (iv) only
- and (iv) only
- and (iii) only
- None of the given combinations correctly describe LASSO regression
Correct Answer: (a)
LASSO (Least Absolute Shrinkage and Selection Operator):
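- (i) Correct: Shrinking coefficients reduces variance but introduces bias (tradeoff)
- (ii) Incorrect: This reverses the tradeoff; shrinkage accepts more bias in exchange for lower variance
- (iii) Correct: LASSO penalty is \(\lambda \sum |\beta_j|\) (L1 norm / absolute value)
- (iv) Incorrect: \(\lambda \sum \beta_j^2\) is the Ridge (L2) penalty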
LASSO vs. Ridge comparison:
| | LASSO | Ridge |
|---|---|---|
| Penalty | \(\lambda \sum \lvert\beta_j\rvert\) (L1) | \(\lambda \sum \beta_j^2\) (L2) |
| Feature selection? | Yes (sets coefficients to exactly 0) | No (shrinks toward 0) |
| Best when | Many irrelevant predictors | Many small effects |
Both LASSO and Ridge reduce variance at the cost of bias. The difference is that LASSO performs automatic feature selection.
Reference: Textbook §17.3 (“LASSO and Ridge Regression”); Prof. Notes Ch. 17, “§3 Regularization Methods.” Adapted from Spring 2024 Q38–39.
Question 20
For LASSO Regression, if the tuning parameter \(\lambda = 0\), what does it mean?
- The loss function is the same as the ordinary least squares loss function
- The LASSO regression turns into a Ridge regression model
- It shrinks the coefficients of less important features to exactly 0
- The regularization term becomes infinitely large, eliminating all features
- None of the given answers are true of LASSO if the tuning parameter \((\lambda) = 0\)
Correct Answer: (a)
The LASSO loss function is:
\[\min_\beta \sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^p |\beta_j|\]
When \(\lambda = 0\):
- The penalty term \(\lambda \sum |\beta_j| = 0 \cdot \sum |\beta_j| = 0\)
- The loss function becomes \(\sum (y_i - x_i'\beta)^2\) \(\implies\) ordinary least squares
- No shrinkage occurs \(\implies\) all coefficients are unrestricted
What happens as \(\lambda\) changes:
- \(\lambda = 0\): OLS (no regularization)
- Small \(\lambda\): slight shrinkage, a few coefficients may hit zero
- Large \(\lambda\): heavy shrinkage, most/all coefficients forced to zero
- \(\lambda \to \infty\): all coefficients \(= 0\) (intercept-only model)
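The effect of \(\lambda\) is easiest to see in a special case. The sketch below assumes orthonormal predictors (an assumption made here for illustration, not stated in the question), in which case the objective above separates by coefficient and each LASSO estimate is the OLS estimate soft-thresholded at \(\lambda / 2\). The OLS values are hypothetical and the intercept is ignored.

```python
def lasso_one_coef(b_ols, lam):
    """LASSO solution for one coefficient when predictors are orthonormal.

    Minimizes (beta - b_ols)**2 + lam * |beta|, which matches the objective
    above up to a constant under the orthonormality assumption.
    """
    shrunk = max(abs(b_ols) - lam / 2.0, 0.0)
    if shrunk == 0.0:
        return 0.0
    return shrunk if b_ols > 0 else -shrunk

b_ols = [3.0, -1.0, 0.4]  # hypothetical OLS estimates
for lam in [0.0, 1.0, 4.0, 10.0]:
    print(lam, [lasso_one_coef(b, lam) for b in b_ols])
```

With \(\lambda = 0\) the function returns the OLS estimates unchanged, and as \(\lambda\) grows the smallest coefficients reach exactly zero first.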
Reference: Textbook §17.3 (“The Tuning Parameter \(\lambda\)”); Prof. Notes Ch. 17, “§3 Regularization Methods.” From Spring 2024 Q40.