14 Qualitative and Limited Dependent Variable Models
Binary, Ordered, Multinomial, Count, Censored, and Truncated Outcomes
This chapter is a hub for the full range of models for qualitative and limited dependent variables. It introduces the common thread — why OLS fails for non-continuous outcomes — and provides a roadmap to the sub-pages that develop each model family in detail.
You should be comfortable with OLS regression, conditional expectation, and the idea of maximum likelihood estimation before reading this chapter.
Everything we’ve done so far assumes the dependent variable is continuous and unbounded. But many economic outcomes aren’t like that:
- People choose whether to work or not (binary)
- Students pick a college major (unordered categorical)
- Credit agencies assign bond ratings (ordered categorical)
- Researchers count the number of patents a firm files (count)
- Some people report zero hours worked because they don’t participate in the labor force (censored)
For all of these, OLS is the wrong tool. This chapter introduces the right ones.
14.1 Why OLS Fails for Non-Continuous Outcomes
Consider modeling whether someone drives to work (\(y = 1\)) or takes the bus (\(y = 0\)). If we just run OLS, we get the linear probability model (LPM):
\[ y_i = \beta_1 + \beta_2 x_i + e_i \]
The fitted values \(\hat{y}\) are interpreted as probabilities, but the model has structural problems:
Predictions outside \([0, 1]\): OLS can predict \(\hat{y} = -0.3\) or \(\hat{y} = 1.4\). Neither is a valid probability.
Constant marginal effects: OLS assumes a one-unit change in \(x\) always changes the probability by \(\beta_2\). But probabilities are bounded — the effect near \(P = 0.5\) should be larger than near \(P = 0\) or \(P = 1\).
Heteroskedasticity: When \(y\) is binary, \(\text{Var}(y \mid x) = P(1-P)\) depends on \(x\), violating homoskedasticity by construction.
These problems are not unique to binary outcomes. Similar structural mismatches arise across the board:
- Count data: OLS can predict \(\hat{y} = -1.7\) doctor visits. Negative counts are impossible, and the additive marginal effect ignores the multiplicative structure of count processes.
- Ordered categories: Coding “poor/fair/good” as 1/2/3 and running OLS assumes equal spacing between categories and can predict fractional or out-of-range values.
- Censored data: When many observations pile up at zero (e.g., hours worked), OLS on the full sample flattens the slope, while OLS on positives only introduces selection bias.
The solution in each case is to specify a model that respects the structure of the dependent variable. The sub-pages below develop each one in detail.
To make the first problem concrete, suppose the fitted LPM relates the probability of driving to years of education: \(\hat{p} = -0.2 + 0.08 \times \text{EDUC}\). For EDUC = 2: \(\hat{p} = -0.2 + 0.08 \times 2 = -0.04\). A negative probability. And for EDUC = 20: \(\hat{p} = -0.2 + 0.08 \times 20 = 1.4\). A probability above 1. Both are impossible. The LPM’s linear structure simply cannot respect the \([0, 1]\) bounds. Probit and logit solve this by passing the linear index through a CDF that maps any real number to a valid probability.
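The arithmetic in the example above can be checked directly. A minimal sketch, using the illustrative intercept \(-0.2\) and slope \(0.08\) from the example (not estimates from real data):

```python
# Fitted 'probability' from the illustrative LPM: p_hat = -0.2 + 0.08 * EDUC.
# The linear form cannot stay inside [0, 1].
def lpm_prediction(educ, intercept=-0.2, slope=0.08):
    return intercept + slope * educ

for educ in (2, 20):
    p = lpm_prediction(educ)
    print(f"EDUC = {educ:2d}: p_hat = {p:+.2f}  valid probability? {0 <= p <= 1}")
```

Both fitted values fall outside \([0, 1]\), which is exactly the structural failure the probit and logit link functions repair.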
14.2 The Unifying Idea: Latent Variables
Many of these models share a common framework. Behind the observed (discrete or censored) outcome is an unobserved latent variable \(y^*\) that is continuous:
\[ y_i^* = x_i'\beta + e_i \]
We observe a transformation of \(y^*\):
- Binary choice: We observe \(y = 1\) when \(y^* > 0\) (the person works when the net benefit is positive). The distribution of \(e\) determines the model: normal errors give probit, logistic errors give logit.
- Ordered choice: We observe which interval \(y^*\) falls into, with cutpoints \(\mu_1 < \mu_2 < \ldots\) estimated from the data.
- Tobit: We observe \(y^*\) directly when it is positive, but only observe \(y = 0\) when \(y^* \leq 0\).
- Heckman selection: Two latent variables — one governing participation, one governing the outcome — with correlated errors.
Count data models (Poisson, negative binomial) use a different foundation — a log-link function and distributional assumptions on the count process — but share the same principle: match the statistical model to the data-generating process.
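The latent-variable mechanism for binary choice can be sketched in a few lines. The coefficients below are arbitrary illustrative values: with standard normal errors, \(P(y = 1 \mid x) = \Phi(x'\beta)\) (probit), and with logistic errors, \(P(y = 1 \mid x) = \Lambda(x'\beta)\) (logit). Simulating \(y^*\) directly confirms the probit probability:

```python
import math
import random

random.seed(42)

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

beta0, beta1 = -1.0, 0.5        # arbitrary illustrative coefficients
x = 3.0                         # evaluate at one covariate value
index = beta0 + beta1 * x       # the linear index x'beta

p_probit = norm_cdf(index)               # e ~ N(0, 1)
p_logit = 1 / (1 + math.exp(-index))     # e ~ logistic

# Simulate the latent variable y* = index + e and observe y = 1[y* > 0]
draws = 100_000
share = sum((index + random.gauss(0, 1)) > 0 for _ in range(draws)) / draws
print(f"probit: {p_probit:.4f}, logit: {p_logit:.4f}, simulated share: {share:.4f}")
```

The simulated share of \(y = 1\) outcomes matches \(\Phi(x'\beta)\), while the logit probability differs slightly because the logistic CDF has heavier tails.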
14.3 Estimation: Why Not OLS?
All of the models in this chapter are nonlinear in the parameters. We cannot estimate them with OLS because the relationship between the dependent variable and the regressors passes through a nonlinear function (a CDF, an exponential, or a censoring rule). Instead, we use maximum likelihood estimation (MLE): find the parameter values that make the observed data most likely given the model.
MLE has desirable large-sample properties — consistency, asymptotic normality, and efficiency — but comes with a cost: it requires specifying the distribution of the errors. If the distributional assumption is wrong (e.g., assuming normal errors in probit when the true distribution has heavier tails), the estimates may be inconsistent. This is a stronger requirement than OLS, where consistency holds without distributional assumptions. Each sub-page discusses the specific distributional assumptions for its model and what happens when they are violated.
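To make MLE concrete, here is a bare-bones sketch of estimating a logit by Newton-Raphson on a tiny made-up dataset (six observations, purely illustrative). The algorithm climbs the log-likelihood until the score (gradient) is zero:

```python
import math

# Tiny made-up dataset: binary outcome y and one regressor x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

def loglik_grad_hess(b0, b1):
    """Logit log-likelihood, score, and Hessian terms for (b0, b1)."""
    ll = g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
        g0 += y - p                 # score w.r.t. intercept
        g1 += (y - p) * x           # score w.r.t. slope
        w = p * (1 - p)             # observation weight
        h00 += w; h01 += w * x; h11 += w * x * x
    return ll, g0, g1, h00, h01, h11

# Newton-Raphson: step by (Hessian)^-1 * score; the logit likelihood is concave
b0 = b1 = 0.0
for _ in range(30):
    ll, g0, g1, h00, h01, h11 = loglik_grad_hess(b0, b1)
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

ll, g0, g1, *_ = loglik_grad_hess(b0, b1)
print(f"MLE: b0 = {b0:.3f}, b1 = {b1:.3f}, log-likelihood = {ll:.3f}")
```

At the solution the score is zero and the log-likelihood exceeds its value at the null parameters, which is what "the parameter values that make the observed data most likely" means operationally.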
14.4 Common Themes Across Models
Several ideas recur throughout the sub-pages:
Marginal effects are not coefficients. In every nonlinear model, the raw coefficient \(\beta_k\) does not directly tell you the effect of a one-unit change in \(x_k\) on the outcome. The marginal effect depends on where the observation sits in the distribution. Applied work reports either the average marginal effect (AME) or the marginal effect at the mean (MEM). Each sub-page derives the specific marginal effect formula for its model and explains how to interpret it.
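For a probit with a continuous regressor, the marginal effect at a point is \(\phi(x'\beta)\,\beta_k\); the AME averages that quantity over the sample, while the MEM evaluates it at the sample mean. A minimal sketch with made-up coefficients and data shows the two are not equal in general:

```python
import math

def norm_pdf(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# Made-up probit fit: index = b0 + b1 * x (illustrative values, not estimates)
b0, b1 = -1.5, 0.4
xs = [1, 2, 3, 4, 5, 6, 7, 8]          # hypothetical sample covariate values

def me(x):
    """Marginal effect of x at a point: dP/dx = phi(x'beta) * b1."""
    return norm_pdf(b0 + b1 * x) * b1

ame = sum(me(x) for x in xs) / len(xs)   # average marginal effect
mem = me(sum(xs) / len(xs))              # marginal effect at the mean
print(f"AME = {ame:.4f}, MEM = {mem:.4f}")
```

Here the MEM overstates the typical effect because the sample mean sits near the steep middle of the CDF while many observations sit in its flatter tails.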
Model selection follows the dependent variable. The table in the Model Selection Summary below maps each type of dependent variable to the appropriate model. The first step in any analysis is to characterize the outcome — binary, ordered, unordered categorical, count, or censored — and let that structure dictate the model.
Testing and diagnostics. Each model family has its own diagnostic tools:
- Binary choice: Wald and likelihood ratio tests for coefficient significance; McFadden’s pseudo-\(R^2\) and percent correctly predicted for fit
- Multinomial logit: Hausman-McFadden test for IIA
- Count data: Overdispersion test (\(H_0: \alpha = 0\)); Vuong test for zero-inflation
- Tobit: Comparison of OLS-all, OLS-positives, and Tobit slopes as an informal specification check
- Heckman: Significance of the inverse Mills ratio coefficient as a test for selection bias
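Two of these diagnostics, the likelihood ratio test and McFadden's pseudo-\(R^2\), are simple functions of the fitted and null log-likelihoods. A sketch with made-up log-likelihood values (not from real estimates):

```python
# Hypothetical log-likelihoods for illustration only
llf_full = -401.3     # fitted model
llf_null = -452.6     # intercept-only model
q = 3                 # number of restrictions (slopes set to zero)

lr_stat = 2 * (llf_full - llf_null)     # LR statistic ~ chi-square(q) under H0
pseudo_r2 = 1 - llf_full / llf_null     # McFadden's pseudo-R^2

crit_5pct_3df = 7.815                   # chi-square 5% critical value, 3 df
print(f"LR = {lr_stat:.1f} (reject at 5%: {lr_stat > crit_5pct_3df})")
print(f"McFadden pseudo-R2 = {pseudo_r2:.3f}")
```

Note that McFadden's pseudo-\(R^2\) is a likelihood comparison, not a share of explained variance, so its values run well below OLS \(R^2\) levels on the same data.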
14.5 Chapter Roadmap
14.5.1 Binary Choice Models
When the outcome is yes/no (admitted/rejected, employed/unemployed, default/no default), logit and probit pass the linear index through an S-shaped CDF to constrain predictions to \([0, 1]\). This is the foundational model for the chapter — most of the other models generalize the same latent variable framework introduced here. The page builds the intuition visually, starting from the LPM’s three failures and moving through the latent variable framework. Topics include:
- Maximum likelihood estimation and why OLS cannot be used
- Marginal effects: average marginal effect (AME) vs. marginal effect at the mean (MEM)
- Odds ratios and the log-odds interpretation of logit coefficients
- The Wald test, likelihood ratio test, and McFadden’s pseudo-\(R^2\)
- When the LPM is an acceptable approximation (Angrist and Pischke’s defense)
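One of the topics above, the odds-ratio reading of logit coefficients, can be verified numerically: a one-unit increase in \(x\) multiplies the odds \(p/(1-p)\) by \(e^{\beta_1}\), no matter where \(x\) starts. A sketch with made-up coefficients:

```python
import math

b0, b1 = -0.5, 0.7        # illustrative logit coefficients

def p(x):
    """Logit probability at x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(x):
    return p(x) / (1 - p(x))

# The odds ratio for a one-unit change equals exp(b1) regardless of x
for x in (0.0, 2.0, 5.0):
    print(f"x = {x}: odds(x+1)/odds(x) = {odds(x + 1) / odds(x):.4f}")
print(f"exp(b1) = {math.exp(b1):.4f}")
```

This constancy on the odds scale is exactly why logit output is often reported as odds ratios, even though the effect on the probability scale still varies with \(x\).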
14.5.2 Ordered Choice Models
Outcomes like survey ratings (1–5), credit grades (AAA to D), or health status (poor/fair/good/excellent) have a natural ranking but unknown distances between categories. Coding them as integers and running OLS imposes equal spacing and can predict out-of-range values. Ordered probit and logit model a latent continuous variable partitioned by estimated cutpoints, respecting the ordinal structure without assuming cardinality. Topics include:
- Why coding categories as integers and running OLS is wrong
- The cutpoint mechanism: how covariates shift the distribution across all categories simultaneously
- Marginal effects that sum to zero across categories, with middle categories that can move in either direction
- Discrete differences for binary regressors
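The cutpoint mechanism can be sketched directly. With made-up cutpoints and slope (illustrative, not estimates), the category probabilities are the interval probabilities of \(y^*\), so they sum to one, and the marginal effects across categories sum to zero with the extreme categories moving in opposite directions:

```python
import math

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Made-up ordered-probit values: three cutpoints => four categories
cut = [-0.8, 0.5, 1.6]
beta = 0.3

def probs(x):
    """P(y = j | x) for j = 1..4 as interval probabilities of y* = beta*x + e."""
    idx = beta * x
    cdf = [norm_cdf(m - idx) for m in cut]
    return [cdf[0], cdf[1] - cdf[0], cdf[2] - cdf[1], 1 - cdf[2]]

x = 1.0
p = probs(x)
print([round(v, 4) for v in p], "sum =", round(sum(p), 6))

# Numerical marginal effects: must sum to zero across categories
h = 1e-6
dp = [(a - b) / h for a, b in zip(probs(x + h), probs(x))]
print("marginal effects:", [round(v, 4) for v in dp])
```

With \(\beta > 0\), raising \(x\) shifts the latent distribution rightward: the lowest category loses probability, the highest gains it, and the middle categories can move either way.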
14.5.3 Multinomial Logit
When individuals choose among three or more unordered alternatives (bus, train, car), separate binary logits fail because probabilities won’t sum to one. Multinomial logit uses a softmax probability across all alternatives, with \(J - 1\) sets of coefficients relative to a base category. Topics include:
- The random utility foundation and softmax probabilities
- Individual-specific variables (MNL) vs. alternative-specific variables (conditional logit)
- Marginal effects that depend on all probabilities and sum to zero across alternatives
- The Independence of Irrelevant Alternatives (IIA) assumption and the red bus/blue bus problem
- Alternatives when IIA fails: nested logit, mixed logit, multinomial probit
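The softmax mechanism and the IIA property can both be seen in a few lines. The alternative-specific indices below are made up; the base alternative's index is normalized to zero:

```python
import math

# Made-up indices V_j = x'beta_j for three commute alternatives
V = {"bus": 0.0, "train": 0.4, "car": 1.1}

def softmax(vals):
    """Multinomial logit choice probabilities across the offered alternatives."""
    ex = {k: math.exp(v) for k, v in vals.items()}
    total = sum(ex.values())
    return {k: e / total for k, e in ex.items()}

p = softmax(V)
print({k: round(v, 4) for k, v in p.items()})

# IIA: the odds between two alternatives ignore what else is on the menu
p_no_car = softmax({k: v for k, v in V.items() if k != "car"})
print(round(p["train"] / p["bus"], 4), round(p_no_car["train"] / p_no_car["bus"], 4))
```

Dropping "car" from the choice set leaves the train/bus odds unchanged at \(e^{0.4}\). That invariance is exactly what the red bus/blue bus example shows to be implausible when alternatives are close substitutes.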
14.5.4 Count Data Models
Dependent variables that count events (doctor visits, patents filed, arrests made) are non-negative integers, often right-skewed with many zeros. OLS can predict negative counts and imposes additive effects where multiplicative effects are more natural. Poisson regression models the conditional mean through a log link (\(\mu = e^{X\beta}\)), guaranteeing non-negative predictions with a semi-elasticity interpretation: a one-unit increase in \(x_k\) multiplies the expected count by \(e^{\beta_k}\). Topics include:
- The log link and semi-elasticity interpretation of Poisson coefficients
- Equidispersion (\(E[Y] = \text{Var}(Y)\)) and why real data almost always violate it
- Consequences of ignoring overdispersion: consistent point estimates but unreliable standard errors
- The negative binomial model and its overdispersion parameter \(\alpha\)
- Zero-inflated models for excess zeros
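The log link and its multiplicative interpretation can be sketched with made-up Poisson coefficients: predictions are positive for any \(x\), and a one-unit increase in \(x\) multiplies the expected count by \(e^{\beta_1}\):

```python
import math

# Made-up Poisson values: log(mu) = b0 + b1 * x (illustrative, not estimates)
b0, b1 = 0.2, 0.15

def mu(x):
    """Conditional mean count through the log link."""
    return math.exp(b0 + b1 * x)

for x in (0.0, 3.0):
    print(f"x = {x}: mu = {mu(x):.3f}, mu(x+1)/mu(x) = {mu(x + 1) / mu(x):.4f}")
print(f"exp(b1) = {math.exp(b1):.4f}")   # the common multiplicative factor
```

Even at implausibly negative covariate values the predicted count stays above zero, in contrast to the OLS prediction of \(-1.7\) doctor visits mentioned earlier.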
14.5.5 The Tobit Model
When a continuous outcome piles up at a boundary (hours worked at zero, charitable donations at zero, expenditure on luxury goods), the data are censored: we observe everyone in the sample, but the latent variable is clipped at the boundary. The zeros are not missing data — they represent a corner solution. OLS on all observations underestimates the slope because the zeros flatten the regression line; OLS on positives only suffers from selection bias because the subsample of positive observations is non-random. Topics include:
- The censored data likelihood: normal density for positives, CDF for zeros
- Three distinct marginal effects (latent variable, extensive margin, unconditional mean)
- The McDonald-Moffitt decomposition into intensive and extensive margin components
- Censoring vs. truncation: when the zeros are missing entirely, Tobit is inappropriate
- The single-index restriction and when it fails
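The two OLS failures described above are easy to reproduce by simulation. A minimal sketch (made-up parameters, fixed seed): generate a latent outcome, censor it at zero, and compare the OLS slopes to the true latent slope:

```python
import random

random.seed(0)

# Simulate a censored outcome: y* = -2 + 2x + e, observe y = max(y*, 0)
n = 5000
true_b1 = 2.0
data = []
for _ in range(n):
    x = random.uniform(0, 3)
    y_star = -2.0 + true_b1 * x + random.gauss(0, 1)
    data.append((x, max(y_star, 0.0)))

def ols_slope(pairs):
    """Simple-regression slope: covariance over variance."""
    m = len(pairs)
    mx = sum(x for x, _ in pairs) / m
    my = sum(y for _, y in pairs) / m
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

slope_all = ols_slope(data)                               # zeros flatten the line
slope_pos = ols_slope([d for d in data if d[1] > 0])      # non-random subsample
print(f"true = {true_b1}, OLS all = {slope_all:.3f}, OLS positives = {slope_pos:.3f}")
```

Both OLS slopes come out well below the true latent slope of 2: the full sample is flattened by the pile-up at zero, and the positive subsample is selected on the error term.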
14.5.6 Heckman Selection
The Tobit model assumes the same parameters govern both the participation decision and the outcome. When this single-index restriction fails — the decision to work may depend on social norms and childcare availability, while hours worked depend on wages and commute time — the Heckman selection model separates the two equations. This connects the limited dependent variable framework back to the endogeneity and selection bias themes from earlier chapters. Topics include:
- Sample selection bias: why OLS on the selected sample is inconsistent
- The inverse Mills ratio as a sufficient statistic for selection bias
- The two-step procedure: probit selection equation, then OLS with the Mills ratio correction
- The exclusion restriction: why identification requires a variable that affects participation but not the outcome
- Full information maximum likelihood as an alternative to two-step
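The correction term at the heart of the two-step procedure, the inverse Mills ratio \(\lambda(z) = \phi(z)/\Phi(z)\), is a simple function of the probit index from the selection equation. A sketch evaluating it at a few hypothetical index values:

```python
import math

def norm_pdf(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def inverse_mills(z):
    """lambda(z) = phi(z) / Phi(z), the Heckman correction regressor."""
    return norm_pdf(z) / norm_cdf(z)

# The correction is largest for people who barely made it into the sample
for z in (-2.0, 0.0, 2.0):
    print(f"probit index = {z:+.1f}: inverse Mills ratio = {inverse_mills(z):.4f}")
```

The ratio is large when participation is unlikely and shrinks toward zero when participation is near-certain, which is why omitting it biases OLS most in heavily selected samples.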
14.6 Model Selection Summary
| Dependent Variable | Model | Sub-page |
|---|---|---|
| Binary (0/1) | LPM, Probit, Logit | Binary Choice |
| Ordered categories | Ordered Probit, Ordered Logit | Ordered Choice |
| Unordered categories (3+) | Multinomial Logit, Conditional Logit | Multinomial Logit |
| Count (0, 1, 2, …) | Poisson, Negative Binomial | Count Data |
| Censored continuous | Tobit | Tobit |
| Truncated continuous | Truncated Regression | — |
| Selected sample | Heckman Selection | Heckman |
14.7 Slide Deck
Follow the sub-page links in the roadmap above for the full treatment of each model family with visual intuition and worked examples. For regularization and prediction methods, see Regularization. To review the panel data methods that precede this chapter, see Panel Data.