16  Ordered Choice Models

Modeling Ordinal Outcomes with Latent Variables and Cutpoints

Limited Dependent Variables
Ordered Choice
Maximum Likelihood
Author

Jake Anderson

Published

March 3, 2026

Modified

March 4, 2026

Abstract

Job satisfaction rated 1-5, credit ratings AAA to D, health status poor/fair/good/excellent — these outcomes are ordinal but not cardinal. Ordered choice models posit a latent continuous variable partitioned by estimated cutpoints. Covariates shift the distribution, changing the probability of each category.

Note: Prerequisites

This page assumes familiarity with the latent variable framework and maximum likelihood estimation from Qualitative and Limited Dependent Variable Models. The probit/logit background in Binary Choice Models is especially relevant since ordered models generalize binary ones to multiple ordered categories.

16.1 OLS on Ordinal Outcomes

A firm surveys employees on job satisfaction, rated 1 (Poor) through 5 (Excellent). Management wants to know whether wages, weekly hours, and supervisor support predict satisfaction. The natural first instinct is to code the categories as integers and run OLS:

\[ \text{Rating}_i = \beta_0 + \beta_1 \text{Wage}_i + \beta_2 \text{Hours}_i + \beta_3 \text{Support}_i + \varepsilon_i \tag{16.1}\]

This treats the categories as equally spaced on a cardinal scale. But there is no reason to believe that the gap between “Poor” (1) and “Fair” (2) is the same size as the gap between “Good” (4) and “Excellent” (5): ordinal data tells us the ranking but not the spacing. OLS imposes equal spacing anyway by treating each one-unit increase as identical.

Figure 16.1: OLS on ordinal data: the linear fit forces equal spacing between categories and allows predicted values outside the 1-5 range.

Beyond the spacing problem, OLS can predict values outside the valid range (e.g., \(\hat{y} = 0.3\) or \(\hat{y} = 5.7\)). It also forces constant marginal effects: a $1,000 wage increase always changes predicted satisfaction by the same amount, regardless of whether the worker is miserable or already content. These are the same issues that plague the linear probability model for binary outcomes, extended to the multi-category setting.

\(\implies\) We need a model that respects the ordinal nature of the dependent variable without imposing cardinal distances between categories.
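The out-of-range problem is easy to demonstrate. The sketch below fits OLS to a small made-up wage/rating sample (all numbers are illustrative, not estimates from any real survey) and shows the fitted line predicting outside the 1-5 scale once we extrapolate:

```python
import numpy as np

# Illustrative data: 10 workers, wage in $1,000s and a 1-5 satisfaction rating.
# The ratings are hypothetical, chosen so satisfaction rises with wage.
wage = np.arange(30, 80, 5, dtype=float)          # 30, 35, ..., 75
rating = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)

# OLS fit: rating = b0 + b1 * wage
X = np.column_stack([np.ones_like(wage), wage])
b0, b1 = np.linalg.lstsq(X, rating, rcond=None)[0]

# Predictions for wages outside the sample range walk off the 1-5 scale.
pred_high = b0 + b1 * 100.0   # a $100k worker
pred_low = b0 + b1 * 10.0     # a $10k worker
print(f"slope: {b1:.3f}, prediction at wage=100: {pred_high:.2f}, "
      f"at wage=10: {pred_low:.2f}")
```

The slope also illustrates the constant-marginal-effect problem: every $1,000 of wage moves the prediction by the same `b1`, at every point on the scale.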

16.2 The Latent Variable Model

The ordered probit/logit model assumes an unobserved (latent) continuous variable \(y_i^*\) drives the observed rating:

\[ y_i^* = \beta_1 \text{Wage}_i + \beta_2 \text{Hours}_i + \beta_3 \text{Support}_i + \varepsilon_i \tag{16.2}\]

The latent variable represents the worker’s underlying satisfaction on a continuous scale. We never observe \(y_i^*\) directly. Instead, it gets mapped to the discrete rating through a set of cutpoints (also called thresholds) \(\mu_1 < \mu_2 < \mu_3 < \mu_4\):

\[ \text{Rating}_i = \begin{cases} 1 & \text{if } y_i^* \leq \mu_1 \\ 2 & \text{if } \mu_1 < y_i^* \leq \mu_2 \\ 3 & \text{if } \mu_2 < y_i^* \leq \mu_3 \\ 4 & \text{if } \mu_3 < y_i^* \leq \mu_4 \\ 5 & \text{if } y_i^* > \mu_4 \end{cases} \tag{16.3}\]

For \(J\) categories, there are \(J - 1\) cutpoints. The cutpoints are estimated alongside the \(\beta\) coefficients via maximum likelihood.

With \(J - 1\) cutpoints and no intercept, the model has \(J - 1 + K\) parameters, where \(K\) is the number of regressors. Binary probit is the special case \(J = 2\), one cutpoint.
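The mapping in Equations 16.2-16.3 can be simulated in a few lines. The coefficients and cutpoints below are assumed values chosen for illustration, not estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for the Eq. 16.2-16.3 setup (assumed values).
beta = np.array([0.03, -0.02, 0.5])           # wage, hours, support
cutpoints = np.array([-1.8, -0.5, 0.6, 1.9])  # mu_1 < mu_2 < mu_3 < mu_4

n = 1000
X = np.column_stack([
    rng.normal(50, 10, n),   # wage ($1,000s)
    rng.normal(40, 5, n),    # weekly hours
    rng.normal(0, 1, n),     # supervisor support index
])

# Latent satisfaction: y* = X beta + eps, with eps ~ N(0, 1) for ordered probit.
y_star = X @ beta + rng.standard_normal(n)

# Map y* to ratings 1..5: np.digitize counts how many cutpoints y* exceeds.
rating = np.digitize(y_star, cutpoints) + 1
```

We never see `y_star`, only `rating`; estimation has to recover both `beta` and `cutpoints` from the discrete outcomes alone.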

Figure 16.2: The latent variable density \(f(y^*)\) partitioned by cutpoints. Each shaded region corresponds to a category.

The intercept and the cutpoints are not separately identified. If you added an intercept \(\beta_0\) to Equation 16.2, you could always shift \(\beta_0\) up by some constant \(c\) and shift every cutpoint \(\mu_j\) up by the same \(c\) without changing any of the category probabilities. To resolve this, the convention is to drop the intercept and let the cutpoints absorb the location of the distribution.

This is the same identification issue that forces us to normalize \(\sigma = 1\) in binary probit: the latent scale has no natural units, so we need to fix enough parameters to pin it down.

Figure 16.3: Changes in covariates shift the latent distribution relative to the fixed cutpoints. Higher wages shift the density rightward, increasing the probability of higher ratings.

The figure above shows the core mechanism. The cutpoints stay fixed in place. When a covariate changes (e.g., wages increase), the entire latent distribution shifts. A rightward shift (positive \(\beta\)) pushes more probability mass past the upper cutpoints, raising the likelihood of higher categories.

16.3 The Probability Formula

The probability of observing category \(j\) for individual \(i\) comes directly from the latent variable model. Since \(\text{Rating}_i = j\) whenever \(\mu_{j-1} < y_i^* \leq \mu_j\):

\[ P(\text{Rating}_i = j) = F(\mu_j - X_i\beta) - F(\mu_{j-1} - X_i\beta) \tag{16.4}\]

where \(F\) is the CDF of \(\varepsilon\), \(X_i\beta = \beta_1 \text{Wage}_i + \beta_2 \text{Hours}_i + \beta_3 \text{Support}_i\), and we set \(\mu_0 = -\infty\) and \(\mu_J = +\infty\) for the boundary categories.

  • Ordered probit: \(F = \Phi\) (standard normal CDF)
  • Ordered logit: \(F = \Lambda\) (logistic CDF)

The cutpoints \(\mu_j\) are estimated alongside the \(\beta\)’s via MLE. The log-likelihood sums \(\ln P(\text{Rating}_i = j_i)\) over all observations, where \(j_i\) is the category actually observed for person \(i\).
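As a sketch of how the MLE works, the following simulates data from a one-regressor ordered probit (true values are assumptions of the simulation) and maximizes the log-likelihood of Equation 16.4 directly with scipy. Parameterizing the cutpoints as a first cutpoint plus exponentiated increments is one convenient way to keep them ordered during optimization:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulated data from a one-regressor ordered probit (assumed true values).
n, beta_true = 3000, 0.8
x = rng.standard_normal(n)
cut_true = np.array([-1.0, 0.0, 1.2])  # J = 4 categories
y = np.digitize(x * beta_true + rng.standard_normal(n), cut_true)  # 0..3

def neg_loglik(theta):
    """Negative log-likelihood of Eq. 16.4 with F = Phi.

    theta = (beta, mu_1, log-increments), so cutpoints stay ordered."""
    beta = theta[0]
    cuts = np.concatenate(([theta[1]], theta[1] + np.cumsum(np.exp(theta[2:]))))
    lo = np.concatenate(([-np.inf], cuts))[y]   # mu_{j-1} for each obs
    hi = np.concatenate((cuts, [np.inf]))[y]    # mu_j for each obs
    p = norm.cdf(hi - x * beta) - norm.cdf(lo - x * beta)
    return -np.sum(np.log(np.clip(p, 1e-300, None)))

res = minimize(neg_loglik, x0=np.array([0.0, -1.0, 0.0, 0.0]), method="BFGS")
beta_hat = res.x[0]
cuts_hat = np.concatenate(([res.x[1]], res.x[1] + np.cumsum(np.exp(res.x[2:]))))
print(beta_hat, cuts_hat)
```

With a few thousand observations the estimates should land close to the simulated `beta_true` and `cut_true`; canned routines (e.g. statsmodels' ordinal model) do the same maximization with analytic conveniences.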

16.3.1 Numeric Example

Suppose the estimated model yields \(\hat{\mu}_1 = -1.8\), \(\hat{\mu}_2 = -0.5\), \(\hat{\mu}_3 = 0.6\), \(\hat{\mu}_4 = 1.9\), and for a particular worker \(X_i\hat{\beta} = 0.4\). Using ordered probit (\(F = \Phi\)):

\[ \begin{aligned} P(\text{Rating} = 1) &= \Phi(-1.8 - 0.4) = \Phi(-2.2) = 0.014 \\ P(\text{Rating} = 2) &= \Phi(-0.5 - 0.4) - \Phi(-1.8 - 0.4) = \Phi(-0.9) - \Phi(-2.2) = 0.170 \\ P(\text{Rating} = 3) &= \Phi(0.6 - 0.4) - \Phi(-0.5 - 0.4) = \Phi(0.2) - \Phi(-0.9) = 0.395 \\ P(\text{Rating} = 4) &= \Phi(1.9 - 0.4) - \Phi(0.6 - 0.4) = \Phi(1.5) - \Phi(0.2) = 0.354 \\ P(\text{Rating} = 5) &= 1 - \Phi(1.9 - 0.4) = 1 - \Phi(1.5) = 0.067 \end{aligned} \tag{16.5}\]

The five probabilities sum to 1. This worker is most likely to report a rating of 3 or 4.
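These numbers are quick to verify. A minimal check of the worked example using scipy's normal CDF, with \(\mu_0 = -\infty\) and \(\mu_5 = +\infty\) handling the boundary categories as in Equation 16.4:

```python
import numpy as np
from scipy.stats import norm

# Cutpoints and index from the worked example.
cuts = np.array([-1.8, -0.5, 0.6, 1.9])
xb = 0.4

# F(mu_j - Xb) - F(mu_{j-1} - Xb), with mu_0 = -inf and mu_5 = +inf (Eq. 16.4).
upper = np.append(cuts, np.inf)
lower = np.insert(cuts, 0, -np.inf)
probs = norm.cdf(upper - xb) - norm.cdf(lower - xb)

print(np.round(probs, 3))   # category probabilities for ratings 1..5
print(probs.sum())          # total probability is exactly 1
```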

Figure 16.4: Predicted probability distribution across the five categories for a specific worker.

16.4 Interpretation and Marginal Effects

16.4.1 Coefficients Give Direction Only

Unlike OLS, the \(\beta\) coefficients in ordered probit/logit do not directly tell you the magnitude of the effect on any particular category. A positive \(\beta_k\) means that increasing \(x_k\) shifts the latent distribution to the right, which:

  • Always increases \(P(\text{highest category})\)
  • Always decreases \(P(\text{lowest category})\)
  • Has an ambiguous effect on middle categories

\(\implies\) Always compute marginal effects. Never interpret raw ordered model coefficients as though they were OLS slopes.

16.4.2 The Marginal Effect Formula

For a continuous covariate \(x_k\):

\[ \frac{\partial P(\text{Rating} = j)}{\partial x_k} = \left[ f(\mu_{j-1} - X\beta) - f(\mu_j - X\beta) \right] \cdot \beta_k \tag{16.6}\]

where \(f\) is the PDF corresponding to \(F\) (standard normal density \(\phi\) for probit, logistic density \(\lambda\) for logit).

For the extreme categories, the marginal effect simplifies because one of the boundary terms vanishes (\(f(-\infty) = 0\) or \(f(+\infty) = 0\)):

\[ \frac{\partial P(\text{Rating} = 1)}{\partial x_k} = -f(\mu_1 - X\beta) \cdot \beta_k \tag{16.7}\]

\[ \frac{\partial P(\text{Rating} = J)}{\partial x_k} = f(\mu_{J-1} - X\beta) \cdot \beta_k \tag{16.8}\]

These extreme-category effects have unambiguous signs: if \(\beta_k > 0\), the probability of the highest category increases and the probability of the lowest category decreases (and the reverse if \(\beta_k < 0\)).

For middle categories, the sign of the marginal effect depends on the bracketed term \([f(\mu_{j-1} - X\beta) - f(\mu_j - X\beta)]\), which can be positive or negative depending on where the individual’s \(X\beta\) falls relative to the cutpoints.

Figure 16.5: Marginal effects of wages on each category’s probability. The effects on extreme categories have clear signs; the middle category effect is ambiguous.
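Equation 16.6 is straightforward to evaluate. The sketch below reuses the cutpoints and index from the worked example with a hypothetical wage coefficient \(\beta_{\text{wage}} = 0.05\); setting \(\mu_0 = -\infty\) and \(\mu_5 = +\infty\) (where the normal pdf vanishes) makes the extreme-category formulas in Equations 16.7-16.8 fall out automatically:

```python
import numpy as np
from scipy.stats import norm

# Worked-example cutpoints and index; beta_wage is an assumed value.
cuts = np.array([-1.8, -0.5, 0.6, 1.9])
xb, beta_wage = 0.4, 0.05

# Eq. 16.6 for every category j = 1..5, using mu_0 = -inf and mu_5 = +inf.
upper = np.append(cuts, np.inf)
lower = np.insert(cuts, 0, -np.inf)
me = (norm.pdf(lower - xb) - norm.pdf(upper - xb)) * beta_wage

print(np.round(me, 4))   # marginal effect on each category's probability
print(me.sum())          # the effects telescope to (numerically) zero
```

Note the signs: the lowest category's effect is negative and the highest positive, matching \(\beta_{\text{wage}} > 0\), while the middle-category effects can go either way.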

16.4.3 Marginal Effects Sum to Zero

Across all \(J\) categories, the marginal effects must sum to zero:

\[ \sum_{j=1}^{J} \frac{\partial P(\text{Rating} = j)}{\partial x_k} = 0 \tag{16.9}\]

If an increase in \(x_k\) pushes probability mass toward higher categories, that mass must come from somewhere. The total probability is always 1, so gains in some categories are offset by losses in others.

Consider a worker whose \(X\beta\) places them right at cutpoint \(\mu_3\). A small increase in wages shifts the latent distribution rightward. This pulls probability mass from both sides of \(\mu_3\). Mass from category 3 shifts into category 4, and mass from category 2 shifts into category 3. If the outflow (3 to 4) exceeds the inflow (2 to 3), the net effect on \(P(\text{Rating} = 3)\) is negative, despite \(\beta > 0\).

The sign of the middle-category effect depends on the shape of the density at the two adjacent cutpoints. When the density is higher at the upper cutpoint than the lower one, the outflow dominates.

16.4.4 Binary Covariates: Discrete Differences

For a binary variable (e.g., a training program indicator), compute the discrete change rather than the derivative. Calculate predicted probabilities with the dummy set to 0 and then to 1, holding other covariates at their means or observed values:

\[ \Delta P(\text{Rating} = j) = P(\text{Rating} = j \mid D = 1) - P(\text{Rating} = j \mid D = 0) \]
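A sketch of the discrete-difference calculation, reusing the worked example's cutpoints with a hypothetical training-dummy coefficient of 0.3 and the other covariates fixed at an index of 0.4:

```python
import numpy as np
from scipy.stats import norm

def category_probs(xb, cuts):
    """Eq. 16.4: P(rating = j) for j = 1..J given index value xb."""
    upper = np.append(cuts, np.inf)
    lower = np.insert(cuts, 0, -np.inf)
    return norm.cdf(upper - xb) - norm.cdf(lower - xb)

# Hypothetical values: training coefficient 0.3, remaining index 0.4.
cuts = np.array([-1.8, -0.5, 0.6, 1.9])
base_xb, beta_train = 0.4, 0.3

# Discrete change: P(j | D = 1) - P(j | D = 0) for each category j.
delta = category_probs(base_xb + beta_train, cuts) - category_probs(base_xb, cuts)
print(np.round(delta, 3))
```

Like the derivative-based effects, these discrete changes sum to zero across categories: probability mass that flows into the high ratings must come out of the low ones.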

16.5 The Parallel Regressions Assumption

Ordered probit and logit impose a restriction called the parallel regressions assumption (also known as the proportional odds assumption in the logit case): the same coefficient vector \(\beta\) applies at every cutpoint. Only the cutpoints \(\mu_j\) differ across category boundaries.

This means that a $1,000 wage increase shifts the latent distribution by the same amount regardless of whether we are looking at the boundary between “Poor” and “Fair” or the boundary between “Good” and “Excellent.” The effect of each covariate is the same across all thresholds.

Testing the parallel regressions assumption: estimate a series of binary models (collapsing categories at each cutpoint) and compare coefficients. If they differ substantially, the assumption may not hold. The Brant test formalizes this comparison.
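The informal version of this check can be sketched with a hand-rolled binary probit MLE: collapse the outcome at each cutpoint, estimate the slope, and compare. Here the data are simulated so that parallel regressions holds by construction, so the three slopes should agree up to sampling error:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)

# Simulate data where parallel regressions holds by construction.
n, beta_true = 4000, 0.7
x = rng.standard_normal(n)
rating = np.digitize(x * beta_true + rng.standard_normal(n),
                     [-1.0, 0.0, 1.0]) + 1   # ratings 1..4

def probit_slope(y01):
    """Slope from a binary probit of y01 on x (intercept included)."""
    def nll(theta):
        p = np.clip(norm.cdf(theta[0] + theta[1] * x), 1e-12, 1 - 1e-12)
        return -np.sum(y01 * np.log(p) + (1 - y01) * np.log(1 - p))
    return minimize(nll, x0=np.zeros(2), method="BFGS").x[1]

# Collapse at each cutpoint: 1{rating > j} for j = 1, 2, 3.
slopes = [probit_slope((rating > j).astype(float)) for j in (1, 2, 3)]
print(np.round(slopes, 3))   # all near beta_true under parallel regressions
```

In real data, slopes that diverge sharply across cutpoints are evidence against the assumption; this eyeball comparison is what the Brant test turns into a formal statistic.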

When this assumption fails, the model is misspecified. Consider:

  • Generalized ordered logit: allows \(\beta\) to differ across cutpoints
  • Multinomial logit: drops the ordering assumption entirely (but discards ordering information)

In practice, the parallel regressions assumption is a strong restriction. Always think about whether it is reasonable for the specific application. If the factors driving someone from “Poor” to “Fair” are fundamentally different from those driving someone from “Good” to “Excellent,” ordered probit/logit may not be appropriate.

16.6 Ordered vs. Multinomial

Both ordered and multinomial models handle dependent variables with more than two categories. The choice depends on whether the categories have a natural ranking.

Choosing between ordered and multinomial models.

| Feature | Ordered Probit/Logit | Multinomial Logit |
|---|---|---|
| Category structure | Natural ordering (low to high) | No ordering (unranked) |
| Number of \(\beta\) vectors | 1 (shared across cutpoints) | \(J - 1\) (one per non-base category) |
| Parsimony | More parsimonious | More flexible |
| Uses ordering info | Yes | No (ignores ranking) |
| Example | Satisfaction ratings, credit grades | Transport mode, brand choice |

Decision rule:

  • If categories have a clear ranking (satisfaction: low \(<\) medium \(<\) high), use ordered probit/logit.
  • If categories are unranked (transport: car, bus, train), use multinomial logit.
  • If categories are ranked but the parallel regressions assumption fails, consider generalized ordered logit or, as a fallback, multinomial logit.

\(\implies\) Using multinomial logit on ordered data wastes information. Using ordered models on unordered data imposes a false ranking. Match the model to the data structure.


16.8 Summary

Comparing approaches for categorical dependent variables.

| Approach | Assumption | Strength | Weakness |
|---|---|---|---|
| OLS on ordinal | Equal spacing, unbounded | Simple | Predicted values outside range; ignores ordinal structure |
| Ordered Probit | Latent normal errors; parallel regressions | Respects ordering; parsimonious | Coefficients not directly interpretable |
| Ordered Logit | Latent logistic errors; proportional odds | Odds-ratio interpretation available | Same parallel regressions restriction |
| Multinomial Logit | IIA; no ordering | Fully flexible across categories | Ignores ordering; requires more parameters |

The core idea across all ordered choice models: an unobserved continuous variable is partitioned by estimated cutpoints. Covariates shift the distribution, and the cutpoints translate that shift into changes in category probabilities. The sign of \(\beta\) determines the direction for extreme categories but not necessarily for middle ones. Always report marginal effects.

Tip: What’s next?

Multinomial Logit covers unordered categorical outcomes, where the IIA assumption and alternative-specific covariates become central. See also the hub page for the full limited dependent variable roadmap.