2 Heteroskedasticity

Non-Constant Error Variance and Robust Inference

Heteroskedasticity

Inference

Cross-Section

Author

Jake Anderson

Published

March 3, 2026

Modified

May 17, 2026

Abstract

The error variance in a regression is allowed to differ across observations. When it does, OLS still produces unbiased and consistent coefficient estimates, but the usual standard error formula is wrong, and every test built on it is unreliable. This chapter develops the intuition for why the variance changes in realistic data, traces exactly where the OLS standard error formula breaks down, and shows three ways to fix inference: heteroskedasticity-consistent standard errors, weighted least squares, and feasible generalized least squares.

2.1 Motivation

Picture two households. The Lees bring in $30,000 a year; the Garcias bring in $200,000. Both have to eat, so both spend something on groceries. The Lees’ food spending is fairly predictable: when most of your income is locked into rent, utilities, and a car payment, you don’t have a lot of room to swing your weekly grocery bill. The Garcias’ food spending is much harder to predict from income alone. Some weeks they cook at home; some weeks they spend half a paycheck on a tasting menu downtown.

If we sat both households down and asked, “Given your income, how much will you spend on food this week?”, our answer for the Lees would be tight around the average. For the Garcias, the answer would have a much wider band of plausible values around the same average.

That widening band is the heart of the chapter. Formally, write the regression

\[ \text{food}_i = \beta_0 + \beta_1\,\text{income}_i + e_i. \tag{2.1}\]

The variance of the error term, conditional on income, may not be the same for everyone:

\[ \operatorname{Var}(e_i \mid x_i) = \sigma_i^2. \tag{2.2}\]

When all the $\sigma_i^2$’s are equal to one common $\sigma^2$, we have homoskedasticity (Greek for “same scatter”). When they differ across $i$, we have heteroskedasticity (“different scatter”). The textbook regression assumptions you learned in ECON 103 baked in homoskedasticity. Real economic data almost never does.

Heteroskedasticity means the variance of $e_i$ depends on $i$ (typically through $x_i$). The conditional mean assumption $\operatorname{E}[e_i \mid x_i] = 0$ is still imposed; only the variance changes.

2.2 Why Does the Variance Change?

Heteroskedasticity shows up whenever the spread of possible outcomes around the regression line genuinely depends on $x$. A few concrete patterns recur in applied work.

Discretion in spending. This is the food-expenditure story. When income is low, almost all of it is committed to necessities; there is little discretion and the household’s spending is tightly determined by the budget. When income is high, the household has the latitude to splurge, save, invest, or donate, and any of those choices can swing food spending by hundreds of dollars in a week. Higher income $\implies$ more freedom $\implies$ wider variance.

Returns to schooling. Regress hourly wage on years of education. At low education levels (high school dropouts, GED holders), wages cluster fairly tightly. At high education levels, wages fan out enormously. A PhD economist and a PhD English literature scholar can have the same number of years of education and earnings that differ by an order of magnitude. The error variance grows with education because the range of jobs and outcomes available at that level grows.

Firm size. Profits at large firms swing by millions of dollars from one quarter to the next; profits at small firms swing by thousands. Regress quarterly profit on firm size and the residuals will be tightly bunched for the small firms and wildly dispersed for the large firms.

Experience and earnings. Early-career workers cluster around an entry wage that the labor market sets fairly narrowly. As workers accumulate experience, paths diverge. Some get promoted, some switch industries, some leave the workforce, some get stuck. Residuals from a wage-on-experience regression will be tight near zero experience and wide near retirement age.

The common thread: in every case the variance is high where the underlying outcome has more room to vary. That is heteroskedasticity, and it is the normal state of affairs in cross-sectional economic data.

2.2.1 Play with it

The plot below simulates a single regression at $n = 250$. The sliders control two things:

True slope ($\beta_1$): how strong the underlying relationship is.
Heteroskedasticity intensity ($\gamma$): how fast the error variance grows with $x$. At $\gamma = 0$ the errors are homoskedastic; at $\gamma = 1.5$ the funnel is dramatic.

Watch the OLS slope estimate vs the OLS SE as you crank $\gamma$. The slope stays close to the true value (OLS is unbiased even with heteroskedasticity). The OLS SE and the robust SE start to disagree, and the gap widens as $\gamma$ grows. That is the whole point of the chapter.

viewof slope_true = Inputs.range([0, 3], {label: "True slope (β₁)", step: 0.1, value: 1.5})
viewof gamma = Inputs.range([0, 1.5], {label: "Heteroskedasticity intensity (γ)", step: 0.05, value: 0.6})

n_obs = 250

sim_data = {
  // Deterministic seed so the cloud doesn't reshuffle when sliders move.
  const rng = d3.randomNormal.source(d3.randomLcg(7))(0, 1);
  const beta0 = 5;
  const xs = Array.from({length: n_obs}, (_, i) => 1 + (i + 0.5) * 29 / n_obs);
  return xs.map(x => {
    const sigma_i = Math.pow(x, gamma);
    return { x, y: beta0 + slope_true * x + sigma_i * rng() };
  });
}

fit_stats = {
  const n_ = sim_data.length;
  const mx = d3.mean(sim_data, d => d.x);
  const my = d3.mean(sim_data, d => d.y);
  const sxx = d3.sum(sim_data, d => (d.x - mx) ** 2);
  const sxy = d3.sum(sim_data, d => (d.x - mx) * (d.y - my));
  const b1 = sxy / sxx;
  const b0 = my - b1 * mx;
  const resid = sim_data.map(d => d.y - b0 - b1 * d.x);
  const ssr = d3.sum(resid, e => e * e);
  const sigma2 = ssr / (n_ - 2);
  const se_ols = Math.sqrt(sigma2 / sxx);
  const sandwich = d3.sum(sim_data.map((d, i) =>
    Math.pow(d.x - mx, 2) * resid[i] * resid[i]));
  const se_hc1 = Math.sqrt((sandwich / (sxx * sxx)) * n_ / (n_ - 2));
  return { b0, b1, resid, se_ols, se_hc1, mx, sxx };
}

ols_fit_line = [
  { x: 1,  y: fit_stats.b0 + fit_stats.b1 * 1 },
  { x: 30, y: fit_stats.b0 + fit_stats.b1 * 30 }
]

resid_data = sim_data.map((d, i) => ({ x: d.x, e: fit_stats.resid[i] }))

// Pin axes so changing gamma does NOT rescale the plots.
// Domain depends only on slope_true (and the x range), not on gamma.
y_max_scatter = 5 + slope_true * 30 + 80
y_min_scatter = -40
y_extent_resid = 80

html`<div style="display: flex; gap: 1em; flex-wrap: wrap; justify-content: center;">
  <div>${Plot.plot({
    width: 560, height: 360, grid: true,
    title: "Scatter with OLS fit",
    x: { label: "x", domain: [0, 31] },
    y: { label: "y", domain: [y_min_scatter, y_max_scatter] },
    marks: [
      Plot.dot(sim_data, { x: "x", y: "y", fill: "steelblue", fillOpacity: 0.5, r: 3 }),
      Plot.line(ols_fit_line, { x: "x", y: "y", stroke: "black", strokeWidth: 2 })
    ]
  })}</div>
  <div>${Plot.plot({
    width: 560, height: 360, grid: true,
    title: "Residual plot",
    x: { label: "x", domain: [0, 31] },
    y: { label: "residual", domain: [-y_extent_resid, y_extent_resid] },
    marks: [
      Plot.ruleY([0], { stroke: "gray", strokeDasharray: "3,3" }),
      Plot.dot(resid_data, { x: "x", y: "e", fill: "tomato", fillOpacity: 0.6, r: 3 })
    ]
  })}</div>
</div>`

html`<div style="display: flex; justify-content: center;">
<table style="border-collapse: collapse; margin-top: 0.5em;">
  <thead><tr>
    <th style="padding: 4px 16px; border-bottom: 1px solid #999;">Quantity</th>
    <th style="padding: 4px 16px; border-bottom: 1px solid #999; text-align: right;">Value</th>
  </tr></thead>
  <tbody>
    <tr><td style="padding: 2px 16px;">True slope (β₁)</td>
        <td style="padding: 2px 16px; text-align: right;">${slope_true.toFixed(2)}</td></tr>
    <tr><td style="padding: 2px 16px;">OLS slope estimate</td>
        <td style="padding: 2px 16px; text-align: right;"><b>${fit_stats.b1.toFixed(3)}</b></td></tr>
    <tr><td style="padding: 2px 16px;">OLS SE (assumes homoskedasticity)</td>
        <td style="padding: 2px 16px; text-align: right;">${fit_stats.se_ols.toFixed(3)}</td></tr>
    <tr><td style="padding: 2px 16px;">Robust SE (HC1)</td>
        <td style="padding: 2px 16px; text-align: right;">${fit_stats.se_hc1.toFixed(3)}</td></tr>
    <tr><td style="padding: 2px 16px;"><i>Ratio (robust / OLS)</i></td>
        <td style="padding: 2px 16px; text-align: right;"><i>${(fit_stats.se_hc1 / fit_stats.se_ols).toFixed(2)}×</i></td></tr>
  </tbody>
</table>
</div>`

A few things to try:

Set $\gamma = 0$ and confirm: the residual cloud is flat, the two SEs are nearly identical. With homoskedastic errors the OLS formula is correct.
Push $\gamma$ to $1.0$ and beyond. The residual cloud opens into a funnel, and the robust SE pulls away from the OLS SE.
With $\gamma$ large, drag the slope to $0$. The slope estimate hugs zero (unbiasedness is intact), but the SE comparison still tells you OLS would have given the wrong inference about it.
Compare the residual plot at $\gamma = 0$ to the same plot at $\gamma = 1$. Same axes, very different cloud shape, this is exactly what a residual diagnostic is supposed to surface.

Think: Can you come up with another example?

Some that come up in applied work:

County-level crime rates regressed on population: small counties have noisy rates because a few incidents move the per-capita number; big counties’ rates are stable.
Student test scores regressed on study time: barely any studying produces a narrow range of (low) scores; lots of studying opens up a wide range depending on how effective the studying is.
House prices regressed on square footage: small starter homes have a narrow price range; mansions vary enormously by location, lot, and finishings.

In each case the variance grows with $x$ because the outcome’s natural range grows with $x$.

2.3 What Goes Wrong with OLS

2.3.1 The Coefficients Are Still Fine

Here is the surprisingly clean piece of news: OLS coefficient estimates remain unbiased and consistent under heteroskedasticity. The argument is one line. Write the OLS slope formula:

\[ \hat\beta_1 = \beta_1 + \frac{\sum_i (x_i - \bar x)\,e_i}{\sum_i (x_i - \bar x)^2}. \tag{2.3}\]

Take the expectation conditional on $x$. The denominator is a function of $x$, so it sits inside the expectation as a constant. The numerator has $e_i$ in it, and we assumed $\operatorname{E}[e_i \mid x_i] = 0$. So $\operatorname{E}[\hat\beta_1 \mid x] = \beta_1$. Unbiasedness drops out. Consistency works the same way as $n \to \infty$.

Nothing in that argument used the variance of $e_i$. Whether $\operatorname{Var}(e_i \mid x_i) = \sigma^2$ for all $i$, or whether it varies wildly across observations, the OLS coefficient is fine.

The conditional mean assumption $\operatorname{E}[e_i \mid x_i] = 0$ is what keeps OLS unbiased. The variance assumption is separate.

2.3.2 But the Standard Errors Are Wrong

The damage shows up in the standard error formula. Under the homoskedastic regression model, the textbook SE is

\[ \widehat{\operatorname{Var}}_{\text{OLS}}(\hat\beta_1) = \frac{\hat\sigma^2}{\sum_i (x_i - \bar x)^2}, \qquad \hat\sigma^2 = \frac{\sum_i \hat e_i^2}{n - 2}. \tag{2.4}\]

Notice what $\hat\sigma^2$ is doing. It averages every squared residual into one single number and treats that as the variance of $e_i$ for every observation. That is exactly what homoskedasticity says is allowed.

Under heteroskedasticity it is wrong, and you can see the failure intuitively. If observation 5 has a much larger variance than observation 6, but we pool them into one average $\hat\sigma^2$, then we are over-stating the variance of observation 6 (which is actually tight) and under-stating the variance of observation 5 (which is actually wild). The single-number summary is a lie.

The true variance of the OLS slope under heteroskedasticity has each observation’s own $\sigma_i^2$ in the numerator:

\[ \operatorname{Var}(\hat\beta_1 \mid x) = \frac{\sum_i (x_i - \bar x)^2\,\sigma_i^2}{\Big[\sum_i (x_i - \bar x)^2\Big]^2}. \tag{2.5}\]

This is sometimes called the sandwich formula because the variance-weighted sum is sandwiched between two copies of $\sum (x_i - \bar x)^2$. The OLS formula in Equation 2.4 is what you get from Equation 2.5 only if every $\sigma_i^2$ equals one common $\sigma^2$. Drop that and the OLS formula misses the truth in a way that does not vanish as $n$ grows.

The OLS standard error is inconsistent under heteroskedasticity

“Inconsistent” is technical language for: more data does not save you. With homoskedastic errors and a large enough sample, your OLS SE converges to the true SE. With heteroskedastic errors, your OLS SE converges to the wrong number, no matter how big $n$ gets. Standard lm() output is reporting a confident answer to a question the formula is no longer entitled to answer.

Three consequences follow:

Standard errors are wrong, and the error doesn’t shrink with sample size.
$t$-statistics, $p$-values, and confidence intervals built on those standard errors are unreliable. You will sometimes reject the null when you shouldn’t, sometimes fail to reject when you should, and you cannot tell which is happening from the coefficient table.
OLS is no longer BLUE. It is still linear and unbiased, but the “B” in “Best Linear Unbiased Estimator” required the Gauss-Markov assumptions, and homoskedasticity was one of them. Some other linear unbiased estimator (in particular, weighted least squares with the right weights) has a smaller variance.

Why this is sneaky

Heteroskedasticity does not show up in the coefficient table. The estimates $\hat\beta_0, \hat\beta_1, \ldots$ look completely normal. The only way to see it is to look at the residuals or run a formal test. If you skip diagnostics, you can publish, defend, and even win awards on inference that is wrong.

2.4 Visual Detection: The Residual Plot

The first diagnostic is also the cheapest. Plot the residuals $\hat e_i$ against the fitted values $\hat y_i$ (or against a regressor $x_i$). A regression with homoskedastic errors should produce a featureless cloud of points: roughly constant vertical width across the whole horizontal range, no shape, no trend.

The classic heteroskedasticity warning sign is a funnel or megaphone pattern: the residuals start tightly bunched on the left and fan out to the right (or vice versa). This is exactly what we saw conceptually with the Lees and the Garcias. Plot food spending residuals against income and the high-income observations form a much taller cloud than the low-income ones.

A flat cloud is a green light to proceed with the homoskedastic SE formula. A funnel is a yellow light: it means the residuals are telling you something, and you should follow up with a formal test before reporting inference.

The plot is suggestive, not proof

A funnel-looking residual plot is consistent with heteroskedasticity but doesn’t prove it. Random sampling can produce funnel-shaped clouds even when the truth is homoskedastic. That is why we lean on formal tests: a test statistic and $p$-value is a number you can put in a paper.

2.5 Three Formal Tests

We will work through three tests in turn: Breusch-Pagan (BP), White, and Goldfeld-Quandt (GQ). Each makes a slightly different assumption about how the variance might depend on $x$. None is universally powerful; understanding when each works and when each fails is most of the practical knowledge.

2.5.1 Breusch-Pagan

The idea. If heteroskedasticity is present, the squared residuals $\hat e_i^2$ are estimates of the local variance $\sigma_i^2$. If $\sigma_i^2$ depends on $x_i$, then $\hat e_i^2$ should be predictable from $x_i$. Regress one on the other; if the relationship is statistically significant, reject homoskedasticity.

BP assumes the relationship is linear: as $x$ rises, the variance either rises or falls in a roughly straight line. The auxiliary regression is

\[ \hat e_i^2 = \gamma_0 + \gamma_1 x_i + v_i. \tag{2.6}\]

Under the null of homoskedasticity, $\gamma_1 = 0$. The test statistic is

\[ \text{BP} = n \cdot R^2_{\text{aux}} \;\sim\; \chi^2(k) \text{ under } H_0, \tag{2.7}\]

where $k$ is the number of slope regressors in the auxiliary regression (the intercept is not counted) and $R^2_{\text{aux}}$ is the $R^2$ from that auxiliary regression. Bigger $R^2$ means $x$ explains a bigger share of the variation in $\hat e_i^2$, which means stronger evidence of heteroskedasticity.

The procedure in four steps:

Run OLS on the original model. Save the residuals $\hat e_i$.
Regress $\hat e_i^2$ on the original regressors. Get $R^2_{\text{aux}}$.
Compute $\text{BP} = n \cdot R^2_{\text{aux}}$.
Compare to $\chi^2(k)$ at your chosen significance level. Reject if $\text{BP}$ exceeds the critical value.

The version of BP we use is the studentized (Koenker) form, which is what R’s bptest() reports by default and what HGL teaches.

When BP works and when it fails. BP looks for a linear trend in the squared residuals. So it works beautifully when the variance grows (or shrinks) more or less monotonically with $x$, which is the funnel case. It fails when the variance pattern is non-monotonic in a way that a line cannot detect:

U-shaped variance (high at both extremes, low in the middle): a line through this cloud is nearly flat. BP misses it.
Inverted-U variance (high in the middle, low at the extremes): same problem, flipped. The line averages out to flat.
Oscillating variance: any line through a wave is flat. BP misses it.
Clustered variance (two subgroups with different variances but overlapping ranges of $x$): variance depends on group membership, not on $x$. BP regresses $\hat e_i^2$ on $x$ and finds nothing.

The clustered case is important enough that we treat it separately with cluster-robust standard errors.

2.5.2 White

The White test relaxes BP’s linearity assumption. Instead of regressing $\hat e_i^2$ on just $x$, regress it on $x$, $x^2$, and (if you have multiple regressors) the cross-products. The auxiliary regression becomes

\[ \hat e_i^2 = \alpha_0 + \alpha_1 x_i + \alpha_2 x_i^2 + v_i \tag{2.8}\]

for a single regressor case. With $k$ regressors, you include all $k$ levels, all $k$ squares, and all $\binom{k}{2}$ pairwise cross-products. The test statistic has the same form, $\text{White} = n \cdot R^2_{\text{aux}}$, distributed $\chi^2(q)$ under $H_0$, where $q$ is the number of slope regressors in the auxiliary regression.

Where White outperforms BP. Adding the quadratic term lets the auxiliary regression curve. So the inverted-U and U-shaped patterns that BP misses, White picks up: the quadratic fit dips or peaks in the middle and rises (or falls) at the extremes, which is exactly the shape of the variance.

Where White still fails. A quadratic in $x$ has only one extremum. It cannot track multiple peaks and troughs, so oscillating variance still defeats White. And no polynomial in $x$ can capture group membership when variance depends on a separate categorical variable, so the clustered case still defeats White too.

White has a cost: degrees of freedom

The auxiliary regression for White has $q = k(k+3)/2$ slope regressors with $k$ original regressors. With $k = 5$, that is 20 terms. The test has many degrees of freedom, and power drops if you don’t have a large sample.

2.5.3 Goldfeld-Quandt

GQ takes a different approach. Instead of fitting a model for the variance, it compares the variance directly across two subsamples.

The idea. If variance grows with $x$, then a high-$x$ subsample should have a larger residual variance than a low-$x$ subsample. Sort, split, run separate regressions, compare.

The procedure.

Sort the observations by the suspect variable $x$.
Drop the middle $c$ observations (typically $c \approx n/5$). Throwing away the middle sharpens the contrast between low- and high-variance groups.
Run separate OLS on the bottom group and the top group. Compute $\text{SSE}_{\text{bot}}$ and $\text{SSE}_{\text{top}}$ (the residual sums of squares).
Form the ratio. By convention, put the group you suspect has larger variance on top:

\[ \text{GQ} = \frac{\text{SSE}_{\text{top}}/(n_{\text{top}} - k)}{\text{SSE}_{\text{bot}}/(n_{\text{bot}} - k)} \;\sim\; F(n_{\text{top}} - k,\,n_{\text{bot}} - k) \text{ under } H_0, \tag{2.9}\]

where $k$ is the number of regressors in each subsample regression (including intercept). This is a one-sided test: reject $H_0$ if $\text{GQ}$ exceeds the right-tail critical value of the $F$ distribution.

You have to know which side is bigger

GQ tests a one-sided alternative. You decide before running the test which group is supposed to have the bigger variance, and you put its SSE on top. If you guess wrong, you would need the left-tail critical value, which most $F$ tables don’t even list. Look at a residual plot first so you know which way the funnel opens.

Where GQ works. Linear up and linear down variance, with the correct group on top: GQ detects both cleanly. The ratio is large; the $F$ statistic blows past the critical value.

Where GQ fails. The U-shape and inverted-U undo GQ in a particularly devious way: when variance is high at both extremes, both the top and bottom groups have wide spread, and the dropped middle is exactly the low-variance region (or the high-variance region in the inverted case). The two retained groups end up with similar SSEs, and the ratio is close to 1. Clustered variance defeats GQ for the same reason as BP: sorting by $x$ doesn’t separate clusters that overlap in $x$, so both groups end up with a mix of high- and low-variance observations.

2.5.4 When to Use Which Test

Variance pattern	BP	White	GQ
Linear up/down	✓	✓	✓
U-shape	✗	✓	✗
Inverted U-shape	✗	✓	✗
Oscillating	✗	✗	✗
Clustered (subgroup)	✗	✗	✗

The practical advice: start with the residual plot to get a sense of the shape. If you see a clean funnel, BP or GQ will pick it up. If you see a hump or a bowl, reach for White. If you see clusters or oscillation, you need either cluster-robust standard errors or a different specification entirely.

2.6 Worked Example: Wages and Experience

We have 1,000 workers, each with observed experience (years on the job) and hourly wage. Estimate the linear model

\[ \text{wage}_i = \beta_0 + \beta_1\,\text{experience}_i + e_i. \tag{2.10}\]

OLS gives $\hat\beta_0 = 8.84$, $\hat\beta_1 = 1.08$, both significant at any conventional level. Each additional year of experience is associated with about $1.08 more per hour. The OLS $R^2$ is 0.56, the OLS slope SE is 0.030, and on the surface this looks like a very clean result.

But the residual plot shows a clear funnel: wages for high-experience workers are much more spread out than wages for entry-level workers, exactly as we predicted in Section 2.2. We need a formal test before trusting the SE.

Step 1, Hypotheses. Under the BP assumption that $\operatorname{Var}(e_i)$ is linear in experience:

\[ H_0: \gamma_1 = 0, \qquad H_1: \gamma_1 \neq 0. \]

Step 2, Auxiliary regression. Square the OLS residuals and regress them on experience:

\[ \hat e_i^2 = -18.59 + 5.93\,\text{experience}_i + v_i, \qquad R^2_{\text{aux}} = 0.158. \]

The slope is positive: squared residuals climb steeply with experience.

Step 3, Test statistic.

\[ \text{BP} = n \cdot R^2_{\text{aux}} = 1{,}000 \times 0.158 = 158.4. \]

Step 4, Critical value. The auxiliary regression has one slope regressor (experience), so we compare to $\chi^2(1)$. At $\alpha = 0.05$ the critical value is $3.841$.

Step 5, Conclusion. $158.4 \gg 3.841$, so we reject homoskedasticity at any conventional significance level. The OLS SE of $0.030$ is not a reliable estimate of the true sampling variance of $\hat\beta_1$.

Now run GQ on the same data as a cross-check. Sort by experience, drop the middle 200 (20%), run separate regressions on the bottom 400 and top 400 (each with 398 degrees of freedom after fitting the two-parameter model). The two SSEs come out to $\text{SSE}_{\text{bot}} = 6{,}962$ and $\text{SSE}_{\text{top}} = 50{,}159$. The ratio is

\[ \text{GQ} = \frac{50{,}159/398}{6{,}962/398} = \frac{50{,}159}{6{,}962} = 7.21, \]

against the one-sided critical value $F_{0.05}(398, 398) \approx 1.180$. Reject again, decisively. Both tests agree: variance grows strongly with experience.

Think: A researcher runs BP and gets $R^2_{\text{aux}} = 0.0847$ with $n = 88$. The auxiliary regression had two slope regressors. What is the test statistic, and what do they conclude at $\alpha = 0.05$?

Test statistic: $\text{BP} = 88 \times 0.0847 = 7.45$.

Compare to $\chi^2(2)$ at $\alpha = 0.05$: critical value is $5.99$.

Since $7.45 > 5.99$, reject homoskedasticity. Squared residuals are jointly predictable from the regressors at the 5% level.

2.7 Fixing the Problem: Three Approaches

The test rejects homoskedasticity. Now what? Three responses, ordered from least to most demanding.

Keep OLS, fix the standard error. Use a heteroskedasticity-consistent (HC) standard error formula. The coefficient estimates do not change; only the SE changes. This is the default in modern applied econometrics.
Use weighted least squares (WLS). If you know how the variance depends on $x$, you can re-weight the observations so that the transformed errors are homoskedastic. WLS is more efficient than OLS-with-robust-SEs when the weights are right.
Estimate the variance, then use feasible generalized least squares (FGLS). When you don’t know the variance function, you can estimate it from the residuals and then run WLS with the estimated weights.

We work through each in turn.

2.7.1 Option 1: Robust Standard Errors

The robust SE replaces the pooled $\hat\sigma^2$ in the OLS formula with each observation’s own squared residual. The HC0 (White) variance estimator is

\[ \widehat{\operatorname{Var}}_{\text{HC0}}(\hat\beta_1) = \frac{\sum_i (x_i - \bar x)^2\,\hat e_i^2}{\Big[\sum_i (x_i - \bar x)^2\Big]^2}. \tag{2.11}\]

Compare this side by side with the true variance Equation 2.5. The robust estimator plugs in $\hat e_i^2$ as a consistent estimator of $\sigma_i^2$, observation by observation, rather than averaging everything together. The estimator is consistent for the true sampling variance of $\hat\beta_1$ under heteroskedasticity of any form, which is why it is called heteroskedasticity-consistent.

HC1 applies a small-sample correction, multiplying HC0 by $n/(n-k)$. This is the default in most software (Stata’s robust, R’s hccm(model, type="hc1")). For large samples the difference between HC0 and HC1 is negligible; for small samples HC1 is the safer choice.

What changes when you switch to robust SEs:

The point estimate $\hat\beta_1$ does not change. You are still running OLS.
Only the standard error changes, which in turn changes the $t$ statistic, the $p$ value, and the width of the confidence interval.
When variance grows with $|x_i - \bar x|$ (the typical macro / micro case), the OLS SE is too small, so the robust SE is wider. But this is not guaranteed: in finite samples the robust SE can come out smaller than the OLS SE. The robust SE is correct either way; the OLS SE is wrong either way.

Think: Can the robust SE be smaller than the OLS SE?

Yes. The robust SE replaces the average squared residual with an observation-weighted sum. Whether that sum is larger or smaller than the average depends on whether the high-variance observations sit at high-leverage points (large $|x_i - \bar x|$). If they do, the robust SE is larger. If they don’t (or if by sampling variation the heavy residuals happen to land at the center), the robust SE can be smaller. Either way, it is the correct one to report.

In R the typical workflow is:

Show R code

library(car); library(lmtest); library(sandwich)

model <- lm(food_exp ~ income, data = food_data)

# Default OLS SEs (assumes homoskedasticity)
summary(model)

# HC1 robust SEs
coeftest(model, vcov. = vcovHC(model, type = "HC1"))

The default reporting convention in applied econometrics is to publish robust SEs in parentheses below your coefficients unless you have a strong reason to believe homoskedasticity holds. The cost of robust SEs when the truth is homoskedasticity is small; the cost of OLS SEs when the truth is heteroskedasticity is potentially the whole paper.

2.7.2 Option 2: Weighted Least Squares

Robust SEs fix inference but do not improve efficiency. The point estimates are still OLS, and OLS is no longer BLUE when errors are heteroskedastic. If you have a credible model for how the variance depends on $x$, weighted least squares can do better.

The intuition. Observations with smaller error variance carry more information about the regression line. We should weight them more heavily when we fit. If $\operatorname{Var}(e_i \mid x_i) = \sigma_i^2$, the WLS weight on observation $i$ is

\[ w_i = \frac{1}{\sigma_i^2}. \tag{2.12}\]

Low-variance observations get high weights; high-variance observations get low weights. WLS minimizes the weighted sum of squared residuals:

\[ \sum_i w_i(y_i - \beta_0 - \beta_1 x_i)^2 = \sum_i \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{\sigma_i^2}. \tag{2.13}\]

Why this is the same as transforming the model. Divide both sides of the regression equation by $\sigma_i$:

\[ \frac{y_i}{\sigma_i} = \beta_0 \cdot \frac{1}{\sigma_i} + \beta_1 \cdot \frac{x_i}{\sigma_i} + \frac{e_i}{\sigma_i}. \tag{2.14}\]

The transformed error $e_i^* = e_i/\sigma_i$ has unit variance, and OLS applied to the transformed variables is exactly homoskedastic. So WLS is OLS on the rescaled data, and Aitken’s theorem says it is BLUE provided the weights are right.

Common weight specifications. In applied work you rarely know $\sigma_i^2$ exactly, but you can sometimes argue that it has a particular form:

If $\operatorname{Var}(e_i) \propto x_i$, set $w_i = 1/x_i$.
If $\operatorname{Var}(e_i) \propto x_i^2$, set $w_i = 1/x_i^2$.
For grouped data where group $i$ has $n_i$ observations averaged into the row, $\operatorname{Var}(e_i) \propto 1/n_i$, so $w_i = n_i$.

Notice that you only need the weights up to a positive scaling constant: multiplying every $w_i$ by the same number leaves the WLS estimate unchanged.

R’s weights argument is a common exam trap

R’s lm() function expects the weights argument to be $w_i = 1/\sigma_i^2$, not $\sigma_i^2$. Under the hood, R divides each observation by $1/\sqrt{w_i}$. So if you believe $\operatorname{Var}(e_i) \propto \text{income}_i$, you write weights = 1/income, and R will divide each row by $\sqrt{\text{income}_i}$. A common student error is to write weights = income (the variance), which gives the wrong WLS estimate.

The catch. WLS is BLUE only when the weights are exactly right. If your variance specification is wrong, WLS is still consistent (the conditional mean assumption alone gives you that), but it might not be more efficient than OLS, and the WLS standard error formula will be wrong. So the move from OLS to WLS trades one assumption (homoskedasticity) for another (a specific variance form). If you can’t defend the specific form, you might be better off staying with OLS and using robust SEs.

2.7.3 Option 3: Feasible Generalized Least Squares

If you suspect the variance has some systematic form but don’t know exactly what, you can estimate it from the data. This is feasible GLS: “feasible” because we don’t know the true $\sigma_i^2$’s, so we estimate them, then use those estimates as if they were the truth.

The three steps.

Estimate the variance function. Run OLS, save residuals $\hat e_i$. Postulate a model for the log variance:

\[ \sigma_i^2 = \exp(\alpha_1 + \alpha_2 z_i), \tag{2.15}\]

where $z_i$ is whatever observable you think drives the variance ($x_i$, $\ln x_i$, or another variable). Logging the variance is what keeps fitted variances positive and turns a multiplicative variance model into something OLS can fit linearly. Regress the log squared residuals on $z_i$:

\[ \ln(\hat e_i^2) = \alpha_1 + \alpha_2 z_i + v_i. \tag{2.16}\]

The fitted variance is $\hat\sigma_i^2 = \exp(\hat\alpha_1 + \hat\alpha_2 z_i)$.

Decompose the variance into baseline times skedastic function. Write

\[ \hat\sigma_i^2 = \underbrace{e^{\hat\alpha_1}}_{\textstyle \hat\sigma^2}\,\cdot\,\underbrace{e^{\hat\alpha_2 z_i}}_{\textstyle \hat h_i}. \tag{2.17}\]

The first piece, $\hat\sigma^2$, is a constant baseline variance shared by every observation. The second piece, $\hat h_i$, is the skedastic function: it tells you how much observation $i$’s variance differs from the baseline. Dividing every observation by $\sqrt{\hat h_i}$ is enough to recover homoskedasticity; the constant $\hat\sigma$ has no effect on relative weighting and can be dropped.

Why does dividing by $\sqrt{\hat h_i}$ work? Define the rescaled error $e_i^* = e_i/\sqrt{h_i}$. Then

\[ \operatorname{Var}(e_i^* \mid x_i) = \frac{1}{h_i}\operatorname{Var}(e_i \mid x_i) = \frac{1}{h_i}\,\sigma^2 h_i = \sigma^2, \tag{2.18}\]

a constant. The $h_i$’s cancel and we are back in homoskedasticity-land.

Run OLS on the transformed data. Apply the rescaling to every column of the regression:

\[ y_i^* = \frac{y_i}{\sqrt{\hat h_i}}, \qquad x_{i0}^* = \frac{1}{\sqrt{\hat h_i}}, \qquad x_{i1}^* = \frac{x_{i1}}{\sqrt{\hat h_i}}. \tag{2.19}\]

You have to divide every column, including the intercept’s column of 1’s, because the transformation must apply consistently to both sides of the regression equation. Running OLS on $(y_i^*, x_{i0}^*, x_{i1}^*, \ldots)$ produces the FGLS estimator $\hat\beta^{\text{FGLS}}$.

Properties.

If the variance model is correctly specified, FGLS is consistent and asymptotically as efficient as the (infeasible) true GLS estimator that uses the actual $\sigma_i^2$’s.
If the variance model is misspecified, FGLS is still consistent for $\beta$ (the conditional mean assumption is doing the work), but the efficiency gain is not guaranteed and might disappear.
You can combine FGLS with robust standard errors for belt-and-suspenders inference: FGLS gives you efficiency in the typical case, and robust SEs protect against variance misspecification.

2.8 End-to-End: HGL Food Expenditure

Now run the whole pipeline on the textbook food-expenditure data ($n = 40$ households). The structural model is

\[ \text{food}_i = \beta_1 + \beta_2\,\text{income}_i + e_i. \]

OLS estimates (assuming homoskedasticity for the moment):

	$\hat\beta_1$ (intercept)	$\hat\beta_2$ (income)
Estimate	83.42	10.21
OLS SE	43.41	2.09

The $t$ statistic for the slope is $10.21 / 2.09 = 4.88$, significant at any conventional level. But the residual plot shows a clear funnel, and Goldfeld-Quandt yields $\text{GQ} = 3.61 > F_{0.05}(14, 14) = 2.48$, so we reject homoskedasticity. The OLS SE of 2.09 is not the number we should trust.

Two routes forward, A (robust SEs) and B (FGLS). Compute both.

Route A: robust standard errors.

	SE($\hat\beta_2$)	95% CI for $\beta_2$	CI width
OLS	2.09	(5.97, 14.45)	8.47
Robust (HC1)	1.81	(6.55, 13.87)	7.33

The robust SE is actually slightly smaller than the OLS SE in this sample. The OLS SE was technically wrong, but the direction of the error happens to make OLS slightly conservative this time. This is a useful reminder: the OLS SE is invalid under heteroskedasticity, but whether it over- or under-states the truth depends on how $\sigma_i^2$ correlates with $(x_i - \bar x)^2$ in the realized data. The coefficient stays at 10.21.

Route B: FGLS. Fit the variance function $\ln(\hat e_i^2) = \alpha_1 + \alpha_2 \ln(\text{income}_i)$. The estimated coefficients are $\hat\alpha_1 = 0.94$ and $\hat\alpha_2 = 2.33$, which says the variance grows roughly with the square of income. The skedastic function is $\hat h_i = \text{income}_i^{2.33}$. Rescale every observation by $1/\sqrt{\hat h_i}$ and rerun OLS:

\[ \widehat{\text{food}}^{\,\text{FGLS}} = 76.05 + 10.63\,\text{income}, \qquad \text{SE}(\hat\beta_2^{\text{FGLS}}) = 0.97. \]

The slope barely moves (10.21 to 10.63 is small relative to either standard error), confirming OLS was unbiased. But the FGLS SE is roughly half of both the OLS SE and the robust SE. That is the efficiency gain we earn from down-weighting the noisy high-income observations.

Comparison.

Method	$\hat\beta_2$	SE	95% CI
OLS (homosk. SE)	10.21	2.09	(5.97, 14.45)
OLS + Robust HC1	10.21	1.81	(6.55, 13.87)
FGLS ($h_i = \text{income}_i^{2.33}$)	10.63	0.97	(8.67, 12.60)

Reading the table:

The coefficient point estimate barely moves between OLS and FGLS, which is consistent with OLS being unbiased.
The OLS SE and the robust SE differ only modestly here. In other samples the gap can be much larger and can run either way.
The FGLS SE is half the size of the other two, because the variance model is correctly specified and FGLS uses that information to give more weight to the low-variance (low-income) observations.

The R code that produces this table:

Show R code

library(car); library(lmtest); library(sandwich)

# 1. OLS
m_ols <- lm(food ~ income, data = food_data)
summary(m_ols)

# 2. OLS with HC1 robust SEs
V_hc1 <- vcovHC(m_ols, type = "HC1")
coeftest(m_ols, vcov. = V_hc1)

# 3. FGLS: estimate variance function, then refit with weights
log_resid_sq <- log(resid(m_ols)^2)
var_fit      <- lm(log_resid_sq ~ log(income), data = food_data)
h_hat        <- exp(fitted(var_fit))
m_fgls       <- lm(food ~ income, data = food_data, weights = 1/h_hat)
summary(m_fgls)

Think: Suppose you don’t trust the variance function. Which estimator should you report?

Robust SEs from OLS. They are valid under any heteroskedasticity pattern (the auxiliary regression for FGLS doesn’t have to be right), so they protect you against misspecification of the variance function. The cost is the efficiency you gave up by not using the variance information; the benefit is robustness if your variance story is wrong.

Many applied papers report both, side by side: “the column-(2) FGLS result is more efficient if our variance model is right, but the column-(3) OLS-with-robust-SEs result is valid even if it isn’t.”

2.9 Choosing Between OLS+Robust, WLS, and FGLS

A quick decision rule:

OLS with robust SEs is the safest default. You give up efficiency if there is a strong variance pattern you could be exploiting, but you don’t have to defend a variance specification. This is what most applied papers report.
WLS is the right choice when you have a credible argument for the variance form (grouped data with known group sizes, a clear theoretical reason variance scales with $x^2$, etc.). The efficiency gain is real only if the weights are right.
FGLS is for when you suspect variance depends on $x$ in some specific way but don’t know the exact form. Estimate the variance function from the residuals, then transform.

In every case, the coefficient point estimate is similar, OLS is unbiased even when its standard errors are wrong. The choice is about how to get correct, and ideally efficient, inference around the coefficient.

2.10 What’s Next

Heteroskedasticity in cross-sections leads directly to several more advanced topics:

Cluster-robust standard errors handle the “clustered variance” case we saw repeatedly fail BP, White, and GQ: variance differs by group rather than by $x$.
Time series introduces a different kind of dependence: errors that are correlated across observations rather than just differently scaled. The fix uses HAC (heteroskedasticity-and-autocorrelation-consistent) standard errors, a generalization of the robust SEs in this chapter.
Panel data combines cross-section and time series and inherits both problems.

In every case the same logic from this chapter carries: figure out where the textbook OLS standard error formula stops being valid, write down the correct sandwich form, and either replace the SE estimator or transform the data to restore the assumption.

Download slides (PDF)

Download presentation slides (with transitions) (PDF)