3 Time Series

Modeling Temporal Dependence in Economic Data

Time Series

Forecasting

Serial Correlation

Author

Jake Anderson

Published

March 3, 2026

Modified

May 17, 2026

Abstract

Cross-sectional regression assumes observations are independent draws. Time-series data violate that assumption: today’s outcome carries information about tomorrow’s. This chapter develops the toolkit for taking that dependence seriously: when stationarity holds (and when it doesn’t), how to specify dynamics in the model itself (AR, DL, ARDL), what happens when dynamics leak into the errors, how to detect serial correlation, and how to correct inference with HAC standard errors. It closes with unit-root testing and forecasting.

3.1 Motivation

Pull up any macroeconomic series. Quarterly GDP growth, monthly unemployment, daily stock returns, weekly oil prices. Look at consecutive observations. They are not independent draws. Today’s GDP is mostly yesterday’s GDP plus a small change. The unemployment rate this month is mostly the rate last month. Even daily stock returns, which are close to unpredictable, carry a hint of volatility clustering.

This temporal dependence has nothing to do with bias in measurement or anything we can sample our way out of. It is a property of the data-generating process. When you have \(n\) household incomes drawn from a national survey, the second household tells you nothing about the first. When you have \(T\) quarters of GDP, the second quarter is built on the first.

Two things follow. First, regressions on time-series data need to allow the past to influence the present. We do that with dynamic specifications (AR, DL, ARDL) that put lagged variables on the right-hand side. Second, even after specifying the right dynamics in our model, the errors may still be correlated across time. When they are, OLS coefficient estimates remain unbiased (often) but the standard errors are wrong, and every inference built on them is unreliable. Both problems live in this chapter.

Serial correlation (also called autocorrelation) means \(\operatorname{Cov}(e_t, e_{t-k}) \neq 0\) for some lag \(k \geq 1\). The textbook regression assumption that errors are uncorrelated across observations fails.

3.2 Two Examples to Keep in Mind

Quarterly inflation. The CPI inflation rate this quarter is close to last quarter’s. A 4% reading is followed by another 4% (or 3.8%, or 4.2%) far more often than by a wild swing. Supply shocks (oil prices, supply chains) tend to persist for several quarters before fading. So inflation is well-described by an autoregressive process: \(\pi_t \approx \alpha + \phi \pi_{t-1} + e_t\), with \(\phi\) around 0.5-0.9 for most post-war US data.

Advertising and sales. Suppose a company runs ads in month \(t\). Some sales happen immediately because shoppers see the ad and click. Some happen next month because shoppers needed time to consider. Some happen the month after that because brand awareness builds. The effect of advertising spending \(\text{ads}_t\) on sales \(y_t\) is spread across multiple months: \(y_t \approx \alpha + \beta_0 \text{ads}_t + \beta_1 \text{ads}_{t-1} + \beta_2 \text{ads}_{t-2} + e_t\). That is a distributed-lag specification.

These two examples cover the two sources of dynamics. Inflation has momentum (its own past predicts its present). Sales have delayed response (the past of an input predicts the present of the output). Many real series have both, and we model both with the ARDL framework.

3.3 Stationarity

Before any regression can be run on time-series data, the data needs to behave “nicely” over time. The technical property is covariance stationarity:

The mean is constant: \(\operatorname{E}[Y_t] = \mu\) for all \(t\).
The variance is constant: \(\operatorname{Var}(Y_t) = \sigma^2\) for all \(t\).
The covariance between \(Y_t\) and \(Y_{t-k}\) depends only on the lag \(k\), not on \(t\) itself.

The first two say the series doesn’t drift. The third says the dependence structure is the same at the beginning of the sample as at the end.

Why do we care? If the mean or variance is shifting over time, then statistics computed from one stretch of the sample don’t apply to another. A coefficient estimated on data from 1980-2000 wouldn’t generalize to 2000-2020, because the underlying process is moving under our feet. Stationarity is the assumption that lets us treat \(T\) observations from a single time series as informative about a single underlying process.

Think: Is the price of a stock stationary?

Typically no. Stock prices drift upward over decades (mean is not constant) and their level today depends on accumulated returns since IPO (variance grows over time). What is often approximately stationary is the return, the percentage change. That is why financial economists almost always work with returns rather than levels: they have transformed a non-stationary series into an approximately stationary one.

The canonical non-stationary process is the random walk:

\[ Y_t = Y_{t-1} + v_t, \qquad v_t \sim \text{i.i.d.}(0, \sigma_v^2). \tag{3.1}\]

Substitute recursively and you get \(Y_t = Y_0 + \sum_{s=1}^{t} v_s\). The variance grows linearly with \(t\): \(\operatorname{Var}(Y_t \mid Y_0) = t \sigma_v^2\). So the random walk has a “memory” that never fades: every shock is permanent. Stock prices and GDP levels behave roughly like random walks; that is one reason we transform them (to returns or growth rates) before modeling.

Compare to the stationary AR(1) process \(Y_t = \phi Y_{t-1} + v_t\) with \(|\phi| < 1\). Here shocks decay geometrically: a unit shock at time \(t\) contributes \(\phi^k\) to \(Y_{t+k}\) and fades to zero. The series is mean-reverting; it doesn’t wander off.
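A minimal simulation sketch of this contrast, assuming standard normal shocks, \(Y_0 = 0\), and an illustrative \(\phi = 0.8\) (none of these values come from the text). Across many simulated paths, the cross-path variance of the random walk should grow roughly like \(t \sigma_v^2\), while the AR(1) variance should settle near \(\sigma_v^2 / (1 - \phi^2)\).

```r
# Compare variance growth: random walk vs stationary AR(1).
# Assumptions (illustrative only): N(0,1) shocks, phi = 0.8, Y_0 = 0.
set.seed(1)
n_paths <- 2000
n_steps <- 100
phi <- 0.8

# One column per simulated path, one row per time period.
shocks <- matrix(rnorm(n_paths * n_steps), nrow = n_steps)

# Random walk: cumulative sum of the shocks along each path.
rw <- apply(shocks, 2, cumsum)

# AR(1): Y_t = phi * Y_{t-1} + v_t, built with a recursive filter.
ar1 <- apply(shocks, 2, function(v) {
  as.numeric(stats::filter(v, filter = phi, method = "recursive"))
})

# Cross-path variance at each date t.
var_rw <- apply(rw, 1, var)
var_ar1 <- apply(ar1, 1, var)

round(c(rw_t10 = var_rw[10], rw_t100 = var_rw[100],
        ar1_t100 = var_ar1[100], ar1_limit = 1 / (1 - phi^2)), 2)
```

The same shocks feed both processes, so any difference in the variance paths comes from the dynamics alone.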

3.4 Two Sources of Dynamics

With time-indexed data, today’s outcome can depend on (1) its own past, (2) the past of explanatory variables, or both. The three workhorse specifications are AR, DL, and ARDL.

3.4.1 Autoregressive (AR) Models

An AR(\(p\)) model regresses \(y_t\) on its own past values:

\[ y_t = \alpha + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + u_t, \tag{3.2}\]

with \(u_t \sim \text{i.i.d.}(0, \sigma_u^2)\). The intuition is momentum: the system carries information forward from one period to the next. A shock today still moves \(y\) tomorrow because tomorrow’s \(y\) is built on today’s.

Concrete examples to anchor the idea:

GPA. Your GPA this quarter depends heavily on your GPA last quarter. The same study habits, the same skill stack, the same friends: everything that drove last quarter’s grade is mostly still in place this quarter.
Mood. How happy you are this hour is mostly how happy you were last hour. Emotional inertia produces hour-to-hour autocorrelation in self-reported wellbeing.
Crime rates. Crime in a neighborhood this year is correlated with crime last year through retaliation cycles, network effects, and slow-moving policing capacity.

The AR(1), with just one lag, often captures much of the persistence in macro and financial series. Adding more lags gives you AR(2), AR(3), …, letting the dynamics curve and oscillate rather than just decay. Stationarity for an AR(\(p\)) requires that the roots of \(1 - \phi_1 z - \cdots - \phi_p z^p = 0\) lie outside the unit circle. For the AR(1) this reduces to \(|\phi_1| < 1\).
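The root condition can be checked numerically. The sketch below uses base R’s `polyroot()`; the helper name `ar_is_stationary` and the coefficient values are hypothetical, chosen only to illustrate one stationary and one non-stationary case.

```r
# Check the AR(p) stationarity condition: all roots of
# 1 - phi_1 z - ... - phi_p z^p = 0 must lie outside the unit circle.
ar_is_stationary <- function(phi) {
  roots <- polyroot(c(1, -phi))   # coefficients in increasing powers of z
  all(Mod(roots) > 1)
}

ar_is_stationary(0.5)          # AR(1) with |phi_1| < 1
ar_is_stationary(c(0.5, 0.3))  # AR(2), both roots outside the unit circle
ar_is_stationary(c(0.5, 0.6))  # AR(2), one root inside the unit circle
```

For the AR(1) case the single root is \(1/\phi_1\), so the check reproduces \(|\phi_1| < 1\).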

3.4.2 Distributed Lag (DL) Models

A DL(\(q\)) model regresses \(y_t\) on contemporaneous and lagged values of an explanatory variable \(x\):

\[ y_t = \alpha + \beta_0 x_t + \beta_1 x_{t-1} + \cdots + \beta_q x_{t-q} + u_t. \tag{3.3}\]

The intuition is accumulation: an action today produces effects over multiple periods. Examples:

Advertising. Ad spend this month boosts sales this month (people click) and over the next several months (brand awareness builds).
Air pollution. A spike in pollution today raises respiratory hospital admissions over the next several weeks.
Fertilizer. Fertilizer applied to a field at planting affects crop yield months later at harvest.

The sequence \(\beta_0, \beta_1, \ldots, \beta_q\) traces out the lag distribution: how a one-unit change in \(x\) propagates into \(y\) at successively longer horizons. The sum \(\sum_{s=0}^q \beta_s\) is the long-run multiplier, the cumulative effect of a permanent unit change in \(x\).

3.4.3 ARDL Models: The Combination

An ARDL(\(p, q\)) model has both kinds of dynamics:

\[ y_t = \alpha + \sum_{j=1}^p \phi_j y_{t-j} + \sum_{s=0}^q \beta_s x_{t-s} + u_t. \tag{3.4}\]

Why combine them? AR captures momentum in \(y\) but doesn’t let an external driver matter. DL captures the effect of \(x\) but ignores \(y\)’s own inertia. ARDL captures both, and it usually does so with fewer total lags of either: the autoregressive structure absorbs the long tail of \(x\)’s effect.

The ARDL is the dominant specification in applied time-series work. Nearly every modern macro paper that estimates a dynamic equation uses some flavor of ARDL.

3.4.4 Multipliers in an ARDL

With an ARDL, the response of \(y\) to a change in \(x\) unfolds across periods. Three useful quantities, illustrated on the ARDL(1,1) Phillips curve

\[ \pi_t = \alpha + \phi_1 \pi_{t-1} + \beta_0 \text{DU}_t + \beta_1 \text{DU}_{t-1} + u_t, \tag{3.5}\]

where DU is the change in unemployment:

Multiplier	Formula	Meaning
Impact	\(\beta_0\)	One-unit rise in DU now \(\implies\) change in \(\pi\) this period
Interim (1 period)	\(\beta_0 + (\beta_1 + \phi_1 \beta_0)\)	Cumulative change after one more period
Long-run	\(\dfrac{\beta_0 + \beta_1}{1 - \phi_1}\)	Total change once all dynamics finish playing out

The interim multiplier includes the term \(\phi_1 \beta_0\) because the impact response is carried into the next period through \(\pi_{t-1}\). The long-run multiplier divides by \((1 - \phi_1)\) because \(\phi_1 \pi_{t-1}\) keeps pulling the previous period’s response forward. If \(\phi_1 = 0.7\), then 70% of last period’s effect carries into this period, and the geometric series sums to \(1/(1-0.7) \approx 3.33\).

Think: With \(\hat\phi_1 = 0.56\), \(\hat\beta_0 = -0.69\), \(\hat\beta_1 = 0.32\), compute the three multipliers.

Impact: \(\hat\beta_0 = -0.69\). A one-point rise in \(\Delta\) unemployment today reduces inflation by 0.69 points immediately.
Interim: \(\hat\beta_0 + (\hat\beta_1 + \hat\phi_1 \hat\beta_0) = -0.69 + (0.32 + 0.56 \times (-0.69)) = -0.69 - 0.066 \approx -0.76\). Cumulative effect after one more period.
Long-run: \(\dfrac{-0.69 + 0.32}{1 - 0.56} = \dfrac{-0.37}{0.44} \approx -0.84\). Total cumulative effect once dynamics finish.

The long-run effect is larger in absolute value than the interim because the autoregressive coefficient \(\phi_1 = 0.56\) keeps propagating past responses forward.
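The multipliers can also be traced out by recursion rather than by formula. In an ARDL(1,1), the period-by-period responses to a sustained one-unit change in \(x\) are \(\beta_0\) on impact, \(\beta_1 + \phi_1 \beta_0\) one period later, and \(\phi_1\) times the previous response after that; cumulating them gives the interim multipliers, which converge to \((\beta_0 + \beta_1)/(1 - \phi_1)\). The sketch below is one way to compute this; the function name `ardl11_multipliers` is hypothetical.

```r
# Dynamic multipliers for an ARDL(1,1):
#   y_t = alpha + phi1 * y_{t-1} + beta0 * x_t + beta1 * x_{t-1} + u_t
ardl11_multipliers <- function(phi1, beta0, beta1, horizon = 50) {
  stopifnot(abs(phi1) < 1, horizon >= 2)
  delay <- numeric(horizon + 1)        # delay[h + 1] = response h periods later
  delay[1] <- beta0                    # impact response
  delay[2] <- beta1 + phi1 * beta0     # one period later
  for (h in 3:(horizon + 1)) {
    delay[h] <- phi1 * delay[h - 1]    # geometric decay afterwards
  }
  list(delay = delay,
       cumulative = cumsum(delay),     # interim multipliers
       long_run = (beta0 + beta1) / (1 - phi1))
}

# Estimates from the Think exercise above.
m <- ardl11_multipliers(phi1 = 0.56, beta0 = -0.69, beta1 = 0.32)
round(c(impact = m$cumulative[1],
        interim_1 = m$cumulative[2],
        long_run = m$long_run), 3)
```

Comparing the last element of `m$cumulative` with `m$long_run` is a quick check that the recursion and the closed-form expression agree.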

3.4.5 Model Selection: AIC and BIC

To pick the lag orders \(p\) and \(q\), fit several candidate models and compare information criteria:

\[ \text{AIC} = \ln(\hat\sigma^2) + \frac{2K}{T}, \qquad \text{BIC} = \ln(\hat\sigma^2) + \frac{K \ln T}{T}, \tag{3.6}\]

where \(K\) is the number of estimated parameters and \(T\) is the sample size. Both reward fit (lower \(\hat\sigma^2\)) and penalize complexity. BIC’s penalty is heavier than AIC’s whenever \(\ln T > 2\) (that is, \(T \geq 8\)), and the gap widens as \(T\) grows; BIC therefore tends to pick simpler models, while AIC tends to pick larger ones. Fit all candidates on the same estimation sample so the criteria are comparable. When the two disagree, the right call depends on whether you care more about parsimony (BIC) or forecast performance (AIC). Many practitioners compute and report both.
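A minimal sketch of lag selection with the per-observation criteria in (3.6), using base R only. The data are simulated from an AR(1) with coefficient 0.6 purely for illustration, and all object names are hypothetical. Every candidate is fitted on the same rows so that the values of \(\hat\sigma^2\) are comparable.

```r
# Compare AR(1)..AR(4) with the AIC and BIC formulas of (3.6).
# Assumption: simulated AR(1) data, coefficient 0.6 (illustrative only).
set.seed(42)
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = 200))

max_p <- 4
# embed() lines up y_t (column 1) with y_{t-1}, ..., y_{t-max_p} (columns 2+),
# so every candidate model uses the same estimation sample.
lagged <- embed(y, max_p + 1)
n_obs <- nrow(lagged)

ic <- t(sapply(1:max_p, function(p) {
  fit <- lm(lagged[, 1] ~ lagged[, 2:(p + 1), drop = FALSE])
  k <- p + 1                              # intercept plus p lag coefficients
  sigma2 <- sum(resid(fit)^2) / n_obs     # residual variance estimate
  c(p = p,
    AIC = log(sigma2) + 2 * k / n_obs,
    BIC = log(sigma2) + k * log(n_obs) / n_obs)
}))
ic
```

Note that software often reports AIC and BIC on a different scale (based on \(-2\) times the log-likelihood); the ranking of models is what matters.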

3.5 When Errors Stay Autocorrelated

Even with the right dynamics in \(y\) and \(x\), the errors \(u_t\) may still be correlated across time. Reasons:

A lag of \(y\) or \(x\) that you didn’t include leaks into \(u_t\) and inherits its persistence.
Unobserved shocks (sentiment, weather, policy uncertainty) move slowly across periods.
Functional-form misspecification produces fit error that’s correlated through time.

Two parsimonious models for the error process show up everywhere:

\[ \textbf{AR(1) errors:} \quad e_t = \rho e_{t-1} + u_t, \qquad u_t \sim \text{i.i.d.}(0, \sigma_u^2), \quad |\rho| < 1. \tag{3.7}\]

\[ \textbf{MA(1) errors:} \quad e_t = u_t + \theta u_{t-1}. \tag{3.8}\]

The AR(1) error has the same shape as the AR(1) model above, just for the error term instead of for \(y\). The covariance decays geometrically: \(\operatorname{Cov}(e_t, e_{t-k}) = \rho^k \operatorname{Var}(e_t)\). A shock to the error has a long, fading tail.

The MA(1) error is different: \(e_t\) is a weighted sum of today’s and yesterday’s white-noise shocks. A shock \(u_{t-1}\) affects \(e_{t-1}\) and \(e_t\) and then disappears completely. Covariance is nonzero only at lag 1.

AR vs MA at a glance: AR errors have a long, geometrically decaying ACF. MA errors have an ACF that cuts off sharply after the lag order (\(k = q\) for MA(\(q\))).

Most applied work focuses on AR(1) errors because the long, slow tail of persistence in macro data fits the AR pattern better. We will follow the same convention; the tests we cover detect MA errors too, but we develop intuition with AR(1) in mind.
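The two ACF shapes can be compared directly with the theoretical autocorrelations from base R’s `ARMAacf()`. The coefficient 0.7 for both processes is an arbitrary illustrative choice.

```r
# Theoretical ACFs at lags 0..5 for AR(1) and MA(1) errors.
# Assumption: rho = theta = 0.7, chosen only for illustration.
acf_ar1 <- ARMAacf(ar = 0.7, lag.max = 5)
acf_ma1 <- ARMAacf(ma = 0.7, lag.max = 5)

# AR(1): rho^k at lag k. MA(1): theta / (1 + theta^2) at lag 1, zero afterwards.
round(rbind(AR1 = acf_ar1, MA1 = acf_ma1), 3)
```

The AR(1) row decays geometrically, while the MA(1) row is nonzero only at lag 1, which is the pattern to look for in a residual ACF plot.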

3.6 Consequences of Autocorrelated Errors

3.6.1 The Good News: OLS Coefficients Are Still Fine

If the regressors are strictly exogenous (no lagged dependent variable, no feedback from \(y\) to \(x\)), OLS remains unbiased and consistent even when the errors are serially correlated. The argument mirrors the heteroskedasticity case: write the slope formula,

\[ \hat\beta_1 = \beta_1 + \frac{\sum_t (x_t - \bar x)\, e_t}{\sum_t (x_t - \bar x)^2}, \tag{3.9}\]

take the expectation, and use \(\operatorname{E}[e_t \mid x] = 0\). Whether the \(e_t\)’s correlate with each other does not enter the calculation. So the point estimate is fine.

3.6.2 The Bad News: Standard Errors Are Wrong

The damage is in the standard error formula. Under the textbook regression model the OLS variance is

\[ \widehat{\operatorname{Var}}_{\text{OLS}}(\hat\beta_1) = \frac{\hat\sigma^2}{\sum_t (x_t - \bar x)^2}, \qquad \hat\sigma^2 = \frac{\sum_t \hat e_t^2}{T - k}. \tag{3.10}\]

That formula keeps only the diagonal terms of the variance of the numerator in (3.9), \(\sum_t (x_t - \bar x)\, e_t\). The true variance under autocorrelated errors has both diagonal and off-diagonal pieces:

\[ \operatorname{Var}\!\left(\sum_t (x_t - \bar x)\, e_t\right) = \underbrace{\sum_t (x_t - \bar x)^2 \operatorname{Var}(e_t)}_{\text{diagonal: usual term}} + \underbrace{2 \sum_{t < s} (x_t - \bar x)(x_s - \bar x)\, \operatorname{Cov}(e_t, e_s)}_{\text{off-diagonal: missed by OLS}}. \tag{3.11}\]

The OLS formula keeps only the diagonal. When the errors are autocorrelated, the off-diagonal terms are nonzero and the OLS SE is inconsistent for the true sampling variance of \(\hat\beta_1\).

Which way is the OLS SE wrong?

For the typical macro / micro time series, autocorrelation is positive (\(\rho > 0\)) and the regressor is persistent (so \((x_t - \bar x)(x_s - \bar x)\) has the same sign at small lags). The off-diagonal terms are then positive on net, and the true SE is larger than the OLS formula reports. OLS overstates precision, \(t\)-statistics are too large, and we over-reject the null. Confidence intervals are too narrow.

The reverse direction (negative autocorrelation, OLS SE too large) is possible but rare.
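A small Monte Carlo sketch of the typical case, under assumed values that are not from the text: both the regressor and the error follow independent AR(1) processes with coefficient 0.8, \(T = 100\), and the true slope is 0.5. The standard deviation of \(\hat\beta_1\) across replications approximates the true sampling SE, which can then be compared with the average SE that the OLS formula reports.

```r
# Monte Carlo: OLS SEs understate uncertainty when x and e are both persistent.
# Assumptions (illustrative): AR(1) coefficient 0.8 for x and e, T = 100.
set.seed(123)
n_reps <- 500
n_obs <- 100
rho <- 0.8

beta_hat <- numeric(n_reps)
se_ols <- numeric(n_reps)
for (r in seq_len(n_reps)) {
  x <- as.numeric(arima.sim(list(ar = rho), n = n_obs))
  e <- as.numeric(arima.sim(list(ar = rho), n = n_obs))
  y <- 1 + 0.5 * x + e
  est <- coef(summary(lm(y ~ x)))
  beta_hat[r] <- est["x", "Estimate"]
  se_ols[r] <- est["x", "Std. Error"]
}

# sd(beta_hat) approximates the true SE; mean(se_ols) is what OLS reports.
c(mean_beta = mean(beta_hat), true_sd = sd(beta_hat), mean_ols_se = mean(se_ols))
```

The average point estimate should sit near 0.5 (regressors here are strictly exogenous), while the reported SE should fall well short of the simulated one.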

3.6.3 The Special Case: Lagged Dependent Variables

There is one case where OLS goes from “unbiased but with wrong SE” to “biased and inconsistent.” It happens when the model includes a lag of \(y\) as a regressor and the errors are serially correlated. Concretely:

\[ y_t = \beta_0 + \beta_1 y_{t-1} + \cdots + e_t, \qquad e_t = \rho e_{t-1} + u_t. \]

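Then \(y_{t-1}\) contains \(e_{t-1}\) (from the previous period’s equation), and \(e_t\) depends on \(e_{t-1}\) through the AR(1) structure. So \(\operatorname{Cov}(y_{t-1}, e_t) \neq 0\): the regressor is correlated with the contemporaneous error, and OLS is biased and inconsistent. (A lagged dependent variable already rules out strict exogeneity; serial correlation is what breaks consistency as well.) Correcting standard errors does not fix this. You have to model the error dynamics directly (GLS with AR(1) errors), add lags until the errors are no longer serially correlated, or use instruments.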

This is a real problem for ARDL models with autocorrelated errors. It is also why the Durbin-Watson test breaks down in those models: DW assumes strict exogeneity.

3.7 Visual Detection

Three plots cover most of what you need to spot serial correlation before running a formal test.

1. Time plot of the residuals. Plot \(\hat e_t\) against \(t\). With i.i.d. errors the sign flips frequently and there is no visible pattern. With positively autocorrelated errors the series wanders: long runs of positive residuals followed by long runs of negative residuals.

2. Lag-1 scatter. Plot \(\hat e_t\) on the vertical axis and \(\hat e_{t-1}\) on the horizontal. A cloud centered on the origin with no tilt indicates no autocorrelation. An upward tilt means today’s residual is predictable from yesterday’s, and the slope of a line fitted through the cloud is an estimate of the AR(1) coefficient \(\rho\).

3. Sample autocorrelation function (ACF). Define the sample autocorrelation at lag \(k\) as

\[ r_k = \frac{\sum_{t=k+1}^{T} (\hat e_t - \bar{\hat e})(\hat e_{t-k} - \bar{\hat e})}{\sum_{t=1}^{T} (\hat e_t - \bar{\hat e})^2}. \tag{3.12}\]

Under the null of i.i.d. residuals and large \(T\), \(r_k\) is approximately distributed \(\mathcal{N}(0, 1/T)\), so \(r_k\) is “significant” at the 5% level if \(|r_k| > 1.96/\sqrt{T}\). The ACF plot shows \(r_k\) as vertical bars at \(k = 0, 1, 2, \ldots\) with horizontal dashed lines at \(\pm 1.96/\sqrt T\).

How to read it:

Bar at lag 1 alone outside the band \(\implies\) probable AR(1) or MA(1).
Bars at lags 1 and 2 outside, geometrically decaying \(\implies\) probable AR(2) or AR(1) with larger \(\rho\).
Bar at lag 12 (monthly) or lag 4 (quarterly) outside \(\implies\) seasonal autocorrelation.
Sharp cutoff after lag \(q\) with no decay \(\implies\) MA(\(q\)) error.

The ACF is the single most useful diagnostic for identifying which dynamic structure is present.
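The sample autocorrelation in (3.12) is short enough to compute by hand. The sketch below does so and flags lags outside the \(\pm 1.96/\sqrt{T}\) band; the series `e_hat` is a simulated stand-in for regression residuals (an assumption for illustration), and `sample_acf` is a hypothetical helper name.

```r
# Sample autocorrelation at lag k, following (3.12).
sample_acf <- function(e, k) {
  n <- length(e)
  d <- e - mean(e)
  sum(d[(k + 1):n] * d[1:(n - k)]) / sum(d^2)
}

# Stand-in for residuals: simulated AR(1) with coefficient 0.6 (illustrative).
set.seed(7)
e_hat <- as.numeric(arima.sim(list(ar = 0.6), n = 150))

lags <- 1:8
r <- sapply(lags, function(k) sample_acf(e_hat, k))
band <- 1.96 / sqrt(length(e_hat))

data.frame(lag = lags, r_k = round(r, 3), outside_band = abs(r) > band)
```

Base R’s `acf(e_hat)` computes the same quantities and draws the plot with the band.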

3.8 Formal Tests

Plots are suggestive. To put a number in a paper you need a formal test. Two are standard.

3.8.1 Durbin-Watson Test

The Durbin-Watson statistic is the classical AR(1) test:

\[ \text{DW} = \frac{\sum_{t=2}^{T} (\hat e_t - \hat e_{t-1})^2}{\sum_{t=1}^{T} \hat e_t^2} \approx 2(1 - \hat\rho). \tag{3.13}\]

The approximation makes the intuition clear. With no autocorrelation, \(\hat\rho \approx 0\), so DW \(\approx 2\). With strong positive autocorrelation, \(\hat\rho \to 1\) and DW \(\to 0\). With strong negative autocorrelation, \(\hat\rho \to -1\) and DW \(\to 4\).
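A short sketch that computes DW from OLS residuals and compares it with \(2(1 - \hat\rho)\). The data-generating process (AR(1) errors with coefficient 0.5, \(T = 200\)) is assumed for illustration only.

```r
# Durbin-Watson statistic by hand, following (3.13).
# Assumption: simulated regression with AR(1) errors, rho = 0.5.
set.seed(11)
n_obs <- 200
x <- rnorm(n_obs)
e <- as.numeric(arima.sim(list(ar = 0.5), n = n_obs))
y <- 1 + 0.5 * x + e

e_hat <- unname(resid(lm(y ~ x)))

dw <- sum(diff(e_hat)^2) / sum(e_hat^2)
rho_hat <- sum(e_hat[-1] * e_hat[-n_obs]) / sum(e_hat^2)

c(DW = dw, two_times_one_minus_rho = 2 * (1 - rho_hat))
```

The two numbers differ only by the end-point term \((\hat e_1^2 + \hat e_T^2) / \sum_t \hat e_t^2\), which shrinks as \(T\) grows.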

The catch with Durbin-Watson is that the exact distribution of DW depends on the particular values of the regressors in your sample. Durbin and Watson tabulated two critical values, \(d_L\) (lower) and \(d_U\) (upper), which bracket the worst-case and best-case designs. The decision rule for the lower-tail test (against \(\rho > 0\)):

Region	Decision
\(\text{DW} < d_L\)	Reject \(H_0\): positive autocorrelation
\(d_L \leq \text{DW} \leq d_U\)	Inconclusive
\(\text{DW} > d_U\)	Do not reject \(H_0\)

The mirror image works for negative autocorrelation: reject \(H_0\) if \(\text{DW} > 4 - d_L\), do not reject if \(\text{DW} < 4 - d_U\), and treat values in between as inconclusive.

The inconclusive zone is the price of avoiding a regressor-dependent critical value. A real fraction of samples land in the gap and leave the researcher without a decision.

Two important limitations of DW:

It only has power against AR(1). MA(1) is sometimes detected; AR(2) and seasonal patterns often slip past.
It assumes strict exogeneity of the regressors. With a lagged dependent variable on the right-hand side, DW is not valid (the statistic is biased toward 2 even when errors are autocorrelated). This is exactly the modeling situation where you most want to test for autocorrelation, so the limitation is severe.

3.8.2 Breusch-Godfrey Test

The Breusch-Godfrey (BG) test is the workhorse modern alternative. It is more flexible than DW (tests any lag order \(p\), not just AR(1)), and it works with lagged dependent variables.

The procedure.

Run OLS on the original model. Save residuals \(\hat e_t\).
Run an auxiliary regression of \(\hat e_t\) on the original regressors and on lagged residuals \(\hat e_{t-1}, \ldots, \hat e_{t-p}\):

\[ \hat e_t = \gamma_0 + \gamma_1 x_{1t} + \cdots + \gamma_k x_{kt} + \rho_1 \hat e_{t-1} + \cdots + \rho_p \hat e_{t-p} + v_t. \tag{3.14}\]

Compute \(R^2_{\text{aux}}\) from the auxiliary regression.
The test statistic is \(\text{BG} = T \cdot R^2_{\text{aux}}\), which is asymptotically distributed \(\chi^2(p)\) under \(H_0: \rho_1 = \cdots = \rho_p = 0\).

Don’t drop the original regressors from the auxiliary

A common student error is to regress \(\hat e_t\) on lagged residuals alone. The BG procedure requires the original regressors in the auxiliary regression. That is what makes it valid in the presence of lagged dependent variables: the LDV is “controlled for” in the auxiliary, so the lagged-residual coefficients pick up genuine autocorrelation rather than the LDV-induced bias.

When BG works and when it fails.

AR(1) errors: BG with \(p = 1\) catches them cleanly.
MA(1) errors: BG also detects them (with \(p\) at least 1). MA structure puts the autocorrelation at low lags, which BG checks.
AR(2) and higher: works if you choose \(p\) large enough. Picking too small a \(p\) misses the structure.
Seasonal autocorrelation: works if you set \(p\) to include the seasonal lag (e.g., \(p \geq 12\) for monthly data).
Heteroskedasticity-only patterns (no actual serial correlation, just non-constant variance) do not show up; BG isn’t designed for that.

The practical lag-order choice: pick \(p\) large enough to cover the longest realistic autocorrelation but not so large that you burn degrees of freedom. For quarterly data, \(p = 4\) is a common default. For monthly data, \(p = 12\). Trust the residual ACF to guide the choice.
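The BG procedure can be sketched in a few lines of base R. This is a hand-rolled illustration, not a replacement for a packaged routine: the function name `bg_test` is hypothetical, and pre-sample lagged residuals are set to zero so that all \(T\) observations are kept (one common convention, assumed here). The example data use AR(1) errors with coefficient 0.7.

```r
# Breusch-Godfrey test by hand: T * R^2 from the auxiliary regression (3.14).
bg_test <- function(fit, order = 1) {
  e_hat <- unname(resid(fit))
  n_obs <- length(e_hat)
  X <- model.matrix(fit)    # original regressors, including the intercept

  # Lagged residuals; pre-sample values are set to 0 (an assumed convention).
  lagged_resid <- sapply(seq_len(order), function(j) {
    c(rep(0, j), e_hat[seq_len(n_obs - j)])
  })

  # Auxiliary regression keeps the original regressors alongside the lags.
  aux <- lm.fit(cbind(X, lagged_resid), e_hat)
  r2_aux <- 1 - sum(aux$residuals^2) / sum((e_hat - mean(e_hat))^2)

  stat <- n_obs * r2_aux
  list(statistic = stat, df = order,
       p_value = pchisq(stat, df = order, lower.tail = FALSE))
}

# Illustrative data with AR(1) errors (rho = 0.7 is an assumption).
set.seed(99)
n_obs <- 200
x <- rnorm(n_obs)
e <- as.numeric(arima.sim(list(ar = 0.7), n = n_obs))
y <- 1 + 0.5 * x + e

bg <- bg_test(lm(y ~ x), order = 2)
bg
```

In applied work a packaged implementation such as `lmtest::bgtest()` is the more usual choice; the sketch is only meant to make the steps concrete.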

3.9 Fixing Inference

A test rejected. Now what? Two routes, parallel to the heteroskedasticity chapter.

3.9.1 Newey-West HAC Standard Errors

The fastest fix is to keep OLS and replace the variance estimator with one that is heteroskedasticity- and autocorrelation-consistent (HAC). The most common is the Newey-West estimator. It generalizes the robust SE from the heteroskedasticity chapter to also handle off-diagonal terms (autocorrelation), with a kernel that tapers far-lag autocovariances to zero.

The long-run variance estimate has two pieces:

\[ \hat S_L = \underbrace{\sum_t (x_t - \bar x)^2 \hat e_t^2}_{\text{White-style sum}} + \underbrace{2 \sum_{k=1}^{L} w_{k,L} \sum_{t > k} (x_t - \bar x)(x_{t-k} - \bar x)\, \hat e_t \hat e_{t-k}}_{\text{autocorrelation correction up to lag } L}. \tag{3.15}\]

The first piece is exactly White’s robust SE numerator. The second piece is the new bit: a weighted sum of sample autocovariances \(\hat e_t \hat e_{t-k}\) at lags \(k = 1, \ldots, L\). The weights \(w_{k,L} = 1 - k/(L+1)\) (the Bartlett kernel) taper from near 1 at lag 1 to 0 at lag \(L+1\), so far-lag noisy covariances get downweighted.
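To make (3.15) concrete, here is a hand-rolled sketch for the slope in a simple regression, using Bartlett weights and no small-sample adjustment. The function name `nw_se_slope` and the simulated data (AR(1) coefficient 0.8 for both regressor and error) are assumptions for illustration; packaged estimators may differ in details such as prewhitening and degrees-of-freedom corrections.

```r
# Newey-West SE for the slope in y = b0 + b1 * x + e, following (3.15).
nw_se_slope <- function(x, y, L) {
  e_hat <- unname(resid(lm(y ~ x)))
  x_dev <- x - mean(x)
  n_obs <- length(x)
  z <- x_dev * e_hat               # score contributions (x_t - xbar) * e_t

  S <- sum(z^2)                    # White-style (lag 0) piece
  for (k in seq_len(L)) {
    w <- 1 - k / (L + 1)           # Bartlett weight
    S <- S + 2 * w * sum(z[(k + 1):n_obs] * z[1:(n_obs - k)])
  }
  sqrt(S) / sum(x_dev^2)
}

# Illustrative data: persistent regressor and persistent errors.
set.seed(2024)
n_obs <- 300
x <- as.numeric(arima.sim(list(ar = 0.8), n = n_obs))
e <- as.numeric(arima.sim(list(ar = 0.8), n = n_obs))
y <- 1 + 0.5 * x + e

se_by_L <- sapply(c(0, 2, 4, 8), function(L) nw_se_slope(x, y, L))
names(se_by_L) <- paste0("L=", c(0, 2, 4, 8))
round(se_by_L, 4)
```

With \(L = 0\) the loop is skipped and the result is the heteroskedasticity-robust SE without any autocorrelation correction.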

The bandwidth \(L\). \(L\) trades bias against variance. Too small \(\implies\) you miss real autocovariances and the HAC SE is still inconsistent. Too large \(\implies\) you include too many noisy sample autocovariances and the estimator becomes unstable.

In practice, software picks a sensible default (typically \(L \approx 4\) near \(T = 100\), growing slowly with \(T\)). The right sanity check is to report HAC SEs for a small range of \(L\) values and see whether they are stable. If the SE plateaus across \(L \in \{2, 4, 8\}\), you’re fine. If it drifts upward as \(L\) grows, the autocorrelation extends further than the default captures, and you should increase \(L\).

Think: Why does the bandwidth need to grow with \(T\)?

The HAC estimator works by averaging sample autocovariances. The number of independent pieces of information available at lag \(k\) is roughly \(T - k\). At fixed \(L\) and growing \(T\), each lag’s estimate becomes more precise, so more lags become “trustworthy” enough to include. The Newey-West (1994) rule, \(L = \lfloor 4(T/100)^{2/9} \rfloor\), has \(L\) grow slowly to balance bias (need enough lags) and variance (each lag is noisy).
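The rule of thumb and the Bartlett weights are easy to tabulate. The helper names below are hypothetical.

```r
# Bandwidth rule from the text: L = floor(4 * (T / 100)^(2/9)).
nw_bandwidth <- function(n_obs) floor(4 * (n_obs / 100)^(2 / 9))

# Bartlett kernel weights w_{k,L} = 1 - k / (L + 1) for k = 1, ..., L.
bartlett_weights <- function(L) 1 - seq_len(L) / (L + 1)

sample_sizes <- c(50, 100, 500, 1000)
data.frame(T = sample_sizes, L = sapply(sample_sizes, nw_bandwidth))

bartlett_weights(4)
```

Going from \(T = 100\) to \(T = 1000\) raises \(L\) only from 4 to 6, which is what "growing slowly" means in practice.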

In R:


library(sandwich); library(lmtest)

model <- lm(y ~ x, data = ts_data)

# Default Newey-West with automatic bandwidth
coeftest(model, vcov. = NeweyWest(model))

# Explicit bandwidth L = 4
V <- NeweyWest(model, lag = 4, prewhite = FALSE)
coeftest(model, vcov. = V)

The point estimate doesn’t change. Only the standard error (and therefore the \(t\), \(p\), and CI) changes. The default reporting convention in modern macro/finance applied work is to publish HAC SEs side by side with OLS SEs.

3.9.2 Model the Autocorrelation

The alternative is to specify the error process and estimate by GLS. If you believe \(e_t = \rho e_{t-1} + u_t\), you can transform the data (\(y_t^* = y_t - \rho y_{t-1}\), \(x_t^* = x_t - \rho x_{t-1}\)) so that the transformed errors are white noise, then run OLS on the transformed data. Cochrane-Orcutt is an iterated version that estimates \(\rho\) from residuals, transforms, refits, re-estimates \(\rho\), and repeats until convergence.

Both inference and efficiency improve under GLS if the AR(1) assumption is correct. If it isn’t (the true process is MA or AR(2) or anything else), GLS may give worse estimates than OLS with HAC SEs. For that reason, applied work has shifted strongly toward HAC SEs as the default: they are robust to a wider class of error processes.
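A minimal sketch of the iterated Cochrane-Orcutt procedure for a single regressor, in base R. The function name `cochrane_orcutt`, the convergence tolerance, and the simulated data (AR(1) errors with \(\rho = 0.7\)) are all assumptions for illustration. The first observation is dropped by the transformation, as in the classic version of the method.

```r
# Iterated Cochrane-Orcutt for y = b0 + b1 * x + e, e_t = rho * e_{t-1} + u_t.
cochrane_orcutt <- function(x, y, tol = 1e-8, max_iter = 100) {
  n_obs <- length(y)
  beta <- unname(coef(lm(y ~ x)))   # start from plain OLS
  rho <- 0
  for (i in seq_len(max_iter)) {
    # Residuals in the original (untransformed) model.
    e <- y - beta[1] - beta[2] * x
    rho_new <- sum(e[-1] * e[-n_obs]) / sum(e[-n_obs]^2)

    # Quasi-difference the data and refit.
    y_star <- y[-1] - rho_new * y[-n_obs]
    x_star <- x[-1] - rho_new * x[-n_obs]
    fit <- lm(y_star ~ x_star)

    # The transformed intercept equals b0 * (1 - rho); undo that scaling.
    b <- unname(coef(fit))
    beta <- c(b[1] / (1 - rho_new), b[2])

    converged <- abs(rho_new - rho) < tol
    rho <- rho_new
    if (converged) break
  }
  list(rho = rho, intercept = beta[1], slope = beta[2], fit = fit)
}

# Illustrative data (true rho = 0.7, slope = 0.5).
set.seed(314)
n_obs <- 500
x <- rnorm(n_obs)
e <- as.numeric(arima.sim(list(ar = 0.7), n = n_obs))
y <- 1 + 0.5 * x + e

co <- cochrane_orcutt(x, y)
round(c(rho = co$rho, intercept = co$intercept, slope = co$slope), 3)
```

Standard errors for the slope can be read from `summary(co$fit)`, and they are only trustworthy if the AR(1) error assumption holds.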

3.10 Unit-Root Testing: Dickey-Fuller

Stationarity is a maintained assumption for everything above. How do you check it?

The most common test is the Augmented Dickey-Fuller (ADF) test. It tests whether an AR(1)-style process has a unit root (random walk, non-stationary) against the alternative that it is stationary. Start from

\[ y_t = \alpha + \rho y_{t-1} + e_t. \tag{3.16}\]

Subtract \(y_{t-1}\) from both sides:

\[ \Delta y_t = \alpha + (\rho - 1) y_{t-1} + e_t \;=\; \alpha + \gamma y_{t-1} + e_t, \tag{3.17}\]

with \(\gamma = \rho - 1\). The null is \(H_0: \gamma = 0\) (unit root, non-stationary) versus \(H_1: \gamma < 0\) (stationary). Estimate by OLS and compute the \(t\) statistic on \(\gamma\).

The catch is that under \(H_0\) the \(t\) statistic does not have a standard \(t\) distribution, because the regressor \(y_{t-1}\) is itself non-stationary. The critical values differ from (and are larger in magnitude than) those in the textbook \(t\) table: with an intercept, the asymptotic 5% critical value is about \(-2.86\) rather than \(-1.645\). Dickey and Fuller tabulated them. Most software reports the right critical value automatically.

The “Augmented” in ADF refers to adding lagged differences \(\Delta y_{t-1}, \Delta y_{t-2}, \ldots\) to soak up any short-run autocorrelation in \(e_t\):

\[ \Delta y_t = \alpha + \gamma y_{t-1} + \sum_{j=1}^{p} \delta_j \Delta y_{t-j} + e_t. \tag{3.18}\]

This makes the residuals white, so the \(\gamma\) inference is clean. Pick \(p\) via AIC or by checking that ADF residuals show no autocorrelation.
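The ADF regression (3.18) is an ordinary OLS regression; only the critical values are unusual. The sketch below computes the \(t\) statistic on \(\gamma\) by hand for a simulated random walk and a simulated stationary AR(1). The function name `adf_t` and both series are assumptions for illustration, and the statistic must be compared with Dickey-Fuller critical values, not the usual \(t\) table.

```r
# ADF t statistic on gamma in (3.18), with intercept and p lagged differences.
adf_t <- function(y, p = 1) {
  stopifnot(p >= 1)
  dy <- diff(y)
  # Column 1 is Delta y_t; columns 2..(p + 1) are Delta y_{t-1}, ..., Delta y_{t-p}.
  D <- embed(dy, p + 1)
  # y_{t-1}, aligned with the rows of D.
  y_lag <- y[(p + 1):(length(y) - 1)]
  fit <- lm(D[, 1] ~ y_lag + D[, -1, drop = FALSE])
  coef(summary(fit))["y_lag", "t value"]
}

set.seed(5)
random_walk <- cumsum(rnorm(300))                              # unit root
stationary <- as.numeric(arima.sim(list(ar = 0.3), n = 300))   # |rho| < 1

c(random_walk = adf_t(random_walk), stationary = adf_t(stationary))
```

A strongly negative statistic for the stationary series and a much less negative one for the random walk is the expected pattern.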

Failing to reject is not the same as confirming

ADF has notoriously low power against alternatives that are nearly unit-root (e.g., \(\rho = 0.95\)). A failure to reject \(H_0\) does not prove non-stationarity; it just means the data don’t refute it. Robust applied practice is to use ADF alongside other tools: visual inspection, plotting the ACF (does it decay fast or stay near 1?), and an alternative test like KPSS (which has the null reversed).

If ADF fails to reject the unit root, the standard fix is differencing: model \(\Delta y_t\) instead of \(y_t\). GDP levels are typically non-stationary; GDP growth (the first difference of log GDP) is typically stationary. Prices are non-stationary; returns are approximately stationary. This is why so much applied work happens in differences.

3.11 Forecasting

The cleanest use of a fitted time-series model is forecasting. Given an estimated AR(2), say

\[ \hat y_t = \hat\delta + \hat\theta_1 y_{t-1} + \hat\theta_2 y_{t-2}, \]

the one-step-ahead forecast at time \(T\) is

\[ \hat y_{T+1} = \hat\delta + \hat\theta_1 y_T + \hat\theta_2 y_{T-1}. \tag{3.19}\]

Plug in the most recent observed values, multiply by the estimated coefficients, add the intercept.

Two-step-ahead is recursive:

\[ \hat y_{T+2} = \hat\delta + \hat\theta_1 \hat y_{T+1} + \hat\theta_2 y_T. \]

Notice that \(\hat y_{T+1}\) replaces \(y_{T+1}\) since the latter isn’t observed yet. So forecast errors compound: the variance of the forecast error \(y_{T+j} - \hat y_{T+j}\) grows with \(j\).

Forecast intervals reflect that growing uncertainty:

\[ \hat y_{T+j} \pm t_c \cdot \hat\sigma_j, \tag{3.20}\]

where \(\hat\sigma_j\) increases with the horizon \(j\). One-period-ahead intervals are tight; long-horizon intervals can be very wide.

Think: An AR(2) model gives \(\hat\delta = 0.67\), \(\hat\theta_1 = 0.12\), \(\hat\theta_2 = -0.09\). If \(y_T = 0.8\) and \(y_{T-1} = -0.2\), what is \(\hat y_{T+1}\)?

\[ \hat y_{T+1} = 0.67 + 0.12(0.8) + (-0.09)(-0.2) = 0.67 + 0.096 + 0.018 = 0.784. \]
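The recursion generalizes to any horizon. A minimal sketch, with the hypothetical helper `ar2_forecast`, applied to the coefficients from the exercise above:

```r
# Recursive point forecasts from an estimated AR(2).
# Returns forecasts for T+1, ..., T+horizon.
ar2_forecast <- function(delta, theta1, theta2, y_T, y_Tm1, horizon = 2) {
  path <- c(y_Tm1, y_T)   # last two observed values, oldest first
  for (j in seq_len(horizon)) {
    n <- length(path)
    # Forecasts stand in for values that have not been observed yet.
    path <- c(path, delta + theta1 * path[n] + theta2 * path[n - 1])
  }
  tail(path, horizon)
}

ar2_forecast(delta = 0.67, theta1 = 0.12, theta2 = -0.09,
             y_T = 0.8, y_Tm1 = -0.2, horizon = 2)
```

The first element reproduces the hand calculation (0.784); the second plugs that forecast back in: \(0.67 + 0.12(0.784) - 0.09(0.8) = 0.69208\).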

3.11.1 In-Sample vs Out-of-Sample

A model that fits in-sample isn’t guaranteed to forecast well. The standard validation strategy is to hold out the last fraction of the sample, fit the model on the earlier portion, generate forecasts for the held-out period, and compute forecast error metrics (RMSE, MAE) against the actual values.

Two common pitfalls:

Overfitting in-sample. Adding lags never worsens the in-sample fit on a fixed estimation sample (a larger model can’t fit worse on the training data). AIC and BIC penalize this; out-of-sample evaluation penalizes it more honestly.
Look-ahead bias. If you use information that wouldn’t have been available at the forecast date (e.g., revised data instead of vintages, or fitting parameters using the whole sample), out-of-sample evaluation is contaminated. Real-time forecasting uses only what would have been known.
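A minimal holdout sketch in base R, assuming a simulated AR(1) series and an 80/20 split (both illustrative choices). The coefficients are estimated on the training portion only, which avoids the look-ahead problem of fitting on the whole sample; each one-step-ahead forecast then uses the most recent observed value.

```r
# Out-of-sample evaluation of an AR(1) with a fixed training window.
# Assumptions: simulated data, first 160 of 200 observations used for fitting.
set.seed(2026)
y <- 2 + as.numeric(arima.sim(list(ar = 0.6), n = 200))

n_train <- 160
train <- y[1:n_train]

# Fit y_t = delta + theta * y_{t-1} on the training sample only.
fit <- lm(train[-1] ~ train[-n_train])
coefs <- unname(coef(fit))

# One-step-ahead forecasts over the holdout period.
test_idx <- (n_train + 1):length(y)
pred <- coefs[1] + coefs[2] * y[test_idx - 1]
err <- y[test_idx] - pred

rmse <- sqrt(mean(err^2))
mae <- mean(abs(err))
c(RMSE = rmse, MAE = mae)
```

A stricter real-time design would re-estimate the coefficients at each forecast date using only data available then (an expanding or rolling window).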

3.12 Worked Example: Inflation Forecasting

Fit an AR(2) to quarterly inflation:

	Estimate
\(\hat\delta\)	0.452
\(\hat\theta_1\)	0.623
\(\hat\theta_2\)	0.214

With \(\pi_T = 2.5\) and \(\pi_{T-1} = 3.0\), the one-step-ahead forecast is

\[ \hat\pi_{T+1} = 0.452 + 0.623(2.5) + 0.214(3.0) = 0.452 + 1.558 + 0.642 = 2.65. \]

Compare models by information criterion.

Model	AIC	BIC
AR(1)	245.23	252.45
AR(2)	242.18	252.67
AR(3)	241.95	255.71
AR(4)	240.23	257.26

By BIC, AR(1) wins (lowest at 252.45). By AIC, AR(4) wins (lowest at 240.23). The two criteria disagree because AIC’s lighter penalty rewards the additional fit of bigger models, while BIC’s heavier penalty prefers parsimony. In macroeconomic applications, BIC is the more common choice; the AR(1) selection is supported by both the BIC and Occam’s razor.

Run the BG test. Take the residuals from the chosen AR(1), regress them on a constant, \(\pi_{t-1}\), and three lags of the residuals (\(p = 3\)), and compute \(\text{BG} = T \cdot R^2_{\text{aux}}\). Compare to \(\chi^2(3)\). If BG rejects, the AR(1) is missing dynamics: try AR(2) or add an MA term.

Report HAC SEs. Whether or not BG rejects, modern applied practice reports Newey-West SEs alongside the homoskedastic ones. Bandwidth \(L = 4\) (for \(T \approx 100\)) or whatever the default rule recommends.

3.13 What’s Next

Time-series methods extend in several directions covered later in the course:

Multivariate dynamics: when you have two or more time series with mutual feedback, the natural generalization is a Vector AutoRegression (VAR). We cover VAR and Granger causality in a separate slide deck linked from the course home page.
Panel data with time dimension: when the time dependence interacts with cross-sectional heterogeneity, see Dynamic Panels for the Nickell-bias problem and the Arellano-Bond and Blundell-Bond GMM estimators.
Simultaneous systems in time series: see Simultaneous Equations for systems where two endogenous variables determine each other.
Cross-sectional inference: Heteroskedasticity handles the parallel problem of non-constant variance.