Economics tells us that higher income should increase food spending, but it does not tell us by how much. Econometrics bridges this gap: it combines economic theory, data, and statistical methods to estimate magnitudes, test hypotheses, and make predictions. This chapter introduces the discipline through a running example and surveys the types of data economists work with.
2.1 Why Econometrics?
Suppose a household earns an extra $100 per week. Economic theory says food spending should go up (food is a normal good), but by how much? $5? $15? $40? Theory gives us the direction; econometrics gives us the number. That number is what policymakers, businesses, and researchers actually need.
Econometrics = economics + statistics. The “metrics” part (Greek metron, “measure”) refers to measuring economic relationships from data.
Econometrics uses economic theory, data, and statistical tools to do three things: estimate economic relationships (how much does \(x\) affect \(y\)?), predict economic outcomes (what will GDP be next quarter?), and test hypotheses (does a minimum wage increase reduce employment?). You already know statistics from a prerequisite course. Econometrics adds an economic model to guide which statistics to compute and how to interpret them.
\(\implies\) Decision-makers across economics need magnitudes, not just directions. The Federal Reserve needs to know how much interest rates should change, a business owner needs to know how much revenue $1,000 in advertising generates, and a university needs to know how many students it will lose if tuition rises by $500. The values that answer these questions (elasticities, multipliers, marginal effects) are unknown parameters. Econometrics estimates them from data.
2.2 The Food Expenditure Example
Consider a dataset of 40 households from southern Australia, with weekly food expenditure and weekly income. A scatter plot shows a positive relationship: households with higher income tend to spend more on food. But there is a lot of spread. Two households earning $2,000 per week might spend very different amounts.
We specify a linear econometric model:
Definition 2.1 (Linear Econometric Model)
\[
y = \beta_1 + \beta_2 x + e \tag{2.1}
\]
where \(y\) is the dependent variable, \(x\) is the independent (explanatory) variable, \(\beta_1\) and \(\beta_2\) are unknown population parameters, and \(e\) is the random error term.
In this example, \(y\) is weekly food expenditure (in dollars), \(x\) is weekly income (in $100 units), \(\beta_1\) is the intercept, \(\beta_2\) is the slope (extra food spending per additional $100 of income), and \(e\) captures everything else that affects food spending, such as household size, dietary preferences, and location.
The model splits every observation into a systematic component (\(\beta_1 + \beta_2 x\), the predictable part from economic theory) and a random error (\(e\), the unpredictable part). The error explains why two households with the same income spend different amounts on food.
Using data from the 40 households, econometric methods produce estimates \(\hat{\beta}_1 \approx 83.42\) and \(\hat{\beta}_2 \approx 10.21\). The estimated slope means that each additional $100 in weekly income is associated with roughly $10.21 more food spending per week, on average.
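As a quick sketch of how these estimates are used, the fitted line can be evaluated at a given income and compared with what a household actually spends. The household here is hypothetical: a weekly income of $2,000 (\(x = 20\)) and observed food spending of $300 are assumed values chosen only for illustration, and the symbols \(\hat{y}\) (predicted spending) and \(\hat{e}\) (the gap between observed and predicted spending) are notation introduced for this example.

\[
\hat{y} = \hat{\beta}_1 + \hat{\beta}_2 x \approx 83.42 + 10.21 \times 20 = 287.62
\]
\[
\hat{e} = y - \hat{y} \approx 300 - 287.62 = 12.38
\]

Under these assumptions, the systematic part accounts for about $287.62 of this household's food spending, and the remaining $12.38 is attributed to factors the model does not observe.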
Interactive: food expenditure scatter plot
Use the slider below to see how the scatter plot and fitted regression line change as the sample size grows. Data are generated from \(y = 83 + 10x + \text{noise}\), matching the food expenditure example.
// Slider controlling the sample size
viewof n_obs = Inputs.range([10, 500], {value: 40, step: 1, label: "Sample size (N)"})

food_seed = 42

// Simulated households: y = 83 + 10x + noise, with noise standard deviation 80
food_data = {
  const rng = d3.randomLcg(food_seed);
  const normal = d3.randomNormal.source(rng)(0, 80);
  return Array.from({length: n_obs}, (_, i) => {
    const x = 5 + (35 * (i + rng())) / n_obs;
    const y = 83 + 10 * x + normal();
    return {income: x, food: y};
  });
}

// Intercept (b1) and slope (b2) of the fitted line
food_regression = {
  const n = food_data.length;
  const xbar = d3.mean(food_data, d => d.income);
  const ybar = d3.mean(food_data, d => d.food);
  const num = d3.sum(food_data, d => (d.income - xbar) * (d.food - ybar));
  const den = d3.sum(food_data, d => (d.income - xbar) ** 2);
  const b2 = num / den;
  const b1 = ybar - b2 * xbar;
  return {b1, b2};
}

// Scatter plot with the fitted line and a label showing the estimates
Plot.plot({
  width: 640,
  height: 400,
  x: {label: "Weekly income ($100s)", domain: [3, 42]},
  y: {label: "Weekly food expenditure ($)", domain: [0, 600]},
  marks: [
    Plot.dot(food_data, {x: "income", y: "food", fill: "#1E5A96", opacity: 0.6, r: 4}),
    Plot.line(
      [{income: 3, food: food_regression.b1 + food_regression.b2 * 3},
       {income: 42, food: food_regression.b1 + food_regression.b2 * 42}],
      {x: "income", y: "food", stroke: "#C41E3A", strokeWidth: 2.5}
    ),
    Plot.text(
      [`b₁ = ${food_regression.b1.toFixed(1)}, b₂ = ${food_regression.b2.toFixed(2)}`],
      {x: 35, y: 80, fill: "#C41E3A", fontSize: 13, fontWeight: "bold"}
    )
  ]
})
Figure 2.1: Food expenditure vs income. Move the slider to see how more data pins down the fitted line.
Notice how the fitted line stabilizes as \(N\) grows. With \(N = 10\), the line jumps around; by \(N = 200\), it settles near the true slope of 10. This is the law of large numbers at work.
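For readers without the interactive slider, the same idea can be sketched in plain JavaScript. This is a simplified stand-in for the figure's code, not a copy of it: it assumes incomes drawn uniformly between 5 and 40 (in $100 units), replaces the d3 generators with a small seeded generator (`mulberry32`) and a Box-Muller normal draw, and uses hypothetical helper names (`normalDraw`, `fitLine`, `simulateFit`). The data-generating process \(y = 83 + 10x + \text{noise}\) with noise standard deviation 80 is taken from the figure.

```javascript
// Seeded uniform generator (mulberry32), swapped in for d3.randomLcg
// so the sketch runs without any libraries.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// One normal draw with mean 0 and standard deviation sd (Box-Muller).
function normalDraw(rng, sd) {
  const u1 = 1 - rng(); // lies in (0, 1], so Math.log never sees 0
  const u2 = rng();
  return sd * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Intercept b1 and slope b2, using the same formulas as the figure's code.
function fitLine(xs, ys) {
  const n = xs.length;
  const xbar = xs.reduce((s, v) => s + v, 0) / n;
  const ybar = ys.reduce((s, v) => s + v, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - xbar) * (ys[i] - ybar);
    den += (xs[i] - xbar) ** 2;
  }
  const b2 = num / den;
  return { b1: ybar - b2 * xbar, b2 };
}

// Simulate n households from y = 83 + 10x + noise and fit a line.
function simulateFit(n, seed) {
  const rng = mulberry32(seed);
  const xs = [];
  const ys = [];
  for (let i = 0; i < n; i++) {
    const x = 5 + 35 * rng(); // assumed: income uniform on [5, 40)
    xs.push(x);
    ys.push(83 + 10 * x + normalDraw(rng, 80));
  }
  return fitLine(xs, ys);
}

// Compare fitted slopes across sample sizes.
const results = [10, 40, 200, 5000].map((n) => ({ n, ...simulateFit(n, 42) }));
for (const r of results) {
  console.log(`N = ${r.n}: b1 = ${r.b1.toFixed(1)}, b2 = ${r.b2.toFixed(2)}`);
}
```

Changing the seed should move the small-sample slopes noticeably while leaving the large-sample slope close to 10, which is the pattern the figure illustrates.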
2.3 Correlation Is Not Causation
Suppose we estimate \(\text{GRADE} = \beta_1 + \beta_2 \, \text{SKIP} + e\) and find \(\hat{\beta}_2 < 0\): students who skip more classes tend to get lower grades. Does skipping cause lower grades? Not necessarily. Students who skip may also work long hours, be less motivated, or face personal circumstances affecting both attendance and grades. These omitted variables sit in the error term \(e\) and are correlated with both SKIP and GRADE. The estimate \(\hat{\beta}_2\) captures the association between skipping and grades, not necessarily the causal effect of skipping alone.
Warning: Correlation does not establish causation
Observing that \(x\) and \(y\) move together does not prove that \(x\) causes \(y\). A third variable may drive both, or the direction of causation may be reversed.
\(\implies\) Distinguishing correlation from causation is a central challenge of econometrics. Much of this course builds tools to move from association toward causation: adding control variables (multiple regression), instrumental variables, and panel methods.
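The skipping example can be made concrete with a small simulation. Everything in it is an assumption made for illustration: an unobserved "motivation" variable drives both skipping and grades, SKIP is a stylized continuous index rather than a count of classes, and skipping is given no direct effect on grades at all. The helper names (`mulberry32`, `normalDraw`, `slope`, `residuals`) are hypothetical. The last step controls for motivation by removing its linear influence from both variables before regressing one on the other, which gives the same SKIP coefficient as a multiple regression that includes motivation.

```javascript
// Seeded uniform generator (mulberry32) for reproducible draws.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// One normal draw with mean 0 and standard deviation sd (Box-Muller).
function normalDraw(rng, sd) {
  const u1 = 1 - rng(); // lies in (0, 1], so Math.log never sees 0
  const u2 = rng();
  return sd * Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

const mean = (v) => v.reduce((s, x) => s + x, 0) / v.length;

// Slope from a simple regression of ys on xs.
function slope(xs, ys) {
  const xbar = mean(xs);
  const ybar = mean(ys);
  let num = 0;
  let den = 0;
  for (let i = 0; i < xs.length; i++) {
    num += (xs[i] - xbar) * (ys[i] - ybar);
    den += (xs[i] - xbar) ** 2;
  }
  return num / den;
}

// Residuals from a simple regression of ys on xs.
function residuals(ys, xs) {
  const b = slope(xs, ys);
  const a = mean(ys) - b * mean(xs);
  return ys.map((y, i) => y - a - b * xs[i]);
}

// Assumed data-generating process: motivation affects both variables,
// and SKIP does NOT enter the GRADE equation.
const rng = mulberry32(7);
const n = 5000;
const motivation = [];
const skip = [];
const grade = [];
for (let i = 0; i < n; i++) {
  const m = normalDraw(rng, 1);
  motivation.push(m);
  skip.push(5 - 2 * m + normalDraw(rng, 1));   // less motivated students skip more
  grade.push(70 + 8 * m + normalDraw(rng, 5)); // grades depend on motivation only
}

// Naive regression of GRADE on SKIP picks up the motivation channel.
const naiveSlope = slope(skip, grade);

// Controlling for motivation: partial it out of both variables first.
const controlledSlope = slope(residuals(skip, motivation), residuals(grade, motivation));

console.log(`GRADE on SKIP alone:        ${naiveSlope.toFixed(2)}`);
console.log(`GRADE on SKIP, controlling: ${controlledSlope.toFixed(2)}`);
```

In this setup the naive slope should come out clearly negative even though skipping has no causal effect by construction, while the controlled slope should be close to zero. Real data rarely allow this check, because the omitted variable is usually unobserved.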
2.4 Data Types
The way data are collected affects what conclusions we can draw.
Cross-section data are observations on many units at one point in time (for example, wages, education, and demographics for thousands of workers in a single month). We see variation across individuals but have no “before and after.”
Time-series data record the same variable at regular intervals over time (quarterly GDP, monthly unemployment, daily stock prices). Observations are typically not independent, because today’s value depends on yesterday’s.
Panel (longitudinal) data observe the same units over multiple time periods. This combines the strengths of cross-section and time-series data: variation across units and variation within units over time. Panel data are powerful for controlling unobserved differences between units.
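One way to see how the three data types relate is to look at their shapes. The sketch below uses a tiny made-up panel: the unit labels, years, and all numbers are placeholders invented for illustration, not real observations. A cross-section is a single-period slice of the panel, and a time series is a single-unit slice.

```javascript
// Hypothetical panel: 3 households observed in 3 years.
// All values are placeholders chosen only to show the data layout.
const panel = [
  { unit: "A", year: 2020, income: 20, food: 290 },
  { unit: "A", year: 2021, income: 21, food: 300 },
  { unit: "A", year: 2022, income: 23, food: 315 },
  { unit: "B", year: 2020, income: 12, food: 210 },
  { unit: "B", year: 2021, income: 12, food: 205 },
  { unit: "B", year: 2022, income: 14, food: 230 },
  { unit: "C", year: 2020, income: 30, food: 380 },
  { unit: "C", year: 2021, income: 31, food: 400 },
  { unit: "C", year: 2022, income: 29, food: 370 },
];

// Cross-section: many units, one point in time.
const crossSection = panel.filter((row) => row.year === 2020);

// Time series: one unit followed over time.
const timeSeries = panel.filter((row) => row.unit === "A");

// A balanced panel has one row per (unit, period) pair.
const unitCount = new Set(panel.map((row) => row.unit)).size;
const periodCount = new Set(panel.map((row) => row.year)).size;

console.log(`cross-section rows: ${crossSection.length}`);
console.log(`time-series rows:   ${timeSeries.length}`);
console.log(`panel rows:         ${panel.length} (${unitCount} units x ${periodCount} periods)`);
```

The panel carries both kinds of variation at once: differences between households A, B, and C within a year, and changes within each household across years.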
In practice: Most introductory econometrics courses (including this one) focus on cross-section data. Time series and panel methods appear in later courses.
2.5 The Research Pipeline
Econometric analysis follows a structured pipeline. The flowchart below summarizes the six stages:
flowchart TB
A["1. Economic Theory"] --> B["2. Econometric Model"]
B --> C["3. Data Collection"]
C --> D["4. Estimation"]
D --> E["5. Diagnostics"]
E --> F["6. Inference"]
E -->|"Assumptions<br/>violated?"| B
style A fill:#1E5A96,color:#fff
style B fill:#1E5A96,color:#fff
style C fill:#1E5A96,color:#fff
style D fill:#2E8B57,color:#fff
style E fill:#D4A84B,color:#fff
style F fill:#2E8B57,color:#fff
Figure 2.2: The econometric research pipeline: from theory to conclusions.
The six stages are: (1) economic theory identifies variables and hypothesized relationships, (2) an econometric model specifies a functional form and adds an error term, (3) data are collected, (4) estimation produces numerical estimates, (5) diagnostics check whether assumptions hold, and (6) inference draws conclusions, makes predictions, and tests hypotheses. If diagnostics reveal assumption violations, we return to step 2 and modify the model. This course walks through each step.
2.6 Practice
A researcher estimates \(\text{RENT} = 125.9 + 0.525 \cdot \text{PCTURBAN} + 1.521 \cdot \text{MDHOUSE} + e\) using state-level data. Identify the dependent variable, the independent variables, the systematic component, and the random error. What does \(\hat{\beta}_{\text{MDHOUSE}} = 1.521\) mean?
Solution
The dependent variable is RENT (median state rent). The independent variables are PCTURBAN (percent urban population) and MDHOUSE (median house value in $1,000 units). The systematic component is \(125.9 + 0.525 \cdot \text{PCTURBAN} + 1.521 \cdot \text{MDHOUSE}\). The random error \(e\) captures everything else affecting rent (local amenities, zoning, etc.). The coefficient 1.521 means: holding PCTURBAN constant, each additional $1,000 in median house value is associated with $1.52 higher median rent.
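The arithmetic behind that interpretation can be sketched for a larger change. The scenario is hypothetical: median house value rises by $10,000 (so \(\Delta \text{MDHOUSE} = 10\) in $1,000 units) while PCTURBAN is held fixed, and \(\Delta \widehat{\text{RENT}}\) is notation introduced here for the predicted change in rent.

\[
\Delta \widehat{\text{RENT}} = 1.521 \cdot \Delta \text{MDHOUSE} = 1.521 \times 10 = 15.21
\]

Under that assumption, predicted median rent would be about $15.21 higher.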