11  Hypothesis Testing

Is the Relationship Real, or Just Noise?

Inference
Hypothesis Testing
p-values
Author

Jake Anderson

Published

March 21, 2026

Modified

March 26, 2026

Abstract

A confidence interval tells us where a parameter plausibly lives. A hypothesis test asks a sharper question: is the data compatible with a specific claim, or should we reject it? This chapter develops the anatomy of a hypothesis test, walks through two-sided and one-sided examples using the food expenditure data, shows that three decision methods always agree, defines p-values precisely, and clarifies the tradeoff between Type I and Type II errors.

Note: Prerequisites

You should be comfortable with the \(t\)-distribution and confidence intervals from Chapter 9.

11.1 From Estimation to Testing

From the food expenditure regression (\(N = 40\), \(df = 38\)): \(b_2 = 10.21\), \(\operatorname{se}(b_2) = 2.09\), 95% CI \(= [5.97, 14.45]\). A confidence interval tells us where \(\beta_2\) plausibly lives. A hypothesis test asks a yes/no question: is the data compatible with a specific claim about \(\beta_2\), or should we reject that claim?

Consider two regressions. Regression A has \(b_2 = 12.50\) with \(\operatorname{se}(b_2) = 8.40\); Regression B has \(b_2 = 2.10\) with \(\operatorname{se}(b_2) = 0.35\). Regression A has the bigger coefficient, but its estimate is noisy (\(b_2\) could easily be zero). Regression B’s estimate is small but precise. \(\implies\) We need a formal measure of evidence relative to noise.

Signal vs noise: The \(t\)-statistic measures signal (how far \(b_k\) is from the null value) relative to noise (\(\operatorname{se}(b_k)\)). A large \(|t|\) means the signal dominates the noise.
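A quick numerical check makes the contrast concrete. This is a small Python sketch using the two pairs of numbers quoted above (the variable names are mine):

```python
# Signal vs noise for the two regressions above (values from the text).
b2_A, se_A = 12.50, 8.40   # Regression A: large coefficient, noisy estimate
b2_B, se_B = 2.10, 0.35    # Regression B: small coefficient, precise estimate

# t-statistic against the null value c = 0: signal relative to noise
t_A = (b2_A - 0) / se_A
t_B = (b2_B - 0) / se_B

print(round(t_A, 2))  # about 1.49: the signal barely exceeds the noise
print(round(t_B, 2))  # 6.0: the signal dominates the noise
```

Despite its larger coefficient, Regression A's \(|t|\) is below any conventional critical value, while Regression B's is far above.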

11.2 Anatomy of a Hypothesis Test

Every hypothesis test has five components:

Definition 11.1 (Five Components of a Hypothesis Test)  

  1. Null hypothesis (\(H_0: \beta_k = c\)): the claim we put on trial; it always contains an equality
  2. Alternative hypothesis (\(H_1\)): what we accept if we reject \(H_0\) (one-sided or two-sided)
  3. Test statistic: \(t = (b_k - c) / \operatorname{se}(b_k)\)
  4. Decision rule: rejection region or \(p\)-value threshold
  5. Conclusion: reject or do not reject \(H_0\)

The test statistic:

\[ t = \frac{b_k - c}{\operatorname{se}(b_k)} \sim t_{(N-2)} \quad \text{if } H_0 \text{ is true} \tag{11.1}\]

The numerator measures how far our estimate is from the null value; the denominator scales this distance by the estimation precision. If \(H_0\) is true, \(t\) should be close to zero. If \(H_0\) is false, \(|t|\) will tend to be large. A decision rule (rejection region or \(p\)-value threshold) determines whether \(|t|\) is large enough to reject, and we reach a conclusion.

Think of it as a trial: \(H_0\) is “innocent until proven guilty.” We need strong evidence to convict.

11.3 Two-Sided Tests

Testing \(H_0: \beta_2 = 0\) (does income affect food spending at all?). The test statistic is \(t = 10.21 / 2.09 = 4.88\). At \(\alpha = 0.05\), the critical value is \(t_c = 2.024\). Since \(|4.88| \ge 2.024\), we reject \(H_0\). There is a statistically significant relationship between income and food expenditure.

Testing \(H_0: \beta_2 = 7.5\) (consultant’s claim). The test statistic is \(t = (10.21 - 7.5) / 2.09 = 1.29\). Since \(|1.29| < 2.024\), we do not reject \(H_0\). The data are consistent with \(\beta_2 = 7.5\), but also with \(\beta_2 = 8.5\) (\(t = 0.82\)) or any value inside the confidence interval. Not rejecting does not prove the null is true.
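Both tests follow the same mechanical recipe, which can be sketched as a short Python helper (the function name is mine; the inputs are the rounded values reported above, so the \(t\)-statistics can differ from the text in the last digit):

```python
def t_test_two_sided(b, se, c, t_crit):
    """Return the t-statistic and the reject decision
    for H0: beta = c against H1: beta != c."""
    t = (b - c) / se
    return t, abs(t) >= t_crit

# Food expenditure regression: b2 = 10.21, se(b2) = 2.09, t_c = 2.024 (df = 38)
t1, reject1 = t_test_two_sided(10.21, 2.09, 0.0, 2.024)  # H0: beta2 = 0
t2, reject2 = t_test_two_sided(10.21, 2.09, 7.5, 2.024)  # H0: beta2 = 7.5

print(round(t1, 2), reject1)  # ~4.89, True: reject
print(round(t2, 2), reject2)  # ~1.30, False: do not reject
```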

11.4 Three Equivalent Decision Methods

For a two-sided test at level \(\alpha\), three methods always give the same answer:

  1. Rejection region: reject if \(|t| \ge t_c\). For \(H_0: \beta_2 = 0\): \(|4.88| \ge 2.024\) \(\implies\) reject.
  2. \(p\)-value: reject if \(p \le \alpha\). For \(H_0: \beta_2 = 0\): \(p = 0.00002 \le 0.05\) \(\implies\) reject.
  3. Confidence interval: reject if the null value \(c\) falls outside the CI. For \(H_0: \beta_2 = 0\): \(0 \notin [5.97, 14.45]\) \(\implies\) reject.

These are three windows onto the same test. They always agree because they are algebraic rearrangements of the same inequality: \(|b_k - c| \ge t_c \cdot \operatorname{se}(b_k)\).

All three methods test the same condition. Start from the rejection region:

\[|t| \ge t_c \iff \left|\frac{b_k - c}{\operatorname{se}(b_k)}\right| \ge t_c \iff |b_k - c| \ge t_c \cdot \operatorname{se}(b_k)\]

The CI method: the interval \(b_k \pm t_c \cdot \operatorname{se}(b_k)\) excludes \(c\) exactly when \(|b_k - c| > t_c \cdot \operatorname{se}(b_k)\), which is the same condition (the strict and weak inequalities differ only in the boundary case \(|t| = t_c\), which has probability zero).

The \(p\)-value method: \(p \le \alpha\) exactly when \(|t| \ge t_c\) (since \(t_c\) is defined as the value where the tail area equals \(\alpha/2\)).

All three are equivalent statements of \(|b_k - c| \ge t_c \cdot \operatorname{se}(b_k)\). \(\square\)
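The equivalence of the rejection-region and confidence-interval methods is also easy to verify numerically. A minimal Python check using the food expenditure numbers (the null values tried are arbitrary, chosen away from the exact boundary \(|t| = t_c\)):

```python
b, se, t_crit = 10.21, 2.09, 2.024   # food expenditure example, df = 38

def reject_by_t(c):
    """Rejection-region method: reject when |t| >= t_c."""
    return abs((b - c) / se) >= t_crit

def reject_by_ci(c):
    """CI method: reject when the null value c falls outside b +/- t_c * se."""
    lo, hi = b - t_crit * se, b + t_crit * se
    return c < lo or c > hi

# Away from the measure-zero boundary, the two rules coincide
for c in [0.0, 5.5, 7.5, 8.5, 14.45, 20.0]:
    assert reject_by_t(c) == reject_by_ci(c)
```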

Interactive: rejection region visualizer

Adjust the significance level \(\alpha\) and choose one-tail or two-tail. Enter an observed \(t\)-statistic to see whether it falls in the rejection region.

viewof alpha_level = Inputs.range([0.01, 0.20], {value: 0.05, step: 0.01, label: "Significance level α"})
viewof tail_type = Inputs.radio(["Two-tail", "Right one-tail", "Left one-tail"], {label: "Test type", value: "Two-tail"})
viewof obs_t = Inputs.range([-5, 5], {value: 2.5, step: 0.1, label: "Observed t-statistic"})

t_critical = {
  // Approximate t critical value using normal approximation (good for df > 30)
  const p = tail_type === "Two-tail" ? 1 - alpha_level/2 : 1 - alpha_level;
  const t_val = Math.sqrt(-2 * Math.log(1 - p));
  const c0 = 2.515517, c1 = 0.802853, c2 = 0.010328;
  const d1 = 1.432788, d2 = 0.189269, d3 = 0.001308;
  return t_val - (c0 + c1*t_val + c2*t_val*t_val) / (1 + d1*t_val + d2*t_val*t_val + d3*t_val*t_val*t_val);
}

reject_decision = {
  if (tail_type === "Two-tail") return Math.abs(obs_t) >= t_critical;
  if (tail_type === "Right one-tail") return obs_t >= t_critical;
  return obs_t <= -t_critical;
}

t_density_data = {
  const pts = [];
  for (let x = -5; x <= 5; x += 0.02) {
    // t density approximation (normal for simplicity at df=38)
    const density = Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
    let shaded = false;
    if (tail_type === "Two-tail") {
      shaded = Math.abs(x) >= t_critical;
    } else if (tail_type === "Right one-tail") {
      shaded = x >= t_critical;
    } else {
      shaded = x <= -t_critical;
    }
    pts.push({x, density, shaded, shade_density: shaded ? density : 0});
  }
  return pts;
}

Plot.plot({
  width: 640,
  height: 380,
  x: {label: "t", domain: [-5, 5]},
  y: {label: "Density"},
  marks: [
    Plot.areaY(t_density_data, {x: "x", y: "density", fill: "#1E5A96", fillOpacity: 0.15}),
    Plot.areaY(t_density_data, {x: "x", y: "shade_density", fill: "#C41E3A", fillOpacity: 0.4}),
    Plot.line(t_density_data, {x: "x", y: "density", stroke: "#1E5A96", strokeWidth: 2}),
    Plot.ruleX([obs_t], {stroke: reject_decision ? "#C41E3A" : "#2E8B57", strokeWidth: 3}),
    Plot.ruleY([0]),
    Plot.text(
      [reject_decision ? "REJECT H₀" : "Do not reject H₀"],
      {x: obs_t, y: 0.42, fill: reject_decision ? "#C41E3A" : "#2E8B57", fontWeight: "bold", fontSize: 15, textAnchor: "middle"}
    ),
    Plot.text(
      [`t_c = ${tail_type === "Two-tail" ? "±" : ""}${t_critical.toFixed(3)}`],
      {x: 3.5, y: 0.35, fill: "#888", fontSize: 12}
    )
  ],
  caption: `α = ${alpha_level}, ${tail_type}. Critical value ≈ ${t_critical.toFixed(3)}. Observed t = ${obs_t.toFixed(1)}. ${reject_decision ? "Reject H₀." : "Do not reject H₀."}`
})
Figure 11.1: Rejection region visualizer. The shaded area(s) show where we reject H₀. Enter your t-statistic to see the verdict.

Try it: Set \(\alpha = 0.05\) (two-tail) and slide the observed \(t\) from 0 to 3. The vertical line changes from green (“do not reject”) to red (“reject”) when \(|t|\) crosses the critical value near 1.96 (the widget uses a normal approximation; the exact \(t_{(0.975, 38)}\) critical value is 2.024).

11.5 The \(p\)-Value

Definition 11.2 (\(p\)-Value) The \(p\)-value is the probability of observing a test statistic at least as extreme as the one we calculated, assuming \(H_0\) is true.

Small \(p\) means the observed \(t\) would be very unlikely under \(H_0\), providing strong evidence against it. Large \(p\) means the observed \(t\) is not unusual under \(H_0\), giving no reason to doubt it.

The direction of \(H_1\) determines which tail(s) to measure:

Computing \(p\)-values for each type of alternative.

| Alternative | \(p\)-value formula | Tail(s) |
|---|---|---|
| \(H_1: \beta_k > c\) | \(p = P(t_{(N-2)} \ge t)\) | Right |
| \(H_1: \beta_k < c\) | \(p = P(t_{(N-2)} \le t)\) | Left |
| \(H_1: \beta_k \neq c\) | \(p = 2 \cdot P(t_{(N-2)} \ge \lvert t \rvert)\) | Both |

Caution: One-tail \(p\)-value from software output

Software regression output reports the two-tail \(p\)-value for \(H_0: \beta_k = 0\) by default. For a one-tail test, divide by 2 only when the sign of \(t\) agrees with your alternative. If \(t\) has the wrong sign, the one-tail \(p\)-value is \(1 - p_{\text{two-tail}}/2 > 0.5\), and you cannot reject.
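This sign check is easy to get wrong by hand, so here is a small Python helper encoding the rule (the function name and example numbers are mine, not from the chapter's data):

```python
def one_tail_p(p_two_tail, t, right_tail=True):
    """Convert a reported two-tail p-value (for H0: beta_k = c) into a
    one-tail p-value. right_tail=True means the alternative H1: beta_k > c."""
    sign_agrees = (t > 0) if right_tail else (t < 0)
    if sign_agrees:
        return p_two_tail / 2      # halve only when t matches H1's direction
    return 1 - p_two_tail / 2      # wrong sign: p > 0.5, cannot reject

# Hypothetical software output: two-tail p = 0.04 with t = 2.1
print(round(one_tail_p(0.04, 2.1, right_tail=True), 2))   # 0.02: reject at 5%
print(round(one_tail_p(0.04, 2.1, right_tail=False), 2))  # 0.98: cannot reject
```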

11.6 One-Sided Tests

When economic theory predicts the sign of the effect (income should increase food spending), a one-sided test concentrates all \(\alpha\) in a single tail. For \(H_1: \beta_k > c\), reject if \(t \ge t_{(1-\alpha, N-2)}\). For \(H_1: \beta_k < c\), reject if \(t \le -t_{(1-\alpha, N-2)}\). One-sided tests have a lower critical value (for example, 1.686 vs 2.024 at \(\alpha = 0.05\) with \(df = 38\)), making it easier to reject in the predicted direction. Use one-sided only when theory gives a clear directional prediction before seeing the data.

One-tail vs two-tail: One-tail is more powerful in the predicted direction, but cannot detect effects in the opposite direction. Use one-tail only with strong prior theoretical justification.
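The power difference shows up whenever the \(t\)-statistic lands between the two critical values. A hypothetical illustration in Python, using the \(\alpha = 0.05\), \(df = 38\) critical values quoted above (the value \(t = 1.90\) is made up):

```python
t = 1.90                           # hypothetical observed t-statistic
t_c_one, t_c_two = 1.686, 2.024    # one-tail and two-tail critical values

print(t >= t_c_one)        # True: the right one-tail test rejects
print(abs(t) >= t_c_two)   # False: the two-tail test does not reject
```

The same estimate is “significant” under one rule and not the other, which is why the choice between them must be made on theoretical grounds before seeing the data.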

11.7 Type I and Type II Errors

The two types of errors in hypothesis testing.

| Decision | \(H_0\) is actually true | \(H_0\) is actually false |
|---|---|---|
| Reject \(H_0\) | Type I error (probability \(= \alpha\)) | Correct |
| Do not reject \(H_0\) | Correct | Type II error (probability \(= \beta\)) |

A Type I error (false positive) means rejecting a true \(H_0\); its probability is \(\alpha\), which we control by choosing the significance level. A Type II error (false negative) means failing to reject a false \(H_0\); its probability depends on the true parameter value and is not directly controlled. Power \(= 1 - \beta\) is the probability of correctly rejecting a false \(H_0\).
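The claim that the Type I error rate equals \(\alpha\) can be checked by simulation. A sketch in Python: to stay within the standard library it uses a z-test with known variance rather than the chapter's \(t\)-test, and all simulation settings (seed, sample size, number of replications) are mine:

```python
import random

random.seed(42)                  # reproducible
alpha, z_crit = 0.05, 1.96       # two-tail critical value (standard normal)
n_sims, n = 10_000, 50

rejections = 0
for _ in range(n_sims):
    # Draw a sample from N(0, 1), so H0: mu = 0 is TRUE by construction
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / n ** 0.5)   # z = mean / se, sigma known
    if abs(z) >= z_crit:
        rejections += 1                       # a Type I error

print(rejections / n_sims)       # close to alpha = 0.05
```

Rerunning with a smaller \(\alpha\) (and the matching critical value) drives the false-positive rate down, at the cost of more Type II errors when \(H_0\) is false.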

Warning: The tradeoff between Type I and Type II errors

There is always a tradeoff: lowering \(\alpha\) reduces false positives but increases false negatives. The choice of \(\alpha\) should reflect the relative costs of the two error types. In the supermarket example from the slides (testing whether income raises food spending by more than $5.50), the cost of a false positive (building an unprofitable store) is high, so a conservative \(\alpha = 0.01\) is appropriate.

\(\implies\) “Do not reject” is weaker than “reject.” It means the data cannot distinguish \(\beta_2\) from the null value, not that \(\beta_2\) is the null value. Always report the magnitude of \(b_k\) alongside the \(t\)-statistic; statistical significance tells you whether the effect is distinguishable from zero, not whether it is large enough to care about.

flowchart TD
    A["State H₀ and H₁"] --> B["Compute t = (bₖ − c) / se(bₖ)"]
    B --> C{"Two-tail or<br/>one-tail?"}
    C -->|Two-tail| D["Reject if |t| ≥ t_c"]
    C -->|Right one-tail| E["Reject if t ≥ t_c"]
    C -->|Left one-tail| F["Reject if t ≤ −t_c"]
    D --> G{"Reject?"}
    E --> G
    F --> G
    G -->|Yes| H["Evidence against H₀<br/>at α level"]
    G -->|No| I["Cannot reject H₀<br/>(does NOT prove H₀)"]

    style A fill:#1E5A96,color:#fff
    style B fill:#1E5A96,color:#fff
    style H fill:#C41E3A,color:#fff
    style I fill:#2E8B57,color:#fff
Figure 11.2: Hypothesis testing decision flowchart.

11.8 Practice

A researcher estimates \(b_2 = 3.5\) with \(\operatorname{se}(b_2) = 1.4\) and \(N = 30\) (\(df = 28\)). Test \(H_0: \beta_2 = 0\) vs \(H_1: \beta_2 \neq 0\) at \(\alpha = 0.05\). The critical value is \(t_{(0.975, 28)} = 2.048\).

\(t = 3.5 / 1.4 = 2.5\). Since \(|2.5| = 2.5 \ge 2.048\), reject \(H_0\). Equivalently, the 95% CI is \(3.5 \pm 2.048 \times 1.4 = 3.5 \pm 2.87 = [0.63, 6.37]\); since \(0 \notin [0.63, 6.37]\), we reject. The \(p\)-value is \(2 \cdot P(t_{28} \ge 2.5) \approx 0.019 < 0.05\), confirming the rejection. All three methods agree.
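The arithmetic in this solution can be verified with a few lines of Python (only the numbers given in the exercise are used):

```python
b2, se, t_crit = 3.5, 1.4, 2.048   # estimate, std. error, t_(0.975, 28)

t = b2 / se
print(round(t, 2), abs(t) >= t_crit)       # 2.5 True: reject H0

half = t_crit * se                          # CI half-width = 2.87
ci_lo, ci_hi = b2 - half, b2 + half
print(round(ci_lo, 2), round(ci_hi, 2))     # 0.63 6.37: zero is outside
```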
