Hypothesis Testing
Visualize the relationship between significance level, effect size, and statistical power in a two-tailed Z-test.
Concept Overview
Hypothesis testing is a statistical method used to make decisions about a population based on a sample of data. It involves making a specific claim (the null hypothesis) and then calculating the probability of observing the sample data if that claim were true. This framework provides a principled way to weigh evidence and distinguish true effects from random noise.
Definition
A hypothesis test formally compares two mutually exclusive statements about population parameters: the null hypothesis (H0), which asserts no effect or no difference (e.g., μ = μ0), and the alternative hypothesis (H1), which asserts the effect we are looking for (e.g., μ ≠ μ0).
We compare the test statistic to critical values determined by the significance level (α) to decide whether to reject H0.
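The decision rule above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name and the example numbers (μ0 = 100, σ = 15, n = 100) are chosen for demonstration, not taken from the module itself.

```python
from statistics import NormalDist

def two_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed Z-test of H0: mu = mu0 against H1: mu != mu0."""
    std_norm = NormalDist()
    # Standardize the sample mean under H0.
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Critical value puts alpha/2 probability in each tail (about 1.96 for alpha = 0.05).
    z_crit = std_norm.inv_cdf(1 - alpha / 2)
    # Two-sided p-value: probability of a statistic at least this extreme under H0.
    p_value = 2 * (1 - std_norm.cdf(abs(z)))
    return z, z_crit, p_value, abs(z) > z_crit

z, z_crit, p, reject = two_tailed_z_test(sample_mean=103, mu0=100, sigma=15, n=100)
print(f"z = {z:.2f}, critical = ±{z_crit:.2f}, p = {p:.4f}, reject H0: {reject}")
# z = 2.00, critical = ±1.96, p = 0.0455, reject H0: True
```

Note that the decision "reject" and the comparison "p < α" are two views of the same rule: |z| exceeds the critical value exactly when the p-value falls below α.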
Key Concepts
Type I Error (False Positive, α)
A Type I error occurs when we reject a true null hypothesis. The probability of this happening is denoted by α (the significance level). In the visualization, this is the red-shaded area under the null distribution's tails. By setting α (commonly to 0.05), we explicitly control our tolerance for false positives.
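That α really is the false-positive rate can be checked by simulation: generate many samples from a world where H0 is true and count how often the test rejects anyway. The parameter values below are illustrative assumptions.

```python
import random
from statistics import NormalDist

random.seed(0)
mu0, sigma, n, alpha = 100.0, 15.0, 30, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

# Simulate many experiments in which H0 is actually true,
# and count how often the two-tailed Z-test wrongly rejects it.
trials = 20_000
rejections = 0
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (sum(sample) / n - mu0) / (sigma / n ** 0.5)
    rejections += abs(z) > z_crit

print(f"Empirical Type I error rate: {rejections / trials:.3f}")  # close to alpha = 0.05
```

The empirical rejection rate converges to α as the number of trials grows, which is precisely what "controlling our tolerance for false positives" means.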
Type II Error (False Negative, β)
A Type II error occurs when we fail to reject a false null hypothesis. The probability of this is denoted by β. In the visualization, this is the yellow-shaded area under the true (alternative) distribution that falls within the non-rejection region of the null hypothesis.
Statistical Power (1 - β)
Power is the probability of correctly rejecting a false null hypothesis. It is visually represented by the unshaded area under the true distribution that falls outside the critical values. Power depends on several factors:
- Effect Size: A larger true difference from μ₀ increases power.
- Sample Size (n): Larger n decreases the standard error (narrower distributions), increasing power.
- Variance (σ²): Smaller population variance increases power.
- Significance Level (α): A higher α increases power, but at the cost of more Type I errors.
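The power of a two-tailed Z-test has a closed form: under the alternative, the standardized statistic is centered at δ = (μ1 − μ0)/(σ/√n) rather than 0, so power = Φ(δ − z_crit) + Φ(−δ − z_crit). The sketch below, with illustrative (assumed) parameter values, confirms each factor listed above moves power in the stated direction.

```python
from statistics import NormalDist

def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a two-tailed Z-test when the true mean is mu1."""
    std_norm = NormalDist()
    z_crit = std_norm.inv_cdf(1 - alpha / 2)
    # Under H1 the standardized statistic is centered at delta, not 0.
    delta = (mu1 - mu0) / (sigma / n ** 0.5)
    # Probability the statistic lands beyond either critical value.
    return std_norm.cdf(delta - z_crit) + std_norm.cdf(-delta - z_crit)

base = z_test_power(mu0=100, mu1=105, sigma=15, n=30)
print(f"baseline:      {base:.3f}")
print(f"larger effect: {z_test_power(100, 110, 15, 30):.3f}")   # bigger mu1 - mu0
print(f"larger n:      {z_test_power(100, 105, 15, 60):.3f}")   # n doubled
print(f"smaller sigma: {z_test_power(100, 105, 10, 30):.3f}")   # less variance
print(f"higher alpha:  {z_test_power(100, 105, 15, 30, alpha=0.10):.3f}")
```

Every variant prints a higher value than the baseline, matching the four bullet points.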
Historical Context
Modern hypothesis testing is a synthesis of two historically distinct approaches developed in the early 20th century. Ronald Fisher introduced significance testing, focusing on calculating p-values to evaluate evidence against a single null hypothesis. Jerzy Neyman and Egon Pearson later introduced the concepts of alternative hypotheses, Type I and Type II errors, and statistical power to create a formal decision-making framework. Today's standard practice blends elements of both schools, a synthesis that remains controversial among some statisticians.
Applications
- Medicine & Clinical Trials: Determining if a new drug is significantly more effective than a placebo (or existing treatment) while controlling the risk of approving an ineffective drug (Type I error).
- A/B Testing in Tech: Comparing two versions of a webpage or app to see which leads to higher user engagement or conversion rates, ensuring observed differences aren't just statistical noise.
- Quality Control: Testing samples of manufactured products to decide whether an entire batch meets safety or performance specifications.
- Scientific Research: Evaluating experimental data across fields like psychology, economics, and biology to establish evidence for new theories.
Related Concepts
- Central Limit Theorem — justifies using normal distributions for test statistics by showing sample means are normally distributed for large n.
- Probability Distributions — the underlying mathematical models (like Normal, t, Chi-square) used to compute p-values and critical regions.
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Hypothesis Testing module.
Try Hypothesis Testing on Riano →