Hypothesis Testing
Visualize the relationship between significance level, effect size, and statistical power in a two-tailed Z-test.
Concept Overview
Hypothesis testing is a statistical method used to make decisions about a population based on a sample of data. It involves making a specific claim (the null hypothesis) and then calculating the probability of observing the sample data if that claim were true. This framework provides a principled way to weigh evidence and distinguish true effects from random noise.
Definition
A hypothesis test formally compares two mutually exclusive statements about population parameters: the null hypothesis (H0), which asserts no effect or no difference (e.g., μ = μ0), and the alternative hypothesis (H1), which asserts the effect we are looking for (e.g., μ ≠ μ0).
We compare the test statistic to critical values determined by the significance level (α) to decide whether to reject H0.
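The decision rule above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name and the example numbers (μ0 = 100, σ = 15, n = 100) are chosen for demonstration, not taken from the module itself.

```python
from statistics import NormalDist

def two_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed Z-test of H0: mu = mu0 against H1: mu != mu0."""
    std_norm = NormalDist()
    # Standardize the sample mean under H0.
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Critical value puts alpha/2 probability in each tail (about 1.96 for alpha = 0.05).
    z_crit = std_norm.inv_cdf(1 - alpha / 2)
    # Two-sided p-value: probability of a statistic at least this extreme under H0.
    p_value = 2 * (1 - std_norm.cdf(abs(z)))
    return z, z_crit, p_value, abs(z) > z_crit

z, z_crit, p, reject = two_tailed_z_test(sample_mean=103, mu0=100, sigma=15, n=100)
print(f"z = {z:.2f}, critical = ±{z_crit:.2f}, p = {p:.4f}, reject H0: {reject}")
# z = 2.00, critical = ±1.96, p = 0.0455, reject H0: True
```

Note that the decision "reject" and the comparison "p < α" are two views of the same rule: |z| exceeds the critical value exactly when the p-value falls below α.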
Key Concepts
Type I Error (False Positive, α)
A Type I error occurs when we reject a true null hypothesis. The probability of this happening is denoted by α (the significance level). In the visualization, this is the red-shaded area under the null distribution's tails. By setting α (commonly to 0.05), we explicitly control our tolerance for false positives.
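That α really is the false-positive rate can be checked by simulation: generate many samples from a world where H0 is true and count how often the test rejects anyway. The parameter values below are illustrative assumptions.

```python
import random
from statistics import NormalDist

random.seed(0)
mu0, sigma, n, alpha = 100.0, 15.0, 30, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

# Simulate many experiments in which H0 is actually true,
# and count how often the two-tailed Z-test wrongly rejects it.
trials = 20_000
rejections = 0
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (sum(sample) / n - mu0) / (sigma / n ** 0.5)
    rejections += abs(z) > z_crit

print(f"Empirical Type I error rate: {rejections / trials:.3f}")  # close to alpha = 0.05
```

The empirical rejection rate converges to α as the number of trials grows, which is precisely what "controlling our tolerance for false positives" means.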
Type II Error (False Negative, β)
A Type II error occurs when we fail to reject a false null hypothesis. The probability of this is denoted by β. In the visualization, this is the yellow-shaded area under the true (alternative) distribution that falls within the non-rejection region of the null hypothesis.
Statistical Power (1 - β)
Power is the probability of correctly rejecting a false null hypothesis. It is visually represented by the unshaded area under the true distribution that falls outside the critical values. Power depends on several factors:
- Effect Size: A larger true difference from μ₀ increases power.
- Sample Size (n): Larger n decreases the standard error (narrower distributions), increasing power.
- Variance (σ²): Smaller population variance increases power.
- Significance Level (α): A higher α increases power, but at the cost of more Type I errors.
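The power of a two-tailed Z-test has a closed form: under the alternative, the standardized statistic is centered at δ = (μ1 − μ0)/(σ/√n) rather than 0, so power = Φ(δ − z_crit) + Φ(−δ − z_crit). The sketch below, with illustrative (assumed) parameter values, confirms each factor listed above moves power in the stated direction.

```python
from statistics import NormalDist

def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a two-tailed Z-test when the true mean is mu1."""
    std_norm = NormalDist()
    z_crit = std_norm.inv_cdf(1 - alpha / 2)
    # Under H1 the standardized statistic is centered at delta, not 0.
    delta = (mu1 - mu0) / (sigma / n ** 0.5)
    # Probability the statistic lands beyond either critical value.
    return std_norm.cdf(delta - z_crit) + std_norm.cdf(-delta - z_crit)

base = z_test_power(mu0=100, mu1=105, sigma=15, n=30)
print(f"baseline:      {base:.3f}")
print(f"larger effect: {z_test_power(100, 110, 15, 30):.3f}")   # bigger mu1 - mu0
print(f"larger n:      {z_test_power(100, 105, 15, 60):.3f}")   # n doubled
print(f"smaller sigma: {z_test_power(100, 105, 10, 30):.3f}")   # less variance
print(f"higher alpha:  {z_test_power(100, 105, 15, 30, alpha=0.10):.3f}")
```

Every variant prints a higher value than the baseline, matching the four bullet points.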
Historical Context
Modern hypothesis testing is a synthesis of two historically distinct approaches developed in the early 20th century. Ronald Fisher introduced significance testing, focusing on calculating p-values to evaluate evidence against a single null hypothesis. Jerzy Neyman and Egon Pearson later introduced the concepts of alternative hypotheses, Type I and Type II errors, and statistical power to create a formal decision-making framework. Today's standard practice blends elements of both schools, a synthesis that remains controversial among some statisticians.
Applications
- Medicine & Clinical Trials: Determining if a new drug is significantly more effective than a placebo (or existing treatment) while controlling the risk of approving an ineffective drug (Type I error).
- A/B Testing in Tech: Comparing two versions of a webpage or app to see which leads to higher user engagement or conversion rates, ensuring observed differences aren't just statistical noise.
- Quality Control: Testing samples of manufactured products to decide whether an entire batch meets safety or performance specifications.
- Scientific Research: Evaluating experimental data across fields like psychology, economics, and biology to establish evidence for new theories.
Related Concepts
- Central Limit Theorem — justifies using normal distributions for test statistics by showing sample means are normally distributed for large n.
- Probability Distributions — the underlying mathematical models (like Normal, t, Chi-square) used to compute p-values and critical regions.
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Hypothesis Testing module.
Try Hypothesis Testing on Riano →