Maximum Likelihood Estimation

Find the parameters of a probability distribution that best explain the observed data.

Concept Overview

Maximum Likelihood Estimation (MLE) is a foundational method in statistics for estimating the parameters of an assumed probability distribution, given some observed data. The goal is simple: find the parameter values that make the observed data "most probable" under the chosen statistical model. If we have a set of data points, MLE seeks the distribution curve that best fits those points by maximizing the likelihood function.

Mathematical Definition

Let X = {x_1, x_2, ..., x_n} be a set of independent and identically distributed (i.i.d.) observations drawn from a probability distribution with an unknown parameter vector θ. The likelihood function L(θ | X) is the joint probability of the observed data as a function of the parameters:

L(θ | X) = f(x_1 | θ) · f(x_2 | θ) · ... · f(x_n | θ) = Π_{i=1}^{n} f(x_i | θ)

Because products of many small probabilities can cause numerical underflow and are harder to differentiate, we typically maximize the log-likelihood function, l(θ | X), which transforms the product into a sum:

l(θ | X) = ln( L(θ | X) ) = Σ_{i=1}^{n} ln( f(x_i | θ) )

The Maximum Likelihood Estimate, denoted θ̂, is the value that maximizes this function: θ̂ = argmax_θ l(θ | X).
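As a concrete illustration, the log-likelihood can be maximized numerically. The sketch below (using a small made-up dataset) fits an exponential distribution, f(x | λ) = λe^(−λx), by a simple grid search over λ and checks the result against the known closed-form MLE, λ̂ = 1/x̄:

```python
import math

def exp_log_likelihood(lam, data):
    # l(λ | X) = Σ ln f(x_i | λ) = Σ [ ln λ − λ x_i ]
    return sum(math.log(lam) - lam * x for x in data)

# Hypothetical observations (e.g. waiting times)
data = [0.5, 1.2, 0.3, 2.1, 0.8]

# Grid search: try λ in (0, 5] in steps of 0.001 and keep the maximizer
candidates = [i / 1000 for i in range(1, 5001)]
lam_hat = max(candidates, key=lambda lam: exp_log_likelihood(lam, data))

# Closed-form MLE for the exponential distribution: λ̂ = 1 / sample mean
closed_form = len(data) / sum(data)
```

The two estimates agree to within the grid resolution; in practice one would use a proper optimizer (or the closed form where it exists) rather than a grid.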

Key Concepts

Likelihood vs. Probability

While mathematically related, they answer different questions. Probability predicts future data given a known set of parameters (e.g., "Given a fair coin, what is the probability of 5 heads?"). Likelihood works backwards: it evaluates the plausibility of different parameters given the already observed data (e.g., "Given we saw 5 heads, what is the likelihood the coin is fair vs. biased?").
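The coin question can be made concrete. The sketch below (assuming, hypothetically, that the 5 heads came from 10 flips) evaluates the binomial likelihood at two candidate values of the bias p:

```python
from math import comb

def coin_likelihood(p, heads=5, flips=10):
    # Binomial likelihood: L(p | data) = C(n, k) · p^k · (1 − p)^(n − k)
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

fair = coin_likelihood(0.5)    # plausibility of a fair coin
biased = coin_likelihood(0.7)  # plausibility of a 70%-heads coin
# The fair coin has the higher likelihood here; p = 0.5 (the sample
# proportion of heads) is in fact the MLE for this data.
```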

Properties of MLE

  • Consistency: As the sample size n approaches infinity, the MLE converges in probability to the true parameter value.
  • Asymptotic Normality: For large samples, the distribution of the MLE approaches a normal distribution centered on the true parameter.
  • Efficiency: The MLE is asymptotically efficient: under standard regularity conditions, its variance attains the Cramér-Rao lower bound as the sample size grows, so no consistent estimator does asymptotically better.
  • Invariance: If θ̂ is the MLE of θ, then for any function g, the MLE of g(θ) is g(θ̂).
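The consistency property is easy to see in a quick simulation. The sketch below (with an arbitrary true mean of 3.0 and a fixed seed) estimates the mean of a normal distribution from increasingly large samples; the MLE of μ here is simply the sample mean:

```python
import random

random.seed(0)
TRUE_MU = 3.0

def mle_mean(n):
    # Draw n observations from N(TRUE_MU, 1) and return the sample mean,
    # which is the MLE of μ for a normal distribution.
    sample = [random.gauss(TRUE_MU, 1.0) for _ in range(n)]
    return sum(sample) / n

# The estimation error typically shrinks as the sample grows (consistency)
errors = {n: abs(mle_mean(n) - TRUE_MU) for n in (10, 1_000, 100_000)}
```

With 100,000 observations the estimate lands very close to the true mean, while small samples scatter much more widely.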

The Normal Distribution Example

For a set of data presumed to come from a normal distribution with unknown mean μ and variance σ², maximizing the log-likelihood yields the familiar formulas for sample mean and sample variance:

μ̂ = (1/n) Σ_{i=1}^{n} x_i

σ̂² = (1/n) Σ_{i=1}^{n} (x_i − μ̂)²

Note that the MLE for the variance divides by n rather than (n − 1), so it is biased downward for finite samples; the bias vanishes as n grows, and the estimator remains consistent.
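These formulas are straightforward to verify in code. The sketch below (on a small made-up dataset) computes the MLE of μ and σ² by hand and compares them with Python's statistics module, where pvariance divides by n and variance divides by n − 1:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

mu_hat = sum(data) / n                                   # MLE of μ
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n    # MLE of σ² (divides by n)

# statistics.pvariance also divides by n and matches the MLE exactly;
# statistics.variance divides by n − 1 and is the unbiased estimator.
population_var = statistics.pvariance(data)
unbiased_var = statistics.variance(data)
```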

Historical Context

The method of maximum likelihood was heavily promoted, formalized, and named by British statistician Ronald A. Fisher between 1912 and 1922. While earlier mathematicians like Carl Friedrich Gauss and Pierre-Simon Laplace had used similar principles (often framed as inverse probability), Fisher established MLE as a unified, powerful framework that forms the bedrock of modern frequentist statistics.

Real-world Applications

  • Machine Learning: Training many models (like Logistic Regression and Neural Networks using cross-entropy loss) is fundamentally equivalent to performing Maximum Likelihood Estimation.
  • Genetics: Estimating mutation rates and tracing phylogenetic trees by finding the evolutionary parameters that make the observed genetic sequences most likely.
  • Econometrics: Fitting structural models to economic data to estimate demand elasticity or market volatility.
  • Signal Processing: Estimating the parameters of a signal (like its frequency or phase) corrupted by Gaussian noise.
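The machine-learning connection in the first bullet can be made explicit: for binary labels, the summed cross-entropy loss is exactly the negative Bernoulli log-likelihood, so minimizing one maximizes the other. A minimal sketch, with made-up predicted probabilities and labels:

```python
import math

def bernoulli_neg_log_likelihood(probs, labels):
    # −l(θ) = −Σ [ y_i ln p_i + (1 − y_i) ln(1 − p_i) ]
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels))

def binary_cross_entropy(probs, labels):
    # The usual (summed) binary cross-entropy loss from machine learning
    return sum(-(y * math.log(p) + (1 - y) * math.log(1 - p))
               for p, y in zip(probs, labels))

probs = [0.9, 0.2, 0.7]   # hypothetical model outputs
labels = [1, 0, 1]        # observed binary outcomes

# The two quantities are term-for-term identical, so training with
# cross-entropy is maximum likelihood estimation in disguise.
```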

Related Concepts

  • Probability Distributions: Understanding the underlying probability density functions (PDFs) that define the likelihood function.
  • Bayes' Theorem: Bayesian inference extends MLE by incorporating prior beliefs about the parameters (resulting in Maximum A Posteriori or MAP estimation).
  • Law of Large Numbers: Explains why MLE estimates become more accurate (consistent) as more data is collected.
