Entropy & Information Theory
Visualize Shannon entropy, probability distributions, and information content for binary and ternary systems.
Concept Overview
Information theory, developed by Claude Shannon in 1948, provides a mathematical framework for quantifying information, communication, and uncertainty. The foundational concept is Entropy (denoted H), which measures the average uncertainty or "surprise" associated with outcomes of a random variable. Intuitively, a system has high entropy when its outcomes are hard to predict, and low entropy when its outcomes are almost certain.
Mathematical Definition
The information content (or surprisal) of a single event depends inversely on its probability. A highly probable event carries very little information; a rare event carries a large amount. For an event with probability p, its information content in bits is:

I(p) = −log2(p)
A fair coin flip (p = 0.5) conveys exactly −log2(0.5) = 1 bit. Shannon Entropy is the expected value of the information content over all outcomes in a distribution, i.e. the average surprise of the system:

H = −Σ pi log2(pi), summed over all outcomes i
For a binary system with outcome probabilities p and (1 − p), the entropy takes the form of a concave, symmetric curve that peaks at p = 0.5:

H(p) = −p log2(p) − (1 − p) log2(1 − p)
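These definitions can be checked directly with the Python standard library. The function names below (surprisal, binary_entropy) are illustrative, not part of the module; this is a minimal sketch of the formulas above:

```python
import math

def surprisal(p):
    """Information content of an event with probability p, in bits."""
    return -math.log2(p)

def binary_entropy(p):
    """Entropy H(p) of a binary system with outcomes p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no surprise
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(surprisal(0.5))        # 1.0 — a fair coin flip conveys exactly 1 bit
print(binary_entropy(0.5))   # 1.0 — the curve peaks at p = 0.5
print(binary_entropy(0.9))   # ≈ 0.469 — a biased coin is more predictable
```

Sweeping p from 0 to 1 traces out the concave binary entropy curve described above.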
Key Concepts
- Minimum Entropy (H = 0): Occurs when one outcome has probability 1 and all others are 0. There is total certainty and no surprise.
- Maximum Entropy (H = log2(n)): Occurs when all n outcomes are equally likely (pi = 1/n). The uniform distribution is the most uncertain, so its entropy, log2(n) bits, is the maximum possible.
- Bits as a Unit: Using log2 means entropy measures the average number of binary yes/no questions needed to learn the outcome.
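A quick numerical check of these properties, using a plain-Python entropy helper (illustrative, not the module's code):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of p_i."""
    # Terms with p = 0 contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Maximum entropy: uniform over 4 outcomes gives log2(4) = 2.0 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0

# A skewed distribution is far more predictable.
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ≈ 0.242
```

With log2 as the base, the uniform case says that identifying one of 4 equally likely outcomes takes on average 2 binary yes/no questions.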
Historical Context
Claude Shannon's 1948 paper "A Mathematical Theory of Communication" introduced the modern notion of entropy for information sources. By drawing analogies with thermodynamic entropy, Shannon showed how to quantify the fundamental limits of data compression and reliable communication over noisy channels, founding the field of information theory.
Real-world Applications
- Data Compression: Shannon's source coding theorem establishes entropy as the fundamental lower bound on the average number of bits needed to losslessly compress a sequence of symbols.
- Machine Learning: Cross-Entropy and Information Gain (derived from entropy) are central to training classifiers and building decision trees.
- Cryptography: High entropy is essential for generating secure cryptographic keys, ensuring unpredictability against attackers.
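The compression bound can be illustrated numerically. The sketch below (an illustration under the assumption of a memoryless source, not the module's code) computes the empirical entropy of a string, which lower-bounds the average bits per symbol of any lossless code for that symbol distribution:

```python
import math
from collections import Counter

def empirical_entropy(seq):
    """Entropy in bits/symbol of the empirical symbol distribution of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

text = "abracadabra"
h = empirical_entropy(text)
print(f"{h:.3f} bits/symbol")                      # ≈ 2.040
print(f"≈ {h * len(text):.1f} bits total, vs {8 * len(text)} bits as raw 8-bit characters")
```

Because "abracadabra" is dominated by the letter "a", its entropy is well under log2(5) ≈ 2.32 bits, the maximum for a 5-symbol alphabet.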
Related Concepts
- Cross-Entropy: Measures the average bits needed when encoding data from a true distribution using a "wrong" model distribution.
- KL Divergence: Quantifies how much one probability distribution diverges from a reference distribution using entropy-like terms.
- Mutual Information: Measures how much knowing one random variable reduces uncertainty (entropy) about another.
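These quantities are tightly linked: KL divergence is exactly the gap between cross-entropy and entropy. A small Python sketch verifying this identity (the distributions p and q are arbitrary illustrative choices):

```python
import math

def entropy(p):
    """H(p): average bits with a code matched to p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q): average bits to encode samples from p using a code optimal for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q): the extra bits paid for modeling p with the wrong distribution q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution (illustrative)
q = [0.25, 0.5, 0.25]   # "wrong" model distribution (illustrative)

print(entropy(p))            # 1.5
print(cross_entropy(p, q))   # 1.75 — always >= entropy(p)
print(kl_divergence(p, q))   # 0.25 — the overhead of the wrong model

# Identity: H(p, q) = H(p) + D_KL(p || q)
assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12
```

This identity is why minimizing cross-entropy during classifier training is equivalent to minimizing the KL divergence between the data distribution and the model.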
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Entropy & Information Theory module.
Try Entropy & Information Theory on Riano →