Probability & Statistics

Coupon Collector Problem

Visualize the expected number of random draws required to collect all unique coupons in a set.

Coupon Collector's Problem

Concept Overview

The Coupon Collector's Problem is a classic probability theory problem that asks the following question: if each box of a brand of cereal contains a random coupon, and there are n different types of coupons, how many boxes do you expect to buy to collect all n coupons? Assuming each type of coupon is equally likely to be found in any given box, the problem explores the expected number of trials required to observe a complete set of outcomes.

Mathematical Definition

Let T be the total number of draws required to collect all n distinct coupons. We can break down the process into stages. Let ti be the time (number of draws) to collect the i-th new coupon after i-1 distinct coupons have already been collected.

Therefore, T = t1 + t2 + ... + tn.

When you have already collected i-1 coupons, there are n - (i - 1) new coupons remaining. The probability of drawing a new coupon on any single draw is:

pi = (n - (i - 1)) / n

The number of draws ti follows a geometric distribution with success probability pi. The expected value of a geometric random variable is 1/p. Thus, the expected number of draws to get the i-th new coupon is:

E[ti] = 1 / pi = n / (n - i + 1)

By the linearity of expectation, the expected total time to collect all n coupons is the sum of the expected times for each stage:

E[T] = E[t1] + E[t2] + ... + E[tn]
E[T] = (n / n) + (n / (n - 1)) + ... + (n / 1)
E[T] = n · (1/n + 1/(n-1) + ... + 1)
E[T] = n · Hn

Where Hn is the n-th harmonic number.

Key Concepts

Harmonic Number Approximation

For large values of n, the harmonic number Hn can be approximated using the natural logarithm:

Hn ≈ ln(n) + γ + 1/(2n)

Where γ (gamma) is the Euler-Mascheroni constant (approximately 0.57721). Consequently, the expected number of trials grows slightly faster than n·ln(n):

E[T] ≈ n · ln(n) + γ · n + 0.5

Variance

Since the stages ti are independent, the variance of T is the sum of their variances. For a geometric distribution, Var(ti) = (1 - pi) / pi2.

Var(T) ≈ (π2 / 6) · n2 - n · ln(n)

This implies that the standard deviation is on the order of n, which is smaller than the expected value n·ln(n) for large n, leading to a relatively sharp concentration of the distribution around the mean.

Historical Context

The Coupon Collector's Problem was formalized in the mathematical literature around the mid-20th century. However, its foundational concepts, particularly geometric distributions and harmonic series, were well understood long before. It was named after real-world promotions from the late 19th and early 20th centuries, where companies included collectible cards or coupons in product packaging to encourage repeat purchases.

Real-world Applications

  • Computer Science (Hashing & Caching): Understanding how many random items must be hashed before every bucket in a hash table contains at least one item, or analyzing cache hit rates for random eviction policies.
  • Network Routing: In gossip protocols or random walk routing, determining how many messages must be sent to ensure that all nodes in a network have received a piece of information.
  • Genetics and Ecology: Estimating the number of samples or specimens an ecologist must capture to observe all species in a given area, or determining required coverage depth in genome sequencing to ensure every part of the genome is sequenced at least once.

Related Concepts

  • Birthday Paradox — Another classical combinatorics problem exploring collisions.
  • Geometric Distribution — The underlying probability distribution for each stage ti.
  • Markov Chains — The process can be modeled as an absorbing Markov chain.

Experience it interactively

Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Coupon Collector Problem module.

Try Coupon Collector Problem on Riano →

More in Probability & Statistics