K-Means Clustering
Partitioning data into clusters by iteratively assigning points and updating centroids.
Concept Overview
K-Means is one of the most widely used unsupervised learning algorithms. Given a set of unlabeled data points, it partitions them into k groups (clusters) such that points within each cluster are as close as possible to the cluster's center (centroid). The algorithm is remarkably simple — alternating between two steps: assigning each point to its nearest centroid, then moving each centroid to the mean of its assigned points — yet it is powerful enough to reveal structure in complex datasets.
Mathematical Definition
Given a dataset of n points { x1, x2, …, xn } in ℝd, K-Means aims to find k centroids { μ1, μ2, …, μk } that minimize the within-cluster sum of squares (WCSS):
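The objective itself appears to have been dropped from the text; a standard statement of it, consistent with the surrounding description (J denotes the objective, and C_j the set of points assigned to centroid μ_j), is:

```latex
J \;=\; \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

Minimizing J means choosing both the assignments C_j and the centroids μ_j so that total squared distance from each point to its cluster center is as small as possible.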
Lloyd's Algorithm (Standard K-Means)
The standard K-Means algorithm, attributed to Stuart Lloyd (1957, published 1982), iterates between two steps until convergence:
1. Assignment step: assign each point to the cluster whose centroid is nearest (in Euclidean distance).
2. Update step: recompute each centroid as the mean of the points currently assigned to it.
Each iteration is guaranteed to decrease (or maintain) the objective function J, and since there are finitely many possible assignments, the algorithm always terminates. However, it may converge to a local minimum rather than the global optimum.
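The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration under assumed names (`lloyd_kmeans` and the test data are mine), not a production implementation:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize by picking k distinct data points at random (Forgy-style init).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (keeping the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = lloyd_kmeans(X, k=2)
```

Note that the early-exit check compares centroids, not assignments; either works, since assignments can only change while centroids still move.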
Key Concepts
Choosing k
The number of clusters k must be specified in advance — it is not learned from the data. Common methods for selecting k include:
- Elbow method: Plot WCSS against k and look for the "elbow" — the point where adding more clusters yields diminishing returns.
- Silhouette score: Measures how similar each point is to its own cluster vs. neighboring clusters. Higher is better.
- Gap statistic: Compares the WCSS to that of a uniformly distributed reference dataset.
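The elbow method from the list above can be sketched with a bare-bones Lloyd loop (the helper name `wcss_for_k` and the synthetic data are my own):

```python
import numpy as np

def wcss_for_k(X, k, n_iters=50, seed=0):
    """Run a basic Lloyd loop and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(1)
        C = np.array([X[labels == j].mean(0) if (labels == j).any() else C[j]
                      for j in range(k)])
    labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(1)
    return ((X - C[labels]) ** 2).sum()

# Three well-separated blobs, so the "true" k is 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (40, 2)) for m in (0, 4, 8)])
curve = {k: wcss_for_k(X, k) for k in range(1, 7)}
# WCSS always falls as k grows; the elbow is where the drop flattens —
# here the decrease from k=3 onward is small compared to k=1 → k=3.
```

In practice one would plot `curve` and pick k visually, and rerun each k with several initializations to avoid judging a bad local minimum.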
Initialization Sensitivity
K-Means is sensitive to initial centroid placement. Poor initialization can lead to suboptimal clusterings. The K-Means++ initialization (Arthur & Vassilvitskii, 2007) addresses this by choosing initial centroids that are spread far apart, leading to provably better results on average.
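The K-Means++ seeding rule is simple to state: pick the first centroid uniformly at random, then pick each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far. A minimal sketch (function name `kmeanspp_init` is mine):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """K-Means++ seeding: new centroids are sampled with probability
    proportional to squared distance from the nearest existing centroid."""
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Two blobs very far apart: the weighting makes it overwhelmingly likely
# that the two seeds land in different blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, (30, 2)) for m in (0, 100)])
C = kmeanspp_init(X, 2, rng)
```

After seeding, Lloyd's iterations proceed exactly as before; only the initialization changes.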
Convergence and Complexity
Each iteration of K-Means runs in O(nkd) time, where n is the number of points, k the number of clusters, and d the dimensionality. In practice, K-Means converges in a small number of iterations (often under 20), making it highly efficient for large datasets. However, finding the globally optimal solution is NP-hard in general.
Limitations
- Spherical clusters only: K-Means assumes clusters are convex and roughly spherical. It struggles with elongated, ring-shaped, or irregularly shaped clusters.
- Equal-size bias: The algorithm tends to produce clusters of roughly equal spatial extent, even when the true clusters vary in size.
- Outlier sensitivity: A single distant outlier can significantly shift a centroid, distorting cluster assignments.
- Requires k in advance: Unlike density-based methods (e.g., DBSCAN), k must be chosen before running the algorithm.
Historical Context
The K-Means algorithm was independently discovered multiple times. Stuart Lloyd devised it in 1957 at Bell Labs for pulse-code modulation, though it wasn't published until 1982. Edward Forgy described a similar procedure in 1965, and James MacQueen coined the name "K-Means" in 1967. Hugo Steinhaus proposed a related method as early as 1956.
Despite its simplicity and age, K-Means remains one of the most frequently used clustering algorithms. Its influence extends beyond machine learning into signal processing (vector quantization), image compression, and even astronomy (classifying stellar spectra).
Real-world Applications
- Customer segmentation: Grouping customers by purchasing behavior to tailor marketing strategies.
- Image compression: Reducing the number of colors in an image by clustering pixel values and replacing each with its centroid.
- Document clustering: Grouping similar documents or articles for topic discovery in large text corpora.
- Anomaly detection: Points far from any centroid may be flagged as anomalies or outliers.
- Feature learning: K-Means can serve as a simple unsupervised feature extractor, mapping data points to cluster distances as input features for supervised models.
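The image-compression application can be sketched end to end on a synthetic image. This is a toy illustration under assumed data (a tiny two-color image), not a full compression pipeline:

```python
import numpy as np

# Palette reduction (vector quantization): cluster RGB pixel values with a
# few Lloyd iterations, then replace each pixel by its centroid's color.
rng = np.random.default_rng(0)
# Synthetic 32x32 "image": pixels drawn around two dominant colors.
img = np.where(rng.random((32, 32, 1)) < 0.5,
               rng.normal([200, 30, 30], 10, (32, 32, 3)),   # reddish pixels
               rng.normal([30, 30, 200], 10, (32, 32, 3)))   # bluish pixels
pixels = img.reshape(-1, 3)

k = 2  # target palette size
C = pixels[rng.choice(len(pixels), k, replace=False)]
for _ in range(20):
    labels = np.linalg.norm(pixels[:, None] - C[None], axis=2).argmin(1)
    C = np.array([pixels[labels == j].mean(0) if (labels == j).any() else C[j]
                  for j in range(k)])

# Every pixel is now one of k palette colors.
compressed = C[labels].reshape(img.shape)
```

Storing the k palette colors plus one small label per pixel, instead of three full channels per pixel, is exactly the vector-quantization view of K-Means mentioned in the historical notes.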
Related Concepts
- Gradient Descent — K-Means can be viewed as coordinate descent on the WCSS objective, alternating between cluster assignments and centroid updates
- Probability Distributions — Gaussian Mixture Models generalize K-Means by modeling clusters as probability distributions rather than hard assignments
- Linear Transformations — PCA (principal component analysis) is often used to reduce dimensionality before applying K-Means
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive K-Means Clustering module.
Try K-Means Clustering on Riano →