K-Means Clustering
Partitioning data into clusters by iteratively assigning points and updating centroids.
Concept Overview
K-Means is one of the most widely used unsupervised learning algorithms. Given a set of unlabeled data points, it partitions them into k groups (clusters) such that points within each cluster are as close as possible to the cluster's center (centroid). The algorithm is remarkably simple — alternating between two steps: assigning each point to its nearest centroid, then moving each centroid to the mean of its assigned points — yet it is powerful enough to reveal structure in complex datasets.
Mathematical Definition
Given a dataset of n points { x1, x2, …, xn } in ℝd, K-Means aims to find k centroids { μ1, μ2, …, μk } that minimize the within-cluster sum of squares (WCSS):
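The objective itself appears to have been dropped from the text; a standard statement of it, consistent with the surrounding description (J denotes the objective, and C_j the set of points assigned to centroid μ_j), is:

```latex
J \;=\; \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

Minimizing J means choosing both the assignments C_j and the centroids μ_j so that total squared distance from each point to its cluster center is as small as possible.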
Lloyd's Algorithm (Standard K-Means)
The standard K-Means algorithm, attributed to Stuart Lloyd (1957, published 1982), iterates between two steps until convergence:
1. Assignment step: assign each point to the cluster whose centroid is nearest (in Euclidean distance).
2. Update step: recompute each centroid as the mean of the points currently assigned to it.
Each iteration is guaranteed to decrease (or maintain) the objective function J, and since there are finitely many possible assignments, the algorithm always terminates. However, it may converge to a local minimum rather than the global optimum.
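The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration under assumed names (`lloyd_kmeans` and the test data are mine), not a production implementation:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize by picking k distinct data points at random (Forgy-style init).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (keeping the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = lloyd_kmeans(X, k=2)
```

Note that the early-exit check compares centroids, not assignments; either works, since assignments can only change while centroids still move.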
Key Concepts
Choosing k
The number of clusters k must be specified in advance — it is not learned from the data. Common methods for selecting k include:
- Elbow method: Plot WCSS against k and look for the "elbow" — the point where adding more clusters yields diminishing returns.
- Silhouette score: Measures how similar each point is to its own cluster vs. neighboring clusters. Higher is better.
- Gap statistic: Compares the WCSS to that of a uniformly distributed reference dataset.
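The elbow method from the list above can be sketched with a bare-bones Lloyd loop (the helper name `wcss_for_k` and the synthetic data are my own):

```python
import numpy as np

def wcss_for_k(X, k, n_iters=50, seed=0):
    """Run a basic Lloyd loop and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(1)
        C = np.array([X[labels == j].mean(0) if (labels == j).any() else C[j]
                      for j in range(k)])
    labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(1)
    return ((X - C[labels]) ** 2).sum()

# Three well-separated blobs, so the "true" k is 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (40, 2)) for m in (0, 4, 8)])
curve = {k: wcss_for_k(X, k) for k in range(1, 7)}
# WCSS always falls as k grows; the elbow is where the drop flattens —
# here the decrease from k=3 onward is small compared to k=1 → k=3.
```

In practice one would plot `curve` and pick k visually, and rerun each k with several initializations to avoid judging a bad local minimum.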
Initialization Sensitivity
K-Means is sensitive to initial centroid placement. Poor initialization can lead to suboptimal clusterings. The K-Means++ initialization (Arthur & Vassilvitskii, 2007) addresses this by choosing initial centroids that are spread far apart, leading to provably better results on average.
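The K-Means++ seeding rule is simple to state: pick the first centroid uniformly at random, then pick each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far. A minimal sketch (function name `kmeanspp_init` is mine):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """K-Means++ seeding: new centroids are sampled with probability
    proportional to squared distance from the nearest existing centroid."""
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Two blobs very far apart: the weighting makes it overwhelmingly likely
# that the two seeds land in different blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, (30, 2)) for m in (0, 100)])
C = kmeanspp_init(X, 2, rng)
```

After seeding, Lloyd's iterations proceed exactly as before; only the initialization changes.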
Convergence and Complexity
Each iteration of K-Means runs in O(nkd) time, where n is the number of points, k the number of clusters, and d the dimensionality. In practice, K-Means converges in a small number of iterations (often under 20), making it highly efficient for large datasets. However, finding the globally optimal solution is NP-hard in general.
Limitations
- Spherical clusters only: K-Means assumes clusters are convex and roughly spherical. It struggles with elongated, ring-shaped, or irregularly shaped clusters.
- Equal-size bias: The algorithm tends to produce clusters of roughly equal spatial extent, even when the true clusters vary in size.
- Outlier sensitivity: A single distant outlier can significantly shift a centroid, distorting cluster assignments.
- Requires k in advance: Unlike density-based methods (e.g., DBSCAN), k must be chosen before running the algorithm.
Historical Context
The K-Means algorithm was independently discovered multiple times. Stuart Lloyd devised it in 1957 at Bell Labs for pulse-code modulation, though it wasn't published until 1982. Edward Forgy described a similar procedure in 1965, and James MacQueen coined the name "K-Means" in 1967. Hugo Steinhaus proposed a related method as early as 1956.
Despite its simplicity and age, K-Means remains one of the most frequently used clustering algorithms. Its influence extends beyond machine learning into signal processing (vector quantization), image compression, and even astronomy (classifying stellar spectra).
Real-world Applications
- Customer segmentation: Grouping customers by purchasing behavior to tailor marketing strategies.
- Image compression: Reducing the number of colors in an image by clustering pixel values and replacing each with its centroid.
- Document clustering: Grouping similar documents or articles for topic discovery in large text corpora.
- Anomaly detection: Points far from any centroid may be flagged as anomalies or outliers.
- Feature learning: K-Means can serve as a simple unsupervised feature extractor, mapping data points to cluster distances as input features for supervised models.
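The image-compression application can be sketched end to end on a synthetic image. This is a toy illustration under assumed data (a tiny two-color image), not a full compression pipeline:

```python
import numpy as np

# Palette reduction (vector quantization): cluster RGB pixel values with a
# few Lloyd iterations, then replace each pixel by its centroid's color.
rng = np.random.default_rng(0)
# Synthetic 32x32 "image": pixels drawn around two dominant colors.
img = np.where(rng.random((32, 32, 1)) < 0.5,
               rng.normal([200, 30, 30], 10, (32, 32, 3)),   # reddish pixels
               rng.normal([30, 30, 200], 10, (32, 32, 3)))   # bluish pixels
pixels = img.reshape(-1, 3)

k = 2  # target palette size
C = pixels[rng.choice(len(pixels), k, replace=False)]
for _ in range(20):
    labels = np.linalg.norm(pixels[:, None] - C[None], axis=2).argmin(1)
    C = np.array([pixels[labels == j].mean(0) if (labels == j).any() else C[j]
                  for j in range(k)])

# Every pixel is now one of k palette colors.
compressed = C[labels].reshape(img.shape)
```

Storing the k palette colors plus one small label per pixel, instead of three full channels per pixel, is exactly the vector-quantization view of K-Means mentioned in the historical notes.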
Related Concepts
- Gradient Descent — K-Means can be viewed as coordinate descent on the WCSS objective, alternating between cluster assignments and centroid updates
- Probability Distributions — Gaussian Mixture Models generalize K-Means by modeling clusters as probability distributions rather than hard assignments
- Linear Transformations — PCA (principal component analysis) is often used to reduce dimensionality before applying K-Means
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive K-Means Clustering module.
Try K-Means Clustering on Riano →