UMAP Visualization

Reduce high-dimensional data into a low-dimensional map by preserving local topological structures.

Concept Overview

Uniform Manifold Approximation and Projection (UMAP) is a manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology, and the result is a practical, scalable algorithm that applies to real-world data. It is competitive with t-SNE in visualization quality, arguably preserves more of the global structure, and offers superior run-time performance.

Mathematical Definition

UMAP builds a fuzzy simplicial complex representation of the data and optimizes the low-dimensional representation to have a similar fuzzy topological structure.

Local fuzzy simplicial set memberships (high-dimensional):
p_{j|i} = exp(−(d(x_i, x_j) − ρ_i) / σ_i)
where ρ_i is the distance from x_i to its nearest neighbor and σ_i is a per-point scale calibrated from the neighborhood size.
Symmetrization (probabilistic t-conorm):
p_{ij} = p_{j|i} + p_{i|j} − p_{j|i} · p_{i|j}
Low-dimensional similarities:
q_{ij} = (1 + a‖y_i − y_j‖^{2b})^{−1}
Cross-entropy objective:
CE(P, Q) = Σ_{i≠j} [ p_{ij} log(p_{ij} / q_{ij}) + (1 − p_{ij}) log((1 − p_{ij}) / (1 − q_{ij})) ]
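As a sketch, these formulas can be evaluated directly with NumPy on toy data. Note the simplifications: σ_i is fixed at 1 (UMAP calibrates it per point) and the kernel uses a = b = 1, so this illustrates the objective rather than reproducing the library's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # toy high-dimensional data
Y = rng.normal(size=(20, 2))   # a candidate 2-D embedding

# Pairwise distances in the original space.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# rho_i = distance to the nearest neighbor; sigma_i is fixed at 1 here
# for simplicity (UMAP tunes each sigma_i by binary search).
rho = np.sort(D, axis=1)[:, 1][:, None]
sigma = 1.0

# Directional membership strengths p_{j|i}.
P_cond = np.exp(-np.maximum(D - rho, 0.0) / sigma)
np.fill_diagonal(P_cond, 0.0)

# Symmetrization via the probabilistic t-conorm.
P = P_cond + P_cond.T - P_cond * P_cond.T

# Low-dimensional similarities q_ij with a = b = 1 (a Cauchy kernel).
a, b = 1.0, 1.0
sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
Q = 1.0 / (1.0 + a * sq_dists ** b)

# Fuzzy-set cross entropy summed over all pairs i != j.
mask = ~np.eye(len(X), dtype=bool)
p = np.clip(P[mask], 1e-12, 1 - 1e-12)
q = np.clip(Q[mask], 1e-12, 1 - 1e-12)
ce = float(np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))))
print("cross entropy:", round(ce, 3))
```

Each pairwise term is the KL divergence between two Bernoulli distributions, so the sum is non-negative and reaches zero only when the embedding's similarities match the high-dimensional memberships exactly.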

Key Concepts

Number of Neighbors (n_neighbors)

This parameter controls how UMAP balances local versus global structure in the data. It does this by constraining the size of the local neighborhood UMAP will look at when attempting to learn the manifold structure of the data. Low values will push UMAP to focus more on local structure, while high values will push UMAP to look at the broader neighborhood, preserving more global structure.
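The neighborhood-size trade-off comes from how each σ_i is calibrated: UMAP binary-searches a per-point scale so that the n_neighbors nearest points carry a fixed total membership of log2(n_neighbors). A simplified sketch of that calibration (`smooth_knn_sigma` is a hypothetical helper mirroring the idea behind the library's internal `smooth_knn_dist`, not its public API):

```python
import numpy as np

def smooth_knn_sigma(dists, k, n_iter=64, tol=1e-5):
    """Binary-search a per-point scale sigma_i so that the k nearest
    neighbors carry a fixed total membership of log2(k). Illustrative
    stand-in for umap-learn's internal calibration, not its API.
    `dists` are the sorted distances from one point to its k neighbors."""
    target = np.log2(k)
    rho = dists[0]                       # distance to the nearest neighbor
    lo, hi, sigma = 0.0, np.inf, 1.0
    for _ in range(n_iter):
        val = np.sum(np.exp(-np.maximum(dists - rho, 0.0) / sigma))
        if abs(val - target) < tol:
            break
        if val > target:                 # kernel too wide: shrink sigma
            hi = sigma
            sigma = (lo + hi) / 2.0
        else:                            # kernel too narrow: grow sigma
            lo = sigma
            sigma = sigma * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
    return rho, sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
k = 15                                   # plays the role of n_neighbors
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
knn = np.sort(D, axis=1)[:, 1:k + 1]     # drop the zero self-distance
rho0, sigma0 = smooth_knn_sigma(knn[0], k)
print("rho:", round(float(rho0), 3), "sigma:", round(float(sigma0), 3))
```

A small k makes σ_i adapt to a tight local patch of the manifold; a large k forces the kernel to spread over a wider region, which is exactly the local-versus-global trade-off described above.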

Minimum Distance (min_dist)

The `min_dist` parameter sets the minimum distance that points are allowed to be apart in the low-dimensional representation. Low values of `min_dist` therefore produce clumpier embeddings, which can be useful if you are interested in clustering or in discovering tight, clean clusters; larger values push points apart and emphasize the broader topological structure.
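Under the hood, `min_dist` does not enter the loss directly; it determines the a and b constants of the low-dimensional kernel by curve fitting, similar in spirit to umap-learn's internal `find_ab_params`. A hedged sketch of that fit (`fit_ab` is an illustrative stand-in, not the library API):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist, spread=1.0):
    """Fit the kernel (1 + a*d^(2b))^-1 to a target curve that is flat
    (membership 1) inside min_dist and decays exponentially beyond it.
    Illustrative re-creation of the idea behind umap-learn's internal
    find_ab_params; not the library's public API."""
    d = np.linspace(0.0, spread * 3.0, 300)
    target = np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    kernel = lambda d, a, b: 1.0 / (1.0 + a * d ** (2.0 * b))
    (a, b), _ = curve_fit(kernel, d, target)
    return a, b

a_tight, b_tight = fit_ab(min_dist=0.0)    # clumpy embedding
a_loose, b_loose = fit_ab(min_dist=0.99)   # spread-out embedding
print("min_dist=0.00 -> a =", round(a_tight, 3))
print("min_dist=0.99 -> a =", round(a_loose, 3))
```

Smaller `min_dist` yields a larger a, i.e. a kernel that falls off sooner and lets points pack tightly; larger `min_dist` flattens the kernel near zero, holding points apart.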

Cross Entropy Loss vs KL Divergence

Unlike t-SNE which uses Kullback-Leibler (KL) divergence and primarily penalizes points that are close in high dimensions but far in low dimensions, UMAP uses Cross Entropy. Cross Entropy also penalizes points that are far in high dimensions but close in low dimensions, helping UMAP to better preserve the global structure of the data.
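A single misplaced pair makes the difference concrete. Using hypothetical values p ≈ 0 (far apart in high dimensions) and q ≈ 1 (close together in the embedding), the KL-style term is tiny while the extra cross-entropy term is large:

```python
import numpy as np

# A hypothetical pair: far apart in high dimensions (p close to 0)
# but placed close together in the embedding (q close to 1).
p, q = 0.01, 0.9

# KL-style attractive term: nearly zero when p is small, so this
# kind of misplacement is barely penalized.
kl_term = p * np.log(p / q)

# UMAP's cross entropy adds a repulsive term that stays large,
# actively pushing dissimilar points apart in the embedding.
ce_term = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
print("KL-style term:", round(float(kl_term), 4))
print("cross-entropy term:", round(float(ce_term), 4))
```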

Historical Context

UMAP was published in 2018 by Leland McInnes, John Healy, and James Melville. It was developed based on the foundations of Riemannian geometry and algebraic topology, specifically fuzzy simplicial sets.

Before UMAP, t-SNE was the dominant algorithm for non-linear dimensionality reduction and visualization. However, t-SNE suffered from slow execution times on large datasets and often failed to preserve the global distance structure between different clusters. UMAP addressed both of these issues, becoming the modern standard for visualizing complex, high-dimensional datasets.

Real-world Applications

  • Bioinformatics: Dominant method for clustering and visualizing single-cell RNA sequencing (scRNA-seq) data to identify cell types.
  • Machine Learning: Feature extraction and dimensionality reduction as a preprocessing step before applying clustering or classification algorithms.
  • Computer Vision: Exploring the latent space of deep generative models like GANs and VAEs to understand learned semantic representations.

Related Concepts

  • t-SNE Visualization — An earlier and very popular alternative for non-linear dimensionality reduction.
  • PCA Dimensionality Reduction — A linear technique that maximizes variance, often used to preprocess data before UMAP.
  • K-Means Clustering — Often applied to the low-dimensional embeddings produced by UMAP to discover clusters.

Experience it interactively

Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive UMAP Visualization module.

Try UMAP Visualization on Riano →
