PCA (Dimensionality Reduction)
Find the principal axes that maximize variance and project data into a lower-dimensional space.
Principal Component Analysis (PCA)
Concept Overview
Principal Component Analysis (PCA) is a foundational technique in machine learning and statistics used for dimensionality reduction. It works by transforming a dataset with many possibly correlated variables into a smaller set of uncorrelated variables, called principal components. These components are chosen such that the first principal component captures the maximum possible variance in the data, the second captures the maximum remaining variance orthogonal to the first, and so on. This allows complex datasets to be visualized, compressed, and modeled efficiently while retaining most of the important structural information.
Mathematical Definition
Given a centered dataset X (an n × d matrix where n is the number of samples and d is the number of features), PCA seeks a set of orthogonal unit vectors (weights) w that maximize the variance of the projected data Xw. The first component solves w₁ = argmax over unit vectors w of wᵀΣw, where Σ = XᵀX / (n − 1) is the sample covariance matrix.
The solution to this optimization problem relies on finding the eigenvectors and eigenvalues of the covariance matrix Σ. The eigenvectors correspond to the principal components (directions), and the eigenvalues represent the variance of the data along those respective directions.
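The eigendecomposition described above can be sketched in a few lines of NumPy. This is an illustrative sketch on synthetic data (all variable names are ours, not from the text): center the data, form the covariance matrix, take its eigenvectors sorted by eigenvalue, and project onto the top components.

```python
import numpy as np

# Synthetic dataset with correlated features (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

Xc = X - X.mean(axis=0)                  # 1. center the data
cov = (Xc.T @ Xc) / (len(Xc) - 1)        # 2. covariance matrix Sigma (d x d)
eigvals, eigvecs = np.linalg.eigh(cov)   # 3. eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]        # 4. sort by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
Z = Xc @ eigvecs[:, :k]                  # 5. project onto top-k components
```

The eigenvalues now give the variance captured along each component, so `eigvals[:k].sum() / eigvals.sum()` is the fraction of total variance retained by the projection.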
Key Concepts
Variance and Information
In the context of PCA, variance is synonymous with information. A direction with high variance means the data points are spread out along that axis, making it easier to distinguish between different observations. A direction with low variance contains less discriminative information and is often dominated by noise.
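A quick numerical check of this idea, using NumPy on synthetic data with deliberately different spreads per axis: the variance of the data projected onto a unit direction w is exactly wᵀΣw, so high-variance directions are precisely the ones PCA ranks first.

```python
import numpy as np

# Synthetic data: one axis with large spread, one with almost none
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.1])
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

w = np.array([1.0, 0.0, 0.0])            # unit direction along the wide axis
proj_var = np.var(Xc @ w, ddof=1)        # variance of the 1-D projection
assert np.isclose(proj_var, w @ cov @ w) # identity: Var(Xw) = w' Sigma w
```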
Orthogonality
Each principal component is constrained to be orthogonal (perpendicular) to all previous components. This ensures that each new component captures a completely independent (uncorrelated) dimension of the data's variance, preventing redundancy.
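This uncorrelatedness can be verified directly: projecting centered data onto all eigenvectors of its covariance matrix yields scores whose covariance matrix is diagonal (to numerical precision). A NumPy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)

_, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ eigvecs                     # project onto all components
score_cov = np.cov(scores, rowvar=False)  # covariance of the projected data

# Off-diagonal entries vanish: the components are uncorrelated
off_diag = score_cov - np.diag(np.diag(score_cov))
```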
Dimensionality Reduction vs. Feature Selection
Unlike feature selection techniques which simply discard original features, PCA creates entirely new features that are linear combinations of all original features. This makes it a feature extraction technique. While it reduces the number of dimensions, the resulting components often lack direct physical interpretation.
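The "linear combinations of all original features" point can be illustrated with scikit-learn, assuming it is available (the dataset is synthetic): each PCA score can be reproduced by hand as a weighted sum of every centered original feature, with the weights taken from the fitted components.

```python
import numpy as np
from sklearn.decomposition import PCA  # assumes scikit-learn is installed

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)

# Each new feature mixes ALL original features, unlike feature selection,
# which would keep a subset of columns of X unchanged:
Z_manual = (X - pca.mean_) @ pca.components_.T
assert np.allclose(Z, Z_manual)
```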
Historical Context
PCA was invented by Karl Pearson in 1901 and independently developed by Harold Hotelling in the 1930s. Pearson described it as finding "lines and planes of closest fit to systems of points in space." Hotelling coined the term "principal components" while working on educational psychology and test scoring, aiming to define a smaller set of fundamental psychological traits from numerous test scores.
Despite predating modern computing by decades, PCA remains one of the most widely used algorithms today, forming the basis for techniques like Eigenfaces in early computer vision and Latent Semantic Analysis in natural language processing.
Real-world Applications
- Data Visualization: Reducing high-dimensional datasets (like gene expression data or word embeddings) down to 2D or 3D for human inspection.
- Image Compression: Representing images using only the most significant principal components, dramatically reducing storage size while maintaining recognizable visual features.
- Noise Filtering: By discarding principal components associated with very small variances, random noise in sensor data can be effectively filtered out.
- Preprocessing for Machine Learning: Speeding up training times and reducing overfitting by feeding models a smaller, uncorrelated set of features instead of the raw data.
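The noise-filtering application can be sketched with NumPy: a synthetic rank-one signal is corrupted with noise, and keeping only the top principal component (here computed via SVD of the centered data) yields a reconstruction closer to the clean signal than the noisy input. This is an illustrative sketch on made-up data, not any particular library's denoising routine.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 100)
clean = np.outer(rng.normal(size=50), np.sin(2 * np.pi * t))  # rank-1 signal
noisy = clean + 0.1 * rng.normal(size=clean.shape)            # add noise

Xc = noisy - noisy.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 1                                                          # keep top component
denoised = (U[:, :k] * S[:k]) @ Vt[:k] + noisy.mean(axis=0)

err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
```

Discarding the low-variance components removes most of the noise because the noise is spread evenly across all directions, while the signal is concentrated in the first.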
Related Concepts
- Linear Transformations — PCA relies heavily on matrix operations, eigenvectors, and singular value decomposition (SVD).
- K-Means Clustering — PCA is often used as a preprocessing step before applying K-Means to alleviate the curse of dimensionality.
- Autoencoders — In neural networks, a linear autoencoder trained to minimize reconstruction error learns the same subspace as PCA, though not necessarily the orthonormal components themselves.
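The SVD connection mentioned above can be checked numerically: the right singular vectors of the centered data matrix equal the eigenvectors of its covariance matrix (up to sign), and the squared singular values divided by n − 1 equal the eigenvalues. A NumPy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 3)) @ rng.normal(size=(3, 3))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values / (n - 1) are the eigenvalues...
assert np.allclose(S**2 / (len(Xc) - 1), eigvals)
# ...and right singular vectors match the eigenvectors up to sign:
for i in range(3):
    assert np.allclose(Vt[i], eigvecs[:, i]) or np.allclose(Vt[i], -eigvecs[:, i])
```

In practice, libraries typically use the SVD route because it is numerically more stable than explicitly forming the covariance matrix.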
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive PCA (Dimensionality Reduction) module.
Try PCA (Dimensionality Reduction) on Riano →