Support Vector Machine
Visualize how a linear SVM learns a maximum-margin decision boundary using gradient descent on hinge loss.
Overview
A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for classification, regression, and outlier detection. At its core, an SVM works by mapping data points into a high-dimensional space and finding the optimal hyperplane that separates the different classes of data. The "optimal" hyperplane is the one that maximizes the margin, or distance, between the closest data points of the different classes. These closest points are known as "support vectors", hence the name of the algorithm.
Definition
Given a set of training data points (xᵢ, yᵢ), where xᵢ ∈ ℝⁿ and yᵢ ∈ {-1, 1} for i = 1, …, m training examples, a linear SVM finds the weight vector w and bias b that define the separating hyperplane w · x - b = 0. The objective is to maximize the margin, which is equivalent to minimizing the squared Euclidean norm of the weight vector, ‖w‖², subject to the constraints:

yᵢ(w · xᵢ - b) ≥ 1, for all i = 1, …, m
In real-world scenarios where the data is not perfectly linearly separable, we introduce a regularization parameter C and "slack variables" to allow some misclassifications. The standard soft-margin SVM objective using the hinge loss formulation becomes an unconstrained minimization problem:

minimize over w, b:  ½‖w‖² + C Σᵢ max(0, 1 - yᵢ(w · xᵢ - b))
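The soft-margin objective above can be evaluated directly in a few lines of NumPy. The sketch below is illustrative rather than part of the original module; the function name `svm_objective` and the tiny hand-made dataset are assumptions for the example.

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin SVM objective: 0.5*||w||^2 + C * sum of hinge losses."""
    margins = y * (X @ w - b)               # y_i (w . x_i - b) for each point
    hinge = np.maximum(0.0, 1.0 - margins)  # hinge loss, zero when margin >= 1
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Tiny 2D toy dataset: two positive and two negative points
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.array([0.5, 0.5])
b = 0.0
print(svm_objective(w, b, X, y, C=1.0))  # → 0.25 (all margins >= 1, so only the regularizer contributes)
```

Because every point here has margin at least 1, the hinge term vanishes and the objective reduces to ½‖w‖² = 0.25; moving a point inside the margin would add a C-weighted penalty.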
Key Concepts
Maximum Margin Hyperplane
The fundamental insight of SVMs is that the maximum-margin hyperplane generalizes better to new, unseen data than an arbitrary separating hyperplane. The decision boundary depends entirely on the subset of data points that touch or cross the margin boundary: the support vectors.
Soft Margin and Parameter C
The regularization parameter C controls the trade-off between achieving a large margin and minimizing the classification error on the training data. A high C heavily penalizes errors, resulting in a narrower margin but fewer training mistakes (potentially overfitting). A low C encourages a wider margin but permits more points to violate the margin constraints (potentially underfitting).
The Kernel Trick
While the standard formulation only finds linear decision boundaries, SVMs can efficiently perform non-linear classification via the "kernel trick". By implicitly mapping the input features into a much higher-dimensional space where a linear hyperplane can separate them, SVMs can solve complex problems without explicitly computing the costly high-dimensional coordinates. Common kernels include the Radial Basis Function (RBF), Polynomial, and Sigmoid kernels.
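As a concrete illustration of the kernel trick (a sketch, not part of the original text), the RBF kernel k(x, z) = exp(-γ‖x - z‖²) returns the inner product of the two points in an implicit feature space without ever computing high-dimensional coordinates:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(rbf_kernel(a, a))             # identical points → 1.0
print(rbf_kernel(a, b, gamma=0.5))  # exp(-0.5 * 2) = exp(-1) ≈ 0.368
```

A kernelized SVM replaces every dot product xᵢ · xⱼ in its decision function with k(xᵢ, xⱼ), which is what lets a linear method draw non-linear boundaries in the original input space.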
Historical Context
The foundational concepts of the Support Vector Machine were developed by Vladimir Vapnik and Alexey Chervonenkis in 1963. In 1992, Bernhard Boser, Isabelle Guyon, and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The current standard incarnation (the soft margin formulation) was proposed by Corinna Cortes and Vapnik in 1995. During the late 1990s and early 2000s, SVMs were widely considered one of the most effective and robust supervised learning methods available, dominating many classification tasks before the widespread resurgence of deep neural networks.
Applications
- Text and Hypertext Categorization: Used extensively for classifying documents and identifying spam emails, performing very well in high-dimensional sparse feature spaces.
- Image Classification: Employed in facial recognition systems and handwriting recognition before Convolutional Neural Networks (CNNs) became the standard.
- Bioinformatics: Widely applied for protein classification, cancer classification based on microarray data, and gene expression analysis.
Related Concepts
- Perceptron: A simpler linear classifier that merely finds any separating hyperplane, lacking the maximum-margin optimization of SVM.
- Logistic Regression: A probabilistic linear classifier that optimizes log-loss rather than the hinge loss used in SVMs.
- Gradient Descent: While SVMs are traditionally optimized via quadratic programming (e.g., the SMO algorithm), the soft margin formulation using hinge loss is commonly optimized using subgradient descent methods (e.g., Pegasos).
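The Pegasos-style subgradient descent mentioned above can be sketched in a few lines. This is an illustrative simplification under stated assumptions: the bias term is omitted (a common Pegasos simplification, so the hyperplane passes through the origin), and `train_linear_svm` is a hypothetical name.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=100):
    """Pegasos-style subgradient descent on the hinge loss (no bias term)."""
    m, n = X.shape
    w = np.zeros(n)
    t = 0
    for _ in range(epochs):
        for i in range(m):
            t += 1
            eta = 1.0 / (lam * t)          # Pegasos decaying step-size schedule
            if y[i] * (X[i] @ w) < 1:      # margin violated: hinge subgradient is active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                          # margin satisfied: only the regularizer shrinks w
                w = (1 - eta * lam) * w
    return w

# Tiny linearly separable toy set
X = np.array([[2.0, 2.0], [3.0, 2.5], [-2.0, -2.0], [-2.5, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_linear_svm(X, y)
preds = np.sign(X @ w)
print(preds)  # all four training points classified correctly
```

Each step takes a subgradient of one point's hinge loss plus the regularizer, which is why this approach scales to large datasets where solving the full quadratic program would be expensive.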
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Support Vector Machine module.
Try Support Vector Machine on Riano →