Gradient Descent

Finding the minimum of a cost function.

Gradient Descent Optimization

Concept Overview

Gradient descent is the workhorse optimization algorithm behind modern machine learning. The idea is elegantly simple: to minimize a function, take repeated steps in the direction where the function decreases most steeply. This iterative approach transforms the abstract problem of "finding the best parameters" into a concrete, mechanical procedure that scales to models with billions of parameters—the foundation upon which deep learning is built.

Mathematical Definition

Given a differentiable cost function f(θ), gradient descent updates the parameters iteratively:

θn+1 = θn − α · ∇f(θn)

where θn is the parameter vector at step n, α is the learning rate (a positive scalar controlling step size), and ∇f(θn) is the gradient of the cost function evaluated at the current parameters. The negative sign ensures we move downhill—toward lower values of f.

Example: The Interactive Function

The interactive visualization in this module uses a simple quadratic function to demonstrate gradient descent in one dimension:

f(x) = x² + 2x + 1 = (x + 1)²
f′(x) = 2x + 2
Update rule: xn+1 = xn − α · (2xn + 2)

The minimum is at x = −1, where f(−1) = 0. This is a convex function with a single global minimum, making it ideal for illustrating how gradient descent converges. By adjusting the learning rate in the interactive, you can observe how the step size affects convergence speed and stability.
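The update rule above can be sketched in a few lines of Python (a minimal illustration; the starting point x0 = 3.0 and the learning rate 0.1 are arbitrary choices):

```python
# Gradient descent on the example function f(x) = (x + 1)^2,
# whose unique minimum is at x = -1.
def f(x):
    return (x + 1) ** 2

def df(x):
    return 2 * x + 2  # f'(x)

def gradient_descent(x0, alpha=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= alpha * df(x)  # step against the gradient
    return x

x_min = gradient_descent(3.0)
print(x_min)  # converges toward -1
```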

The Learning Rate

The learning rate α is the single most important hyperparameter in gradient descent. Its value determines the character of convergence:

  • Too large: Steps overshoot the minimum. The parameters oscillate wildly and may diverge to infinity. For our quadratic example, divergence occurs when α > 1.
  • Too small: Convergence is guaranteed but painfully slow. Training deep neural networks with an excessively small learning rate can take prohibitively long.
  • Just right: Rapid, stable convergence to the minimum. For a quadratic function f(x) = ax², each update multiplies the error by (1 − 2aα), so the optimal learning rate is α = 1/(2a), which reaches the minimum in a single step.
Convergence condition for f(x) = (x+1)²: each update multiplies the error (x + 1) by the factor (1 − 2α), so
0 < α < 1 → converges
α = 1/2 → reaches minimum in one step
α = 1 → oscillates forever with constant amplitude
α > 1 → diverges (oscillates with growing amplitude)
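These regimes can be checked numerically (a small sketch; the starting point 3.0 and the sample learning rates are arbitrary):

```python
# One gradient-descent step on f(x) = (x + 1)^2; the update
# multiplies the error (x + 1) by the factor (1 - 2 * alpha).
def step(x, alpha):
    return x - alpha * (2 * x + 2)

for alpha in (0.1, 0.5, 1.1):
    x = 3.0
    for _ in range(20):
        x = step(x, alpha)
    print(f"alpha={alpha}: x = {x:.4g}")
# alpha=0.1 creeps toward -1, alpha=0.5 lands exactly on it,
# and alpha=1.1 blows up.
```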

Gradient Descent Variants

Batch Gradient Descent

Computes the gradient using the entire training dataset at each step. Produces the most accurate gradient estimate but is computationally expensive for large datasets.

θn+1 = θn − α · (1/N) · Σ (i=1 to N) ∇fᵢ(θn)

Stochastic Gradient Descent (SGD)

Updates parameters using the gradient from a single randomly selected training example. Much faster per iteration but introduces noise into the gradient estimate. This noise can actually help escape shallow local minima.

θn+1 = θn − α · ∇fᵢ(θn), where i is randomly chosen

Mini-batch Gradient Descent

The practical compromise: compute the gradient over a small random subset (mini-batch) of the data, typically 32–512 examples. This balances the accuracy of batch gradient descent with the speed of SGD and is the standard approach in deep learning.
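As a sketch of the mini-batch loop, here is least-squares linear regression on synthetic data (the dataset, the batch size of 64, and the learning rate are illustrative assumptions):

```python
import numpy as np

# Mini-batch gradient descent for least-squares linear regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # 1000 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
alpha, batch_size = 0.1, 64
for epoch in range(50):
    perm = rng.permutation(len(X))        # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / len(idx)) * Xb.T @ (Xb @ w - yb)  # MSE gradient
        w -= alpha * grad
print(w)  # close to true_w
```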

Advanced Optimizers

Momentum

Adds a "velocity" term that accumulates past gradients, allowing the optimizer to build speed in consistent directions and dampen oscillations in inconsistent ones—much like a ball rolling downhill gains momentum.

vn+1 = β · vn + ∇f(θn)
θn+1 = θn − α · vn+1
typically β = 0.9
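The two momentum equations translate directly into code (a sketch on the same quadratic example; α = 0.05 and the starting point are arbitrary):

```python
# Momentum on f(x) = (x + 1)^2, following the velocity form above.
def df(x):
    return 2 * x + 2

x, v = 3.0, 0.0
alpha, beta = 0.05, 0.9
for _ in range(500):
    v = beta * v + df(x)   # accumulate a running velocity
    x = x - alpha * v      # step along the (negated) velocity
print(x)  # spirals in toward -1
```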

Adam (Adaptive Moment Estimation)

Combines momentum with per-parameter adaptive learning rates. Adam maintains running averages of both the first moment (mean) and second moment (uncentered variance) of the gradient, adapting the learning rate for each parameter individually. It is the default optimizer for most deep learning tasks.

mn = β1 · mn-1 + (1−β1) · ∇f(θn) (first moment)
vn = β2 · vn-1 + (1−β2) · (∇f(θn))² (second moment)
θn+1 = θn − α · m̂n / (√v̂n + ε)
where m̂, v̂ are bias-corrected estimates
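The four Adam equations translate almost line-for-line (a sketch on the one-dimensional example; the defaults β1 = 0.9, β2 = 0.999, ε = 1e-8 follow the original paper, while α and the iteration count are arbitrary):

```python
import math

# Adam on f(x) = (x + 1)^2.
def df(x):
    return 2 * x + 2

x, m, v = 3.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for n in range(1, 2001):
    g = df(x)
    m = beta1 * m + (1 - beta1) * g          # first moment
    v = beta2 * v + (1 - beta2) * g * g      # second moment
    m_hat = m / (1 - beta1 ** n)             # bias correction
    v_hat = v / (1 - beta2 ** n)
    x -= alpha * m_hat / (math.sqrt(v_hat) + eps)
print(x)  # settles near the minimum at -1
```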

Challenges: Local Minima, Saddle Points, and Convexity

Gradient descent is guaranteed to find the global minimum only for convex functions (where any local minimum is the global minimum). The interactive's quadratic function is convex, but real-world loss landscapes in deep learning are highly non-convex.

  • Local minima: Points where the gradient is zero and the function curves upward, but the value is higher than the global minimum. Gradient descent can get trapped here.
  • Saddle points: Points where the gradient is zero but the function curves upward in some directions and downward in others. In high-dimensional spaces, saddle points are far more common than local minima and can significantly slow convergence.
  • Plateaus: Regions where the gradient is nearly zero, causing updates to stall. Momentum-based optimizers help traverse these flat regions.
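The saddle-point behavior is easy to see on f(x, y) = x² − y², a hypothetical two-dimensional example with a saddle at the origin:

```python
# f(x, y) = x^2 - y^2 has a saddle at (0, 0): the x-direction
# curves upward (attracting) and the y-direction downward (repelling).
x, y = 1.0, 1e-3      # tiny perturbation in the repelling direction
alpha = 0.1
for _ in range(100):
    gx, gy = 2 * x, -2 * y      # gradient of f
    x, y = x - alpha * gx, y - alpha * gy
print(x, y)  # x collapses toward 0 while y grows without bound
```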

Key Concepts

  • Gradient: The vector of partial derivatives pointing in the direction of steepest ascent. Its negative points downhill.
  • Convergence rate: For convex quadratic functions, gradient descent converges linearly (error decreases by a constant factor per step). The rate depends on the condition number of the Hessian matrix.
  • Learning rate schedules: Strategies that reduce the learning rate over time (e.g., step decay, cosine annealing, warm-up) to achieve fast initial progress followed by fine-grained convergence.
  • Backpropagation: The algorithm for efficiently computing gradients in neural networks via the chain rule, making gradient descent tractable for models with millions of parameters.
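As one concrete learning rate schedule, cosine annealing decays the rate smoothly from a maximum to a minimum over T steps (the endpoint values here are illustrative):

```python
import math

# Cosine annealing: alpha falls from alpha_max at step 0
# to alpha_min at step T along a half-cosine curve.
def cosine_lr(step, T, alpha_max=0.1, alpha_min=0.001):
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * step / T))

print(cosine_lr(0, 100))    # 0.1
print(cosine_lr(50, 100))   # midpoint, about 0.0505
print(cosine_lr(100, 100))  # 0.001
```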

Historical Context

The method of steepest descent was first proposed by Augustin-Louis Cauchy in 1847. For over a century, it remained a tool of applied mathematics and operations research. The algorithm gained transformative importance with the development of backpropagation for neural networks in the 1980s (popularized by Rumelhart, Hinton, and Williams in 1986), which made it possible to compute gradients efficiently in multi-layer networks.

The deep learning revolution of the 2010s drove rapid innovation in gradient descent variants. The Adam optimizer (Kingma and Ba, 2014) became the de facto standard. Modern large language models are trained using gradient descent on thousands of GPUs, optimizing loss functions over trillions of tokens—a scale Cauchy could not have imagined.

Real-world Applications

  • Neural networks and deep learning: Every neural network—from image classifiers to language models—is trained by minimizing a loss function via gradient descent.
  • Linear and logistic regression: The simplest machine learning models use gradient descent to fit parameters to data when closed-form solutions are impractical.
  • Computer vision: Convolutional neural networks trained with gradient descent power facial recognition, autonomous driving, and medical imaging.
  • Natural language processing: Transformers, the architecture behind modern language models, are optimized using variants of gradient descent.
  • Robotics and control: Policy gradient methods use gradient descent to optimize control policies in reinforcement learning.

Related Concepts

  • Taylor Series — gradient descent uses the first-order Taylor approximation to guide updates
  • Harmonic Oscillator — damped convergence to equilibrium as a physical analog of optimization
  • Logistic Map — iterative maps that can converge, oscillate, or diverge depending on parameters, paralleling learning rate effects
  • Sorting Algorithms — computational complexity analysis applies to both sorting and optimization algorithms

Experience it interactively

Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Gradient Descent module.

Try Gradient Descent on Riano →
