Machine Learning

Activation Functions

Visualize and compare common neural network activation functions.

Concept Overview

Activation functions are mathematical functions applied to the weighted input of each neuron in a neural network, determining whether (and how strongly) the neuron activates. Crucially, they introduce non-linearity into the network. Without non-linear activation functions, a network of any depth collapses into a single linear transformation: it would behave like a single-layer perceptron, capable only of solving linearly separable problems.

Mathematical Definition

A single artificial neuron calculates a weighted sum of its inputs, adds a bias term, and then applies an activation function f(z):

z = Σᵢ₌₁ⁿ (wᵢ · xᵢ) + b
y = f(z)

where w represents weights, x represents inputs, b is the bias, and y is the resulting output.
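The computation above can be sketched in a few lines of NumPy (the function name `neuron_forward` is illustrative, not from any particular library):

```python
import numpy as np

def neuron_forward(x, w, b, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = np.dot(w, x) + b      # z = sum_i w_i * x_i + b
    return activation(z)      # y = f(z)

# With the identity activation: 1.0*0.5 + 2.0*(-0.25) + 0.1 = 0.1
y = neuron_forward(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1, lambda z: z)
```

Swapping in any of the functions below for the `activation` argument changes how the neuron responds to its weighted input.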

Key Concepts

Sigmoid Function

Maps inputs into the range (0, 1). Historically popular because its output can be interpreted as a probability.

f(x) = 1 / (1 + e^(-x))

Issue: Suffers heavily from the "vanishing gradient" problem. For inputs of large magnitude (positive or negative), the gradient approaches zero, so weight updates in earlier layers become vanishingly small during backpropagation and learning effectively stalls.
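The saturation is easy to verify numerically. A minimal sketch, using the identity f'(x) = f(x)·(1 − f(x)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # derivative: f(x) * (1 - f(x))

g0 = sigmoid_grad(0.0)    # maximum gradient, 0.25
g10 = sigmoid_grad(10.0)  # ~4.5e-5: effectively vanished
```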

Hyperbolic Tangent (Tanh)

Similar to Sigmoid but centered around zero, mapping inputs to (-1, 1).

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Zero-centering makes optimization easier than with Sigmoid, but it still suffers from vanishing gradients at extreme input values.
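The same numerical check for Tanh, whose derivative is 1 − tanh²(x), shows a steeper peak than Sigmoid but the same saturation at the extremes:

```python
import numpy as np

def tanh_grad(x):
    t = np.tanh(x)
    return 1.0 - t * t  # derivative: 1 - tanh(x)^2

g0 = tanh_grad(0.0)  # 1.0 -- four times Sigmoid's peak of 0.25
g5 = tanh_grad(5.0)  # ~1.8e-4 -- the gradient still vanishes
```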

Rectified Linear Unit (ReLU)

The most widely used activation function in modern deep learning. It simply outputs the input if positive, otherwise zero.

f(x) = max(0, x)

Advantage: Does not saturate for positive inputs, mitigating the vanishing gradient problem. It is also highly computationally efficient.
Issue: The "dying ReLU" problem: a neuron whose pre-activation is consistently negative outputs zero, receives zero gradient, and may never learn again.
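Both the definition and the failure mode fit in a short sketch: wherever the input is negative, the gradient is exactly zero, so such a neuron receives no learning signal.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)  # zero gradient for x <= 0

# A neuron whose pre-activations are always negative gets zero gradient
# on every example -- the "dying ReLU" failure mode.
z = np.array([-3.0, -1.5, -0.2])
dead_grads = relu_grad(z)  # all zeros
```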

Leaky ReLU

A variant of ReLU that allows a small, non-zero gradient when the unit is not active.

f(x) = max(α·x, x)

Where α is a small constant (e.g., 0.01). This helps prevent the dying ReLU problem.
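A minimal sketch with α = 0.01, the common default:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # For x >= 0 this equals x; for x < 0 it equals alpha * x.
    return np.maximum(alpha * x, x)

y_pos = leaky_relu(3.0)    # 3.0, same as ReLU
y_neg = leaky_relu(-10.0)  # -0.1: small but non-zero, so the gradient survives
```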

Exponential Linear Unit (ELU)

Designed to combine the benefits of ReLU and Leaky ReLU: it keeps the identity for positive inputs while smoothly saturating toward -α for negative inputs.

f(x) = x if x > 0, else α(e^x - 1)
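One possible NumPy sketch, with α = 1.0 (a common choice); note the negative branch saturates smoothly at −α rather than cutting to zero:

```python
import numpy as np

def elu(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

out = elu(np.array([-5.0, 0.0, 2.0]))  # approx [-0.993, 0.0, 2.0]
```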

Swish

Discovered by researchers at Google, Swish is a self-gated activation function that often outperforms ReLU in deeper networks.

f(x) = x · σ(βx) = x / (1 + e^(-βx))

It is non-monotonic: rather than always increasing or staying flat, it dips slightly below zero for moderately negative inputs before rising.
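The dip is visible numerically. A sketch with β = 1, the default in most implementations:

```python
import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))  # x * sigmoid(beta * x)

y = swish(-1.0)  # ~-0.269: below zero, unlike ReLU's hard cutoff
```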

Historical Context

Early neural networks (perceptrons) used step functions, which output exactly 0 or 1. Because step functions are non-differentiable at the threshold and have zero gradient everywhere else, they made gradient-based learning (like backpropagation) impossible.

The smooth, differentiable Sigmoid and Tanh functions replaced step functions in the 1980s. However, as networks grew deeper in the late 1990s and 2000s, the vanishing gradient problem became critical. The popularization of ReLU in 2010 by Nair and Hinton revolutionized the field, allowing for the successful training of deep convolutional neural networks (like AlexNet in 2012) and kicking off the modern deep learning era.

Real-world Applications

  • Sigmoid: Output layers of binary classification models.
  • Softmax (generalized Sigmoid): Output layers of multi-class classification models.
  • Tanh: Often used in Recurrent Neural Networks (RNNs) and LSTMs.
  • ReLU: Hidden layers of almost all modern Convolutional Neural Networks (CNNs) and Multi-Layer Perceptrons (MLPs).
  • Swish/GELU: Highly popular in state-of-the-art Transformer models like BERT and GPT.
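For completeness, a minimal sketch of the Softmax mentioned above, which turns a vector of scores into a probability distribution (subtracting the maximum is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))  # non-negative, sums to 1.0
```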

Related Concepts

  • Backpropagation — depends heavily on the derivative of activation functions
  • Gradient Descent — optimization impacted by vanishing/exploding gradients
  • Perceptron — the simplest neural unit
  • Neural Network Learning — layers combining multiple activations

Experience it interactively

Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Activation Functions module.

Try Activation Functions on Riano →
