LSTM & Gated Units
Visualize the internal workings of an LSTM cell and its gating mechanisms.
Concept Overview
Long Short-Term Memory (LSTM) is a highly successful and influential recurrent neural network (RNN) architecture designed specifically to overcome the vanishing gradient problem, enabling the network to learn long-term dependencies. Unlike a standard RNN, whose repeating module is a single, simple neural network layer, an LSTM cell contains an intricate system of interacting components: a core cell state and several regulatory gates. These gates learn to control what information is kept, discarded, or added to the cell state as a sequence is processed step by step.
Mathematical Definition
The core of the LSTM is the cell state (ct), which acts as a conveyor belt passing information down the sequence chain. The LSTM can add or remove information to this cell state, carefully regulated by three gates: the Forget Gate, Input Gate, and Output Gate.
1. Forget Gate
Decides what information from the previous cell state (ct-1) should be discarded or kept. It looks at the previous hidden state (ht-1) and the current input (xt) and outputs values between 0 and 1 using a sigmoid function (σ):

ft = σ(Wf · [ht-1, xt] + bf)
2. Input Gate & Cell Candidate
Decides what new information should be stored in the cell state. The input gate (it) decides which values will be updated, while a tanh layer creates a vector of new candidate values (c~t):

it = σ(Wi · [ht-1, xt] + bi)
c~t = tanh(Wc · [ht-1, xt] + bc)
3. Cell State Update
The old cell state (ct-1) is updated to the new cell state (ct) by forgetting the amounts decided by the forget gate and adding the new candidate values scaled by the input gate. Here, ⊙ denotes the Hadamard (element-wise) product:

ct = ft ⊙ ct-1 + it ⊙ c~t
4. Output Gate & Hidden State Update
Decides what the next hidden state (ht) should be. The hidden state is essentially a filtered version of the cell state: the output gate (ot) selects which parts of the cell state to emit, and the result is multiplied by the tanh of the cell state:

ot = σ(Wo · [ht-1, xt] + bo)
ht = ot ⊙ tanh(ct)
Where:
- ht: Hidden state (output) at time step t.
- ct: Cell state at time step t.
- xt: Input vector at time step t.
- ft, it, ot: Forget, Input, and Output gate activation vectors respectively.
- c~t: Cell candidate vector.
- Wf, Wi, Wc, Wo and bf, bi, bc, bo: Learnable weight matrices and bias vectors for each gate and the cell candidate.
- σ: Sigmoid activation function mapping values to (0, 1).
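The four update steps above can be sketched as a single forward pass in NumPy. The weight shapes and the concatenated-input convention ([ht-1, xt]) follow the equations directly; the function and variable names (lstm_step, H, X, etc.) are illustrative choices, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step, following the four gate equations above."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # cell candidate
    c_t = f_t * c_prev + i_t * c_tilde     # cell state update (⊙ is elementwise *)
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # hidden state update
    return h_t, c_t

# Toy usage: hidden size 4, input size 3, random weights
rng = np.random.default_rng(0)
H, X = 4, 3
Ws = [rng.standard_normal((H, H + X)) * 0.1 for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, c = np.zeros(H), np.zeros(H)
for t in range(5):                         # process a 5-step sequence
    h, c = lstm_step(rng.standard_normal(X), h, c, *Ws, *bs)
print(h.shape, c.shape)
```

Note that the hidden state is always bounded in (-1, 1) because it is a sigmoid-gated tanh, while the cell state can grow beyond that range across time steps — this is exactly the "conveyor belt" carrying information forward.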
Key Concepts
- Cell State: Often described as a "conveyor belt" running straight down the entire chain, with only minor linear interactions. It's very easy for information to just flow along it unchanged.
- Gates: Structures composed of a sigmoid neural net layer and a pointwise multiplication operation. They allow information to be optionally let through. The sigmoid outputs values between 0 and 1, where 0 means "let nothing through" and 1 means "let everything through".
- Vanishing Gradient Solution: Because the cell state is updated largely additively, with the forget gate controlling how much of the past is multiplicatively scaled, gradients can flow backwards through time much more easily without exponentially decaying, allowing the network to learn long-range dependencies.
- Gated Recurrent Unit (GRU): A widely used variant of the LSTM introduced by Kyunghyun Cho et al. in 2014. GRUs combine the forget and input gates into a single "update gate" and merge the cell state and hidden state, resulting in a simpler model with fewer parameters that often performs similarly to LSTMs.
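For comparison, a GRU step can be sketched the same way. It follows the standard Cho et al. (2014) formulation — an update gate zt, a reset gate rt, and a single hidden state with no separate cell state — so it needs three weight matrices instead of four; the names below are illustrative, not any particular library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step: two gates, hidden state doubles as the memory."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z_in + b_z)   # update gate (merges forget + input roles)
    r_t = sigmoid(W_r @ z_in + b_r)   # reset gate
    # Candidate state, computed from the reset-scaled previous state
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    # Interpolate between the old state and the candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Toy usage: hidden size 4, input size 3 -> only 3 weight matrices
rng = np.random.default_rng(1)
H, X = 4, 3
Ws = [rng.standard_normal((H, H + X)) * 0.1 for _ in range(3)]
bs = [np.zeros(H) for _ in range(3)]
h = np.zeros(H)
for t in range(5):
    h = gru_step(rng.standard_normal(X), h, *Ws, *bs)
print(h.shape)
```

The final line — a convex combination of the old state and the candidate — is the GRU's counterpart to the LSTM's cell state update, which is one reason the two architectures often perform similarly in practice.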
Historical Context
The LSTM architecture was introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997. It was a monumental breakthrough in sequence modeling, successfully addressing the crippling vanishing gradient problem that plagued early Recurrent Neural Networks (RNNs) when trained with backpropagation through time.
For over two decades, LSTMs were the dominant approach for complex sequence tasks, achieving state-of-the-art results in handwriting recognition, speech recognition, machine translation, and text generation. While largely superseded by Transformer architectures in recent years for large-scale Natural Language Processing tasks, LSTMs remain highly relevant and efficient for smaller-scale sequential tasks, time series forecasting, and scenarios where continuous streaming processing without full sequence context is required.
Real-world Applications
- Time Series Forecasting: Predicting stock prices, weather, or energy consumption based on historical sequential data.
- Speech Recognition & Synthesis: Historically used extensively in voice assistants to transcribe audio streams into text.
- Machine Translation: Forming the basis of the first highly successful neural machine translation systems (seq2seq models).
- Predictive Maintenance: Analyzing sequential sensor data from machinery to predict when components might fail.
- Healthcare: Modeling patient trajectories using sequential electronic health records.
Related Concepts
- Recurrent Neural Network — The fundamental architecture LSTMs improve upon.
- Attention Mechanism — A technique often combined with LSTMs to further improve long-range dependencies, eventually leading to Transformers.
- Transformer Architecture — The modern successor to LSTMs for large-scale sequence processing.
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive LSTM & Gated Units module.
Try LSTM & Gated Units on Riano →