LSTM & Gated Units
Visualize the internal workings of an LSTM cell and its gating mechanisms.
Concept Overview
Long Short-Term Memory (LSTM) is a highly successful and influential recurrent neural network (RNN) architecture designed specifically to overcome the vanishing gradient problem, enabling the network to learn long-term dependencies. Unlike a standard RNN, whose repeating module is a single, simple neural network layer, an LSTM cell contains an intricate system of interacting components: a core cell state and several regulatory gates. These gates learn to control what information is kept, discarded, or added to the cell state as a sequence is processed step by step.
Mathematical Definition
The core of the LSTM is the cell state (ct), which acts as a conveyor belt passing information down the sequence chain. The LSTM can add or remove information to this cell state, carefully regulated by three gates: the Forget Gate, Input Gate, and Output Gate.
1. Forget Gate
Decides what information from the previous cell state (ct-1) should be discarded or kept. It looks at the previous hidden state (ht-1) and the current input (xt) and outputs values between 0 and 1 using a sigmoid function (σ):

ft = σ(Wf · [ht-1, xt] + bf)
2. Input Gate & Cell Candidate
Decides what new information should be stored in the cell state. The input gate (it) decides which values will be updated, while a tanh layer creates a vector of new candidate values (c~t):

it = σ(Wi · [ht-1, xt] + bi)
c~t = tanh(Wc · [ht-1, xt] + bc)
3. Cell State Update
The old cell state (ct-1) is updated to the new cell state (ct) by forgetting the amounts decided by the forget gate and adding the new candidate values scaled by the input gate. Here, ⊙ denotes the Hadamard (element-wise) product:

ct = ft ⊙ ct-1 + it ⊙ c~t
4. Output Gate & Hidden State Update
Decides what the next hidden state (ht) should be. The hidden state is essentially a filtered version of the cell state: the output gate (ot) selects which parts of the cell state to emit, and the result is multiplied by the tanh of the cell state:

ot = σ(Wo · [ht-1, xt] + bo)
ht = ot ⊙ tanh(ct)
Where:
- ht: Hidden state (output) at time step t.
- ct: Cell state at time step t.
- xt: Input vector at time step t.
- ft, it, ot: Forget, Input, and Output gate activation vectors respectively.
- c~t: Cell candidate vector.
- Wf, Wi, Wc, Wo and bf, bi, bc, bo: Learnable weight matrices and bias vectors for each gate and the cell candidate.
- σ: Sigmoid activation function mapping values to (0, 1).
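The four update steps above can be sketched as a single forward pass in NumPy. The weight shapes and the concatenated-input convention ([ht-1, xt]) follow the equations directly; the function and variable names (lstm_step, H, X, etc.) are illustrative choices, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step, following the four gate equations above."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # cell candidate
    c_t = f_t * c_prev + i_t * c_tilde     # cell state update (⊙ is elementwise *)
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # hidden state update
    return h_t, c_t

# Toy usage: hidden size 4, input size 3, random weights
rng = np.random.default_rng(0)
H, X = 4, 3
Ws = [rng.standard_normal((H, H + X)) * 0.1 for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, c = np.zeros(H), np.zeros(H)
for t in range(5):                         # process a 5-step sequence
    h, c = lstm_step(rng.standard_normal(X), h, c, *Ws, *bs)
print(h.shape, c.shape)
```

Note that the hidden state is always bounded in (-1, 1) because it is a sigmoid-gated tanh, while the cell state can grow beyond that range across time steps — this is exactly the "conveyor belt" carrying information forward.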
Key Concepts
- Cell State: Often described as a "conveyor belt" running straight down the entire chain, with only minor linear interactions. It's very easy for information to just flow along it unchanged.
- Gates: Structures composed of a sigmoid neural net layer and a pointwise multiplication operation. They allow information to be optionally let through. The sigmoid outputs values between 0 and 1, where 0 means "let nothing through" and 1 means "let everything through".
- Vanishing Gradient Solution: Because the cell state is updated largely additively, with the forget gate controlling how much of the past is multiplicatively scaled, gradients can flow backwards through time much more easily without exponentially decaying, allowing the network to learn long-range dependencies.
- Gated Recurrent Unit (GRU): A widely used variant of the LSTM introduced by Kyunghyun Cho et al. in 2014. GRUs combine the forget and input gates into a single "update gate" and merge the cell state and hidden state, resulting in a simpler model with fewer parameters that often performs similarly to LSTMs.
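For comparison, a GRU step can be sketched the same way. It follows the standard Cho et al. (2014) formulation — an update gate zt, a reset gate rt, and a single hidden state with no separate cell state — so it needs three weight matrices instead of four; the names below are illustrative, not any particular library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step: two gates, hidden state doubles as the memory."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z_in + b_z)   # update gate (merges forget + input roles)
    r_t = sigmoid(W_r @ z_in + b_r)   # reset gate
    # Candidate state, computed from the reset-scaled previous state
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    # Interpolate between the old state and the candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Toy usage: hidden size 4, input size 3 -> only 3 weight matrices
rng = np.random.default_rng(1)
H, X = 4, 3
Ws = [rng.standard_normal((H, H + X)) * 0.1 for _ in range(3)]
bs = [np.zeros(H) for _ in range(3)]
h = np.zeros(H)
for t in range(5):
    h = gru_step(rng.standard_normal(X), h, *Ws, *bs)
print(h.shape)
```

The final line — a convex combination of the old state and the candidate — is the GRU's counterpart to the LSTM's cell state update, which is one reason the two architectures often perform similarly in practice.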
Historical Context
The LSTM architecture was introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997. It was a monumental breakthrough in sequence modeling, successfully addressing the crippling vanishing gradient problem that plagued early Recurrent Neural Networks (RNNs) when trained with backpropagation through time.
For over two decades, LSTMs were the dominant approach for complex sequence tasks, achieving state-of-the-art results in handwriting recognition, speech recognition, machine translation, and text generation. While largely superseded by Transformer architectures in recent years for large-scale Natural Language Processing tasks, LSTMs remain highly relevant and efficient for smaller-scale sequential tasks, time series forecasting, and scenarios where continuous streaming processing without full sequence context is required.
Real-world Applications
- Time Series Forecasting: Predicting stock prices, weather, or energy consumption based on historical sequential data.
- Speech Recognition & Synthesis: Historically used extensively in voice assistants to transcribe audio streams into text.
- Machine Translation: Forming the basis of the first highly successful neural machine translation systems (seq2seq models).
- Predictive Maintenance: Analyzing sequential sensor data from machinery to predict when components might fail.
- Healthcare: Modeling patient trajectories using sequential electronic health records.
Related Concepts
- Recurrent Neural Network — The fundamental architecture LSTMs improve upon.
- Attention Mechanism — A technique often combined with LSTMs to further improve long-range dependencies, eventually leading to Transformers.
- Transformer Architecture — The modern successor to LSTMs for large-scale sequence processing.
Experience it interactively
Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive LSTM & Gated Units module.
Try LSTM & Gated Units on Riano →