
Transformer Architecture

Visualize the core components and connectivity of a Transformer encoder block.

Transformer Architecture Theory

The Transformer architecture, introduced in the seminal paper "Attention Is All You Need" (Vaswani et al., 2017), revolutionized natural language processing by abandoning recurrence in favor of a purely attention-based mechanism.

Core Equation: Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V
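The equation above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation; the softmax helper and the tensor shapes are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., seq_q, seq_k)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ V, weights
```

Dividing by √dₖ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with vanishingly small gradients.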

Key Components

1. Multi-Head Attention

Instead of performing a single attention function, the Transformer maps the queries, keys, and values h times with different, learned linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions.
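The split-attend-concatenate pattern can be sketched as follows. This is a simplified single-example version with externally supplied projection matrices (Wq, Wk, Wv, Wo are hypothetical names for this sketch, not a library API):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Project x into num_heads subspaces, attend in each, then concatenate."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads

    def split_heads(t):
        # (seq, d_model) -> (num_heads, seq, d_k)
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq, seq)
    heads = softmax(scores) @ V                        # (heads, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                 # final output projection
```

Because each head works in a d_model / h dimensional subspace, the total compute is similar to one full-width attention, but each head can specialize in a different relational pattern.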

2. Feed-Forward Networks

Each layer contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂.
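The position-wise FFN is a one-liner in NumPy. A minimal sketch, assuming x has shape (seq_len, d_model) and the hidden width d_ff is larger than d_model (the paper uses d_ff = 4 × d_model):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    # The same weights are applied independently at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```

Because the weights are shared across positions, this is equivalent to two 1×1 convolutions over the sequence.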

3. Residual Connections & Normalization

Each sub-layer (attention and feed-forward) has a residual connection around it, followed by layer normalization. The output of each sub-layer is LayerNorm(x + Sublayer(x)). This helps mitigate the vanishing gradient problem in deep networks.
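The LayerNorm(x + Sublayer(x)) pattern can be sketched as below. This follows the post-norm ordering of the original paper; the function names are illustrative, not a library API:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance,
    # then apply a learned per-feature scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def sublayer_with_residual(x, sublayer, gamma, beta):
    # Post-norm residual block: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x), gamma, beta)
```

The residual path gives gradients a direct route through the stack, which is what lets Transformers train stably at dozens of layers.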

Experience it interactively

Adjust parameters, observe in real time, and build deep intuition with Riano’s interactive Transformer Architecture module.

Try Transformer Architecture on Riano →
