Definition
The transformer is a deep learning architecture introduced by Vaswani et al. (2017) that forms the foundation of modern large language models. It replaces the sequential processing of recurrent networks with parallel processing built on a self-attention mechanism, enabling efficient training on massive datasets.
Core Components
Self-Attention Mechanism
The self-attention mechanism allows each token (word or subword unit) to directly attend to all other tokens in the sequence:
- Query, Key, Value matrices: Each token is projected into query, key, and value representations
- Attention weights: Computed by comparing queries to keys, determining which tokens to focus on
- Weighted sum: Values are combined using the attention weights to produce each token's output representation
- Parallelizable: All tokens can be processed in parallel, unlike sequential RNNs
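The steps above can be sketched in NumPy. This is a minimal illustration of scaled dot-product attention; the random weight matrices stand in for learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key similarity, (seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax: each row sums to 1
    return weights @ V, weights                            # weighted sum of values

# Toy example: 3 tokens with d_model = 4.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))  # placeholder projections
out, w = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
```

Note that every token's output is computed at once from the full sequence, which is exactly what makes the step parallelizable.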
Multi-Head Attention
- Multiple attention “heads” process the input simultaneously
- Each head learns different relationships between tokens
- Outputs are concatenated and projected
- Enables learning diverse types of linguistic relationships
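A rough sketch of the split-attend-concatenate-project pattern, again with random placeholder weights rather than learned ones:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Run attention once per head at reduced width, concatenate, then project."""
    seq, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head gets its own projections, so it can learn its own relationships.
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)
    Wo = rng.standard_normal((d_model, d_model))   # final output projection
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                    # 5 tokens, d_model = 8
out = multi_head_attention(X, num_heads=2, rng=rng)
```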
Feed-Forward Networks
- Position-wise feed-forward networks applied to each token
- Non-linear transformations increase model expressiveness
- Applied uniformly across all positions
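In code, "position-wise" means the same two-layer network is applied independently to each token's vector. A minimal sketch (the 4x hidden width is a common convention, not a requirement):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied identically at every position."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                              # hidden layer typically ~4x d_model
X = rng.standard_normal((5, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
out = position_wise_ffn(X, W1, b1, W2, b2)
```

Because the same weights are used at every position, feeding in a single token yields the same result as that token's row in the batched output.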
Positional Encoding
- Tokens need information about their position in the sequence
- Sinusoidal or learned positional encodings added to embeddings
- Enables the model to understand word order without sequential processing
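The sinusoidal variant from the original paper can be written directly from its formula, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Sinusoidal positional encodings; even dims get sin, odd dims get cos."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even feature indices
    angles = pos / np.power(10000.0, i / d_model)  # different wavelength per dimension
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=16)   # added elementwise to token embeddings
```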
Layer Normalization and Residual Connections
- Residual connections allow gradients to flow efficiently through deep networks
- Layer normalization stabilizes training
- Enables training of very deep models
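The combination is conventionally written LayerNorm(x + Sublayer(x)) (the post-norm arrangement of the original paper; many later models move the norm before the sublayer). A minimal sketch:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer_with_residual(x, sublayer):
    """Post-norm residual block: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
# Any sublayer works here; a random linear map stands in for attention or an FFN.
out = sublayer_with_residual(x, lambda t: t @ rng.standard_normal((8, 8)))
```

The identity path `x + ...` is what lets gradients bypass the sublayer, which is why very deep stacks remain trainable.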
Architecture Layers
Encoder
- Stack of transformer layers processing input
- Each layer: self-attention → feed-forward
- Output: contextualized representation of input
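Putting the pieces together, one encoder layer is just the two sublayers in sequence, and the stack feeds each layer's output into the next. A compact sketch (omitting biases and multiple heads for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def encoder_layer(X, params):
    """Self-attention sublayer, then feed-forward sublayer, each with residual + norm."""
    Wq, Wk, Wv, W1, W2 = params
    d_k = Wq.shape[1]
    attn = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d_k)) @ (X @ Wv)
    X = layer_norm(X + attn)
    ffn = np.maximum(0, X @ W1) @ W2
    return layer_norm(X + ffn)

rng = np.random.default_rng(0)
d = 8
params = tuple(rng.standard_normal(s) for s in [(d, d)] * 3 + [(d, 32), (32, d)])
X = rng.standard_normal((5, d))
for _ in range(2):          # stacking: each layer's output is the next layer's input
    X = encoder_layer(X, params)
```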
Decoder
- Stack of transformer layers
- Attends to both previously generated tokens (masked self-attention) and the encoder output (cross-attention)
- Generates output tokens one at a time (or in parallel during training)
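What makes parallel training compatible with one-at-a-time generation is the causal mask: positions above the diagonal of the attention score matrix are set to negative infinity before the softmax, so token t can only attend to tokens at or before t. A sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions so each token attends only to itself and earlier tokens."""
    seq = scores.shape[0]
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)   # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)               # exp(-inf) = 0 after softmax
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
w = causal_attention_weights(rng.standard_normal((4, 4)))
```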
Additional Components
- Cross-attention: allows the decoder to attend to encoder outputs (absent in decoder-only models)
- Position-wise feed-forward networks in every layer
- Output projection layer mapping final representations to vocabulary logits
Key Advantages Over Previous Architectures
RNNs and LSTMs
- Parallelization: Transformers process all positions in parallel; RNNs are sequential
- Long-range dependencies: Self-attention can directly connect distant tokens; RNNs suffer from the vanishing gradient problem over long sequences
- Training speed: Transformers are dramatically faster to train on GPUs/TPUs
CNNs
- Flexible context: Transformers can attend to arbitrary positions; CNNs have fixed receptive fields
- Interpretability: Attention weights can be visualized; CNN features are less interpretable
Training and Scaling
Large-Scale Training
- Transformer architecture scales efficiently to billions of parameters
- Trained on massive text corpora (hundreds of billions to trillions of tokens)
- Foundation for large language models
Scaling Laws
- Performance improves predictably as model size, data, and compute increase
- Suggests large models can achieve strong performance with sufficient resources
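The "predictable" part is that empirical fits take a power-law form, roughly L(N) = (N_c / N)^alpha for parameter count N. The constants below are illustrative placeholders, not fitted values:

```python
def predicted_loss(n_params, n_c=1e14, alpha=0.08):
    """Power-law scaling sketch: L(N) = (N_c / N)^alpha.
    n_c and alpha are made-up constants for illustration; real values
    come from empirical fits on actual training runs."""
    return (n_c / n_params) ** alpha
```

The key property is monotonic, smooth improvement: every order-of-magnitude increase in parameters buys a predictable reduction in loss, which is what lets practitioners budget compute before training.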
Variants and Extensions
BERT and Masked Language Modeling
- Bidirectional encoder training
- Predicts masked tokens from surrounding context
- Foundation for many NLP understanding tasks
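The masking objective itself is simple to sketch: randomly replace a fraction of token ids with a mask id, and train the model to recover the originals (this toy version omits BERT's 80/10/10 replacement split):

```python
import numpy as np

def mask_tokens(tokens, mask_id, mask_prob=0.15, rng=None):
    """Replace ~mask_prob of tokens with mask_id; return the masked sequence
    and a boolean array marking which positions were masked (the training targets)."""
    if rng is None:
        rng = np.random.default_rng()
    tokens = np.array(tokens)
    positions = rng.random(len(tokens)) < mask_prob
    return np.where(positions, mask_id, tokens), positions

rng = np.random.default_rng(0)
original = [5, 12, 7, 9, 3, 11, 2, 8]
masked, pos = mask_tokens(original, mask_id=0, rng=rng)
```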
GPT and Causal Language Modeling
- Left-to-right decoder-only architecture
- Predicts next token given previous tokens
- Foundation for generative language models
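Next-token prediction turns into text generation through an autoregressive loop: score the vocabulary, pick a token, append it, repeat. A greedy-decoding sketch with a toy stand-in for the model:

```python
import numpy as np

def generate(logits_fn, prompt, n_new):
    """Greedy autoregressive decoding: repeatedly append the highest-scoring next token."""
    tokens = list(prompt)
    for _ in range(n_new):
        logits = logits_fn(tokens)           # model assigns a score to every vocab entry
        tokens.append(int(np.argmax(logits)))
    return tokens

# Toy "model" over a vocab of 10: always prefers the token after the last one.
toy = lambda toks: np.eye(10)[(toks[-1] + 1) % 10]
result = generate(toy, prompt=[3], n_new=4)
```

Real systems usually sample from the distribution (with temperature, top-k, or nucleus sampling) rather than always taking the argmax.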
T5 and Encoder-Decoder
- Unified encoder-decoder architecture
- Flexible for various NLP tasks (summarization, translation, question answering)
Connection to Large Language Models
The transformer architecture is the foundation of all modern large language models. GPT, Claude, Gemini, and other LLMs use transformer architectures at their core.
Impact
- NLP Revolution: Enabled dramatic improvements in language understanding and generation
- Multimodal Models: Extended to vision, audio, and other modalities
- Foundation Models: Transformer-based models serve as foundation for fine-tuning and prompting