Definition
The transformer is a deep learning architecture introduced by Vaswani et al. (2017) that forms the foundation of modern large language models. It replaces the sequential processing of recurrent networks with parallel processing built on a self-attention mechanism, enabling efficient training on massive datasets.
Core Components
Self-Attention Mechanism
The self-attention mechanism allows each token (word or subword unit) to directly attend to all other tokens in the sequence:
- Query, Key, Value matrices: Each token is projected into query, key, and value representations
- Attention weights: Computed by comparing queries to keys, determining which tokens to focus on
- Weighted sum: Values are combined using the attention weights to produce each token's output representation
- Parallelizable: All tokens can be processed in parallel, unlike sequential RNNs
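The steps above can be sketched in NumPy. This is a minimal illustration of scaled dot-product attention; the random weight matrices stand in for learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key similarity, (seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax: each row sums to 1
    return weights @ V, weights                            # weighted sum of values

# Toy example: 3 tokens with d_model = 4.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))  # placeholder projections
out, w = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
```

Note that every token's output is computed at once from the full sequence, which is exactly what makes the step parallelizable.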
Multi-Head Attention
- Multiple attention “heads” process the input simultaneously
- Each head learns different relationships between tokens
- Outputs are concatenated and projected
- Enables learning diverse types of linguistic relationships
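A rough sketch of the split-attend-concatenate-project pattern, again with random placeholder weights rather than learned ones:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Run attention once per head at reduced width, concatenate, then project."""
    seq, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head gets its own projections, so it can learn its own relationships.
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)
    Wo = rng.standard_normal((d_model, d_model))   # final output projection
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                    # 5 tokens, d_model = 8
out = multi_head_attention(X, num_heads=2, rng=rng)
```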
Feed-Forward Networks
- Position-wise feed-forward networks applied to each token
- Non-linear transformations increase model expressiveness
- Applied uniformly across all positions
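In code, "position-wise" means the same two-layer network is applied independently to each token's vector. A minimal sketch (the 4x hidden width is a common convention, not a requirement):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied identically at every position."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                              # hidden layer typically ~4x d_model
X = rng.standard_normal((5, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
out = position_wise_ffn(X, W1, b1, W2, b2)
```

Because the same weights are used at every position, feeding in a single token yields the same result as that token's row in the batched output.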
Positional Encoding
- Tokens need information about their position in the sequence
- Sinusoidal or learned positional encodings added to embeddings
- Enables the model to understand word order without sequential processing
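The sinusoidal variant from the original paper can be written directly from its formula, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Sinusoidal positional encodings; even dims get sin, odd dims get cos."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even feature indices
    angles = pos / np.power(10000.0, i / d_model)  # different wavelength per dimension
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=16)   # added elementwise to token embeddings
```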
Layer Normalization and Residual Connections
- Residual connections allow gradients to flow efficiently through deep networks
- Layer normalization stabilizes training
- Enables training of very deep models
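The combination is conventionally written LayerNorm(x + Sublayer(x)) (the post-norm arrangement of the original paper; many later models move the norm before the sublayer). A minimal sketch:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer_with_residual(x, sublayer):
    """Post-norm residual block: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
# Any sublayer works here; a random linear map stands in for attention or an FFN.
out = sublayer_with_residual(x, lambda t: t @ rng.standard_normal((8, 8)))
```

The identity path `x + ...` is what lets gradients bypass the sublayer, which is why very deep stacks remain trainable.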
Architecture Layers
Encoder
- Stack of transformer layers processing input
- Each layer: self-attention → feed-forward
- Output: contextualized representation of input
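Putting the pieces together, one encoder layer is just the two sublayers in sequence, and the stack feeds each layer's output into the next. A compact sketch (omitting biases and multiple heads for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def encoder_layer(X, params):
    """Self-attention sublayer, then feed-forward sublayer, each with residual + norm."""
    Wq, Wk, Wv, W1, W2 = params
    d_k = Wq.shape[1]
    attn = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d_k)) @ (X @ Wv)
    X = layer_norm(X + attn)
    ffn = np.maximum(0, X @ W1) @ W2
    return layer_norm(X + ffn)

rng = np.random.default_rng(0)
d = 8
params = tuple(rng.standard_normal(s) for s in [(d, d)] * 3 + [(d, 32), (32, d)])
X = rng.standard_normal((5, d))
for _ in range(2):          # stacking: each layer's output is the next layer's input
    X = encoder_layer(X, params)
```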
Decoder
- Stack of transformer layers
- Attends to both previously generated tokens (masked self-attention) and the encoder output (cross-attention)
- Generates output tokens one at a time (or in parallel during training)
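What makes parallel training compatible with one-at-a-time generation is the causal mask: positions above the diagonal of the attention score matrix are set to negative infinity before the softmax, so token t can only attend to tokens at or before t. A sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions so each token attends only to itself and earlier tokens."""
    seq = scores.shape[0]
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)   # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)               # exp(-inf) = 0 after softmax
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
w = causal_attention_weights(rng.standard_normal((4, 4)))
```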
Additional Components
- Cross-attention: allows the decoder to attend to encoder outputs (absent in decoder-only models)
- Position-wise feed-forward networks in every layer
- Output projection layer mapping final representations to vocabulary logits
Key Advantages Over Previous Architectures
RNNs and LSTMs
- Parallelization: Transformers process all positions in parallel; RNNs are sequential
- Long-range dependencies: Self-attention can directly connect distant tokens; RNNs suffer from the vanishing gradient problem over long sequences
- Training speed: Transformers are dramatically faster to train on GPUs/TPUs
CNNs
- Flexible context: Transformers can attend to arbitrary positions; CNNs have fixed receptive fields
- Interpretability: Attention weights can be visualized; CNN features are less interpretable
Training and Scaling
Large-Scale Training
- Transformer architecture scales efficiently to billions of parameters
- Trained on massive text corpora (hundreds of billions to trillions of tokens)
- Foundation for large language models
Scaling Laws
- Performance improves predictably as model size, data, and compute increase
- Suggests large models can achieve strong performance with sufficient resources
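The "predictable" part is that empirical fits take a power-law form, roughly L(N) = (N_c / N)^alpha for parameter count N. The constants below are illustrative placeholders, not fitted values:

```python
def predicted_loss(n_params, n_c=1e14, alpha=0.08):
    """Power-law scaling sketch: L(N) = (N_c / N)^alpha.
    n_c and alpha are made-up constants for illustration; real values
    come from empirical fits on actual training runs."""
    return (n_c / n_params) ** alpha
```

The key property is monotonic, smooth improvement: every order-of-magnitude increase in parameters buys a predictable reduction in loss, which is what lets practitioners budget compute before training.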
Variants and Extensions
BERT and Masked Language Modeling
- Bidirectional encoder training
- Predicts masked tokens from surrounding context
- Foundation for many NLP understanding tasks
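The masking objective itself is simple to sketch: randomly replace a fraction of token ids with a mask id, and train the model to recover the originals (this toy version omits BERT's 80/10/10 replacement split):

```python
import numpy as np

def mask_tokens(tokens, mask_id, mask_prob=0.15, rng=None):
    """Replace ~mask_prob of tokens with mask_id; return the masked sequence
    and a boolean array marking which positions were masked (the training targets)."""
    if rng is None:
        rng = np.random.default_rng()
    tokens = np.array(tokens)
    positions = rng.random(len(tokens)) < mask_prob
    return np.where(positions, mask_id, tokens), positions

rng = np.random.default_rng(0)
original = [5, 12, 7, 9, 3, 11, 2, 8]
masked, pos = mask_tokens(original, mask_id=0, rng=rng)
```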
GPT and Causal Language Modeling
- Left-to-right decoder-only architecture
- Predicts next token given previous tokens
- Foundation for generative language models
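Next-token prediction turns into text generation through an autoregressive loop: score the vocabulary, pick a token, append it, repeat. A greedy-decoding sketch with a toy stand-in for the model:

```python
import numpy as np

def generate(logits_fn, prompt, n_new):
    """Greedy autoregressive decoding: repeatedly append the highest-scoring next token."""
    tokens = list(prompt)
    for _ in range(n_new):
        logits = logits_fn(tokens)           # model assigns a score to every vocab entry
        tokens.append(int(np.argmax(logits)))
    return tokens

# Toy "model" over a vocab of 10: always prefers the token after the last one.
toy = lambda toks: np.eye(10)[(toks[-1] + 1) % 10]
result = generate(toy, prompt=[3], n_new=4)
```

Real systems usually sample from the distribution (with temperature, top-k, or nucleus sampling) rather than always taking the argmax.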
T5 and Encoder-Decoder
- Unified encoder-decoder architecture
- Flexible for various NLP tasks (summarization, translation, question answering)
Connection to Large Language Models
The transformer architecture is the foundation of all modern large language models. GPT, Claude, Gemini, and other LLMs use transformer architectures at their core.
Impact
- NLP Revolution: Enabled dramatic improvements in language understanding and generation
- Multimodal Models: Extended to vision, audio, and other modalities
- Foundation Models: Transformer-based models serve as foundation for fine-tuning and prompting