Definition

The transformer is a deep learning architecture introduced by Vaswani et al. (2017) that forms the foundation of modern large language models. It replaces sequential recurrence with a self-attention mechanism, allowing all positions in a sequence to be processed in parallel and enabling efficient training on massive datasets.

Core Components

Self-Attention Mechanism

The self-attention mechanism allows each token (word or subword unit) to directly attend to all other tokens in the sequence:

  • Query, Key, Value matrices: Each token is projected into query, key, and value representations
  • Attention weights: Computed by comparing queries to keys, determining which tokens to focus on
  • Weighted sum: Values are summed using attention weights to produce output
  • Parallelizable: All tokens can be processed in parallel, unlike sequential RNNs
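The steps above can be sketched in a few lines of NumPy. The 4-token sequence, model width of 8, and random weight matrices are illustrative stand-ins, not a trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the row max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Project tokens to queries/keys/values, compare queries to keys,
    and return the attention-weighted sum of values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # token-to-token similarity
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 tokens, model width 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(X, Wq, Wk, Wv)
```

Because the weighted sum for every token is computed from the same matrix products, the whole sequence is handled in one batch of matrix multiplications rather than a token-by-token loop.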

Multi-Head Attention

  • Multiple attention “heads” process the input simultaneously
  • Each head learns different relationships between tokens
  • Outputs are concatenated and projected
  • Enables learning diverse types of linguistic relationships
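A minimal multi-head sketch, assuming the model width is split evenly across heads; all shapes and random weights are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # split each projection into heads: (n_heads, seq, d_head)
    split = lambda M: M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                       # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                                 # final output projection

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=2)
```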

Feed-Forward Networks

  • Position-wise feed-forward networks applied to each token
  • Non-linear transformations increase model expressiveness
  • Applied uniformly across all positions
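A sketch of the position-wise network, assuming the common expand-then-project shape with a ReLU non-linearity (the dimensions are illustrative):

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    # the same two-layer MLP is applied independently at every position
    hidden = np.maximum(0, X @ W1 + b1)   # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # 4 tokens, width 8
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)    # expand to inner width 32
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)     # project back to width 8
out = feed_forward(X, W1, b1, W2, b2)
```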

Positional Encoding

  • Tokens need information about their position in the sequence
  • Sinusoidal or learned positional encodings added to embeddings
  • Enables the model to understand word order without sequential processing
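The sinusoidal variant from the original paper can be computed directly; sequence length and width here are illustrative:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(seq_len=10, d_model=8)
# pe is added element-wise to the token embeddings before the first layer
```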

Layer Normalization and Residual Connections

  • Residual connections allow gradients to flow efficiently through deep networks
  • Layer normalization stabilizes training
  • Enables training of very deep models
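A sketch of the "Add & Norm" pattern, with a stand-in function playing the role of the attention or feed-forward sub-layer; layer normalization's learned scale and offset parameters are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's features to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    # "Add & Norm": the sub-layer's output is added back to its input
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = residual_block(x, lambda h: h * 0.5)   # stand-in for attention or FFN
```

Because the input is added back unchanged, gradients have a direct path around each sub-layer, which is what makes very deep stacks trainable.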

Architecture Layers

Encoder

  • Stack of transformer layers processing input
  • Each layer: self-attention → feed-forward
  • Output: contextualized representation of input
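Putting the pieces together, one encoder layer can be sketched as single-head self-attention followed by a feed-forward network, each wrapped in a residual connection and normalization. For brevity this sketch reuses one set of random weights across the stack; a real model has separate weights per layer plus multiple heads:

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq = 8, 4

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def norm(x):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-5)

def encoder_layer(x, p):
    # sub-layer 1: self-attention with residual connection + norm
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(d)) @ v
    x = norm(x + attn)
    # sub-layer 2: feed-forward with residual connection + norm
    ff = np.maximum(0, x @ p["W1"]) @ p["W2"]
    return norm(x + ff)

params = {name: rng.normal(size=shape) * 0.1
          for name, shape in [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
                              ("W1", (d, 4 * d)), ("W2", (4 * d, d))]}
x = rng.normal(size=(seq, d))
for _ in range(2):           # a stack of layers; output stays (seq, d)
    x = encoder_layer(x, params)
```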

Decoder

  • Stack of transformer layers
  • Attends to both previously generated tokens (masked self-attention) and the encoder output (cross-attention)
  • Generates output tokens one at a time (or in parallel during training)
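Causal generation is typically enforced with a triangular mask that zeroes out attention to future positions; a minimal sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Causal mask: position i may attend only to positions <= i.
seq = 4
mask = np.triu(np.ones((seq, seq)), k=1).astype(bool)  # True above the diagonal
scores = np.zeros((seq, seq))                          # stand-in attention scores
scores[mask] = -np.inf                                 # -inf -> zero weight after softmax

weights = softmax(scores)
# row 0 attends only to token 0; row 3 attends to tokens 0..3
```

The same mask is applied during training, which is what lets all output positions be predicted in parallel without any position seeing its own future.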

Optional Components

  • Cross-attention: allows decoder to attend to encoder outputs
  • Position-wise feed-forward networks
  • Output projection layer

Key Advantages Over Previous Architectures

RNNs and LSTMs

  • Parallelization: Transformers process all positions in parallel; RNNs are sequential
  • Long-range dependencies: Self-attention can directly connect distant tokens; RNNs suffer from vanishing gradients over long sequences
  • Training speed: Transformers are dramatically faster to train on GPUs/TPUs

CNNs

  • Flexible context: Transformers can attend to arbitrary positions; CNNs have fixed receptive fields
  • Interpretability: Attention weights can be visualized; CNN features are less interpretable

Training and Scaling

Large-Scale Training

  • Transformer architecture scales efficiently to billions of parameters
  • Trained on massive text corpora (hundreds of billions to trillions of tokens)
  • Foundation for large language models

Scaling Laws

  • Performance improves predictably as model size, data, and compute increase
  • Suggests large models can achieve strong performance with sufficient resources
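The qualitative shape of a scaling law can be illustrated with a toy power law of the form L(N) = a · N^(-alpha), where loss falls smoothly as parameter count N grows. The constants below are made up for the sketch and are not fitted values from any published study:

```python
# Toy scaling-law curve: loss decreases predictably with model size.
a, alpha = 10.0, 0.07                       # illustrative constants only
sizes = [1e8, 1e9, 1e10, 1e11]              # parameter counts
losses = [a * N ** (-alpha) for N in sizes]
# each tenfold increase in N yields a smooth, predictable drop in loss
```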

Variants and Extensions

BERT and Masked Language Modeling

  • Bidirectional encoder training
  • Predicts masked tokens from surrounding context
  • Foundation for many NLP understanding tasks
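A sketch of the masking step of this objective. The sentence and mask positions are illustrative; real BERT masks a random ~15% of tokens and sometimes substitutes random or unchanged tokens instead of [MASK]:

```python
# Masked language modeling: hide chosen tokens, then predict the
# originals from bidirectional (left and right) context.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
mask_positions = {2, 4}                        # positions chosen for masking
corrupted = ["[MASK]" if i in mask_positions else t
             for i, t in enumerate(tokens)]
targets = {i: tokens[i] for i in mask_positions}   # what the model must recover
# corrupted: ['the', 'cat', '[MASK]', 'on', '[MASK]', 'mat']
```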

GPT and Causal Language Modeling

  • Left-to-right decoder-only architecture
  • Predicts next token given previous tokens
  • Foundation for generative language models
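The decoding loop can be illustrated with a toy stand-in for the model: here a fixed bigram table plays the role of the transformer's next-token distribution (vocabulary and scores are invented for the sketch):

```python
import numpy as np

# Greedy autoregressive decoding: repeatedly pick the highest-scoring
# next token and append it to the context.
vocab = ["<s>", "the", "cat", "sat", "</s>"]
bigram_logits = np.array([
    [0., 5., 1., 0., 0.],   # after <s>   -> "the"
    [0., 0., 5., 1., 0.],   # after "the" -> "cat"
    [0., 0., 0., 5., 1.],   # after "cat" -> "sat"
    [0., 0., 0., 0., 5.],   # after "sat" -> "</s>"
    [5., 0., 0., 0., 0.],
])

tokens = [0]                                   # start-of-sequence token
while tokens[-1] != 4 and len(tokens) < 10:    # stop at </s> or length cap
    next_id = int(np.argmax(bigram_logits[tokens[-1]]))   # greedy choice
    tokens.append(next_id)

sentence = " ".join(vocab[t] for t in tokens)
```

A real LLM replaces the bigram table with a full transformer forward pass over the entire context, and usually samples from the distribution rather than always taking the argmax.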

T5 and Encoder-Decoder

  • Unified encoder-decoder architecture
  • Flexible for various NLP tasks (summarization, translation, question answering)

Connection to Large Language Models

The transformer architecture is the foundation of all modern large language models. GPT, Claude, Gemini, and other LLMs use transformer architectures at their core.

Impact

  • NLP Revolution: Enabled dramatic improvements in language understanding and generation
  • Multimodal Models: Extended to vision, audio, and other modalities
  • Foundation Models: Transformer-based models serve as foundation for fine-tuning and prompting

See Also