Overview

A comprehensive technical reference on large language models (LLMs)—AI systems trained on vast amounts of text data to generate human-like language. LLMs represent a major breakthrough in natural language processing and have become foundational to modern AI systems.

Foundational Concepts

Tokens

Text is broken into tokens—subword units that can represent single characters, common substrings, or whole words depending on tokenization scheme. Most LLMs use subword tokenization (like Byte-Pair Encoding).

  • Vocabulary size: Typical LLMs have vocabularies of 50,000-200,000 tokens
  • Context window: Maximum number of tokens the model processes at once (varies; modern models range from 4K to 200K+ tokens)
  • Computational cost: Scales with token count; longer inputs and outputs increase computation
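To make the token concept concrete, here is a toy greedy longest-match tokenizer with a tiny hand-written vocabulary. This is purely illustrative: real LLM tokenizers learn their vocabularies from data (e.g., via BPE, described below) rather than using a fixed word list like this.

```python
# Toy tokenizer: greedily match the longest vocabulary entry at each
# position. The vocabulary here is hand-written for illustration only.
VOCAB = ["un", "break", "able", "b", "r", "e", "a", "k", "u", "n", "l"]

def tokenize(text, vocab):
    """Split text into subword tokens by greedy longest-match."""
    pieces_by_length = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for piece in pieces_by_length:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

print(tokenize("unbreakable", VOCAB))  # ['un', 'break', 'able']
```

Note how a single word becomes three tokens: token counts, not word counts, are what fill the context window and drive computational cost.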

Neural Networks and Deep Learning

LLMs are built on artificial neural networks—systems inspired by biological neurons arranged in layers.

  • Layers: Data flows through multiple computational layers, each learning progressively more complex patterns
  • Parameters: Weights and biases learned during training; LLM parameters range from billions to trillions
  • Activation functions: Non-linearities (ReLU, GELU) enabling networks to learn complex relationships
  • Backpropagation: The learning algorithm that adjusts parameters to minimize error

Probability and Next-Token Prediction

LLMs are fundamentally probabilistic: they predict the probability distribution over possible next tokens given preceding tokens.

  • Softmax layer: Converts model outputs into probability distributions over vocabulary
  • Sampling vs. Argmax: Can either sample from the distribution (introducing randomness) or select highest-probability token (deterministic)
  • Temperature: Controls randomness; higher temperature increases diversity, lower temperature makes output more deterministic
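The softmax, temperature, and sampling-vs-argmax ideas above can be sketched in a few lines (a minimal illustration, not how production inference engines are implemented):

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw model scores into a probability distribution.
    Dividing logits by the temperature first sharpens (T < 1) or
    flattens (T > 1) the resulting distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]       # hypothetical scores for a 3-token vocabulary
probs = softmax(logits)

# Argmax decoding: deterministic, always picks the most likely token.
greedy = probs.index(max(probs))

# Sampling: draws from the distribution, introducing randomness.
sampled = random.choices(range(len(probs)), weights=probs, k=1)[0]
```

Lowering the temperature concentrates probability on the top token, so sampled output approaches the deterministic argmax choice.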

Transformer Architecture

The transformer architecture, introduced in 2017 in "Attention Is All You Need," revolutionized NLP by enabling parallel processing of sequences and direct modeling of long-range dependencies.

Self-Attention Mechanism

The core innovation allowing tokens to directly attend to (weight) other tokens regardless of distance.

  • Queries, Keys, Values: Each token produces three vectors; similarity between queries and keys determines attention weights
  • Masked attention: During language generation, attention is masked to prevent attending to future tokens
  • Interpretability: Attention weights can sometimes reveal which parts of input the model focused on
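A minimal causal self-attention sketch over pre-computed query/key/value vectors. In a real model, Q, K, and V come from learned linear projections of token embeddings; here they are supplied directly to keep the mask and weighting logic visible.

```python
import math

def causal_attention(queries, keys, values):
    """Masked scaled dot-product attention: token t may only attend
    to positions s <= t (the causal mask)."""
    d = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores only against current and earlier keys: masking is
        # implemented simply by never scoring future positions.
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]   # attention weights sum to 1
        outputs.append([sum(w * values[s][j] for s, w in enumerate(weights))
                        for j in range(len(values[0]))])
    return outputs

q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx = causal_attention(q, k, v)
# The first token can only attend to itself, so ctx[0] == v[0].
```

The `weights` list for each token is exactly the attention pattern that interpretability work inspects.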

Multi-Head Attention

Instead of a single attention operation, multiple parallel attention heads are used, each learning different attention patterns.

  • Ensemble effect: Different heads attend to different aspects (e.g., one head to pronouns, another to verbs)
  • Computational scaling: Multiple heads increase computation but improve model expressiveness
  • Head specialization: Research shows different heads learn semantically meaningful patterns
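Mechanically, multi-head attention splits each token's vector into per-head slices, runs attention independently on each slice, and concatenates the results. A sketch of just the split/merge bookkeeping (the attention step itself is as in the previous section):

```python
def split_heads(tokens, num_heads):
    """Split each token's d_model-dim vector into num_heads slices of
    size d_model // num_heads; each slice is attended over separately."""
    d_model = len(tokens[0])
    assert d_model % num_heads == 0
    head_dim = d_model // num_heads
    return [[tok[h * head_dim:(h + 1) * head_dim] for tok in tokens]
            for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate each token's per-head outputs back to d_model dims."""
    return [sum((head[t] for head in heads), []) for t in range(len(heads[0]))]

seq = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
heads = split_heads(seq, num_heads=2)   # two heads, two dims each
# merge_heads(split_heads(x, h)) round-trips back to x.
```

Because the total dimensionality is unchanged, the added cost of multiple heads is modest relative to the expressiveness gained.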

Positional Encoding

Since transformers process all tokens in parallel (unlike sequential RNNs), they require explicit position information.

  • Absolute positional encodings: Fixed encodings based on token position
  • Relative positional encodings: Encode relative distances between tokens
  • Rotary embeddings (RoPE): Modern approach using rotation matrices
  • ALiBi (Attention with Linear Biases): Simpler approach adding position-dependent bias to attention
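The original absolute scheme, sinusoidal encoding from the 2017 transformer paper, can be written directly from its formula: even dimensions use sine, odd dimensions cosine, with geometrically increasing wavelengths.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Absolute positional encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

pe0 = sinusoidal_encoding(0, 8)
# Position 0 alternates sin(0)=0 and cos(0)=1 across dimensions.
```

These vectors are added to token embeddings before the first layer, giving every position a unique, smoothly varying signature.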

Feedforward Networks

After attention, each token passes through a feedforward network (typically two dense layers with activation).

  • Expansion and contraction: Usually expands to 4x hidden dimension then contracts back
  • Token-independent processing: Feedforward is applied identically to each token position
  • Computational bulk: Feedforward layers contain the majority of model parameters
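The expand-then-contract pattern can be sketched as two dense layers with a ReLU in between (real models typically use GELU or gated variants, and of course learned rather than random weights):

```python
import random

def feedforward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand to d_ff, apply ReLU, contract to d_model.
    Applied identically and independently at every token position."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    return [sum(hi * w for hi, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]

d_model, d_ff = 4, 16          # the common 4x expansion ratio
random.seed(0)                 # random weights, for illustration only
w1 = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
b1 = [0.0] * d_ff
w2 = [[random.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
b2 = [0.0] * d_model
y = feedforward([1.0, 2.0, -1.0, 0.5], w1, b1, w2, b2)
```

The two weight matrices together hold roughly 8 × d_model² parameters per layer, which is why feedforward blocks dominate the parameter count.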

Layer Normalization

Applied to stabilize training and improve convergence.

  • Pre-normalization vs. Post-normalization: Whether to normalize before or after attention/feedforward
  • RMSNorm: Simplified normalization used in modern models such as Llama and T5 (GPT-3 uses standard LayerNorm)
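RMSNorm is simple enough to show in full: unlike LayerNorm, it skips mean subtraction and the bias term, rescaling activations by their root mean square.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: x_i -> gain_i * x_i / RMS(x), where
    RMS(x) = sqrt(mean(x^2) + eps). No mean-centering, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

y = rms_norm([3.0, 4.0], [1.0, 1.0])
# After normalization with unit gain, the mean of squares is ~1.
```

Dropping the mean statistic saves a reduction pass and, empirically, costs little to no quality, which is why many recent models adopt it.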

Tokenization

Tokenization converts raw text into tokens and vice versa—a crucial but often overlooked component.

Byte-Pair Encoding (BPE)

Iteratively merges most frequent byte/character pairs into new tokens, building vocabulary bottom-up.

  • Simplicity: Elegant algorithm, easy to implement
  • Language-agnostic: Works on any text, including code and non-Latin scripts
  • Used by: GPT series, most modern LLMs
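The merge loop at the heart of BPE fits in a short function. This sketch learns merges from a toy three-word corpus; production tokenizers add byte-level fallbacks, pre-tokenization, and frequency weighting, but the core algorithm is the same.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent
    symbol pair across the corpus into a single new symbol."""
    corpus = [list(w) for w in words]     # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        for symbols in corpus:            # apply the merge everywhere
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == (a, b):
                    symbols[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

merges = learn_bpe(["low", "lower", "lowest"], 2)
# First 'l'+'o' merge, then 'lo'+'w': the shared stem "low" emerges.
```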

SentencePiece

A tokenization library (supporting BPE and unigram models) that operates on raw text, handling spaces and special characters directly.

  • Space handling: Treats spaces as tokens, enabling reconstruction without ambiguity
  • Used by: T5, recent language models

Limitations

Tokenization can create unexpected behaviors:

  • Representation bias: Frequent words become single tokens while rare words fragment into many pieces
  • Multilingual challenges: Unequal efficiency across languages
  • Arithmetic errors: Models sometimes struggle with arithmetic partly due to token-level representation

Training: Pre-training vs. Fine-tuning

Pre-training

Models train on massive, unlabeled text corpora (trillions of tokens) with a self-supervised objective: predict the next token.

  • Objective: Minimize cross-entropy loss between predicted and actual next token
  • Data scale: Unprecedented amounts (Common Crawl, Book Corpus, Wikipedia, etc.)
  • Duration: Weeks to months on clusters of thousands of GPUs/TPUs
  • Cost: Estimated in the tens to hundreds of millions of dollars for frontier models
  • Emergence: Learned representations contain rich linguistic and world knowledge
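The cross-entropy objective in the first bullet reduces, per position, to the negative log-probability the model assigned to the actual next token. A minimal sketch using a log-sum-exp for stability:

```python
import math

def next_token_loss(logits, target_index):
    """Cross-entropy for one next-token prediction:
    -log softmax(logits)[target_index]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]

# Confident, correct prediction -> low loss; wrong prediction -> high loss.
good = next_token_loss([5.0, 0.0, 0.0], target_index=0)
bad = next_token_loss([5.0, 0.0, 0.0], target_index=1)
```

Pre-training simply minimizes the average of this quantity over trillions of token positions.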

Fine-tuning

Pre-trained models adapt to specific tasks through training on smaller, task-specific labeled datasets.

  • Instruction fine-tuning: Training on examples of instructions and correct responses
  • Smaller datasets: Often tens of thousands of examples suffice
  • Rapid convergence: Fine-tuning typically takes days or hours
  • Low cost: Computationally cheap relative to pre-training

Reinforcement Learning from Human Feedback (RLHF)

An additional training phase using human judgments to improve model behavior.

  1. Reward model: Train a separate model to predict human preference between outputs
  2. Policy optimization: Use reinforcement learning to update the language model to maximize predicted rewards
  3. Alignment: Helps ensure model outputs align with human values and intentions
  4. Trade-offs: Can reduce factuality or diversity in pursuit of “alignment”
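Step 1, the reward model, is commonly trained with a Bradley-Terry style pairwise loss: maximize the probability that the human-preferred output scores higher. A sketch of that loss (the reward values here are placeholders; in practice they come from a learned model scoring full responses):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Lower when the preferred output's reward exceeds the rejected one's."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# A larger margin in favor of the preferred output means lower loss.
confident = preference_loss(2.0, 0.0)
uncertain = preference_loss(0.5, 0.0)
```

The trained reward model then supplies the scalar signal that the policy-optimization step (commonly PPO) maximizes.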

Scaling Laws

Empirical research reveals predictable relationships between model size, data size, and performance.

Parameters and Performance

Larger models (more parameters) consistently perform better across tasks, following power-law scaling.

  • Chinchilla scaling: For a fixed compute budget, parameters and training tokens should scale together, roughly 20 tokens per parameter
  • No plateau observed: Performance continues improving even at 100B+ parameters
  • Downstream scaling: Better pre-training translates to better downstream performance

Data and Performance

More training data improves performance, also following power laws.

  • Data efficiency: Larger models learn more per example
  • Irreversibility: Once trained on certain data, models cannot easily "unlearn" it; machine unlearning remains an open problem
  • Data diversity: Mixture of domains matters; diversity improves generalization

Compute-Optimal Training

Given fixed compute budget, optimal allocation balances:

  • Model size: How many parameters to train
  • Data size: How many tokens to train on
  • Training duration: How long to train (modern LLMs often see most of their data only once)

Modern consensus, following Chinchilla: scale parameters and training tokens in roughly equal proportion as compute grows.
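These rules of thumb can be turned into a back-of-the-envelope sizing calculation. The sketch below assumes two widely cited approximations: training compute ≈ 6 × N × D FLOPs, and the Chinchilla ratio of about 20 training tokens per parameter; actual optimal ratios vary with data quality and architecture.

```python
def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Given a compute budget C ~ 6 * N * D and the constraint
    D ~ tokens_per_param * N, solve for parameters N and tokens D."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A roughly Chinchilla-scale budget of ~5.8e23 FLOPs suggests
# ~70B parameters trained on ~1.4T tokens.
n, d = chinchilla_allocation(5.76e23)
```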

Model Compression

As LLMs become larger, compression techniques reduce size for deployment.

Quantization

Representing weights with fewer bits than float32.

  • INT8 quantization: 4x memory reduction with minimal performance loss
  • INT4 and below: Aggressive quantization with noticeable degradation
  • Calibration: Critical step measuring statistics for scale/zero-point selection
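A minimal symmetric INT8 scheme makes the calibration bullet concrete: the "calibration" here is just measuring the maximum absolute weight to choose the scale (real systems calibrate per-channel or per-group and often use activation statistics too).

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: scale = max|w| / 127, then round
    each weight to an integer in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return [qi * scale for qi in q]

w = [0.1, -0.5, 0.25, 1.0]
q, scale = quantize_int8(w)
recovered = dequantize(q, scale)
# Each recovered weight is within one quantization step of the original.
```

Storing one byte per weight instead of four gives the 4x memory reduction; the error is bounded by the step size, which is why well-calibrated INT8 loses little accuracy.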

Pruning

Removing less-important parameters or connections.

  • Unstructured pruning: Remove individual weights
  • Structured pruning: Remove entire channels, heads, or layers
  • Iterative pruning: Prune during or after training
  • Trade-off: Reduces size but requires retraining to maintain accuracy

Distillation

Train a smaller “student” model to imitate a larger “teacher” model.

  • Knowledge transfer: Student learns patterns from teacher’s outputs
  • Efficiency: Smaller models run faster and use less memory
  • Quality loss: Student typically performs worse than teacher
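A common distillation objective is the KL divergence between the teacher's and student's softened output distributions, so the student matches the teacher's full distribution rather than only its top prediction. A sketch over single-position logits:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.
    A temperature > 1 exposes the teacher's 'dark knowledge' about
    relative probabilities of non-top tokens."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

loss_same = distillation_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
loss_diff = distillation_loss([0.0, 1.0, 2.0], [2.0, 1.0, 0.0])
# Zero loss when the student already matches the teacher exactly.
```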

Alignment and Safety

As LLMs become more capable, alignment—ensuring they behave according to human intentions—becomes critical.

Challenges

  • Specification gaming: Models find unintended ways to optimize objectives
  • Distributional shift: Models behave differently on out-of-distribution inputs
  • Adversarial robustness: Carefully crafted inputs can trigger harmful behavior
  • Value disagreement: Different humans want different behaviors

Approaches

  • RLHF: Using human feedback to steer behavior (see above)
  • Constitutional AI: Training with explicit principles to follow
  • Interpretability: Understanding model internals to identify failure modes
  • Monitoring and auditing: Red-teaming and testing for failure cases

Retrieval-Augmented Generation (RAG)

Instead of relying solely on pre-training knowledge, augment LLMs with external information.

Mechanism

  1. Query: Convert user input to embedding
  2. Retrieval: Search external knowledge base for relevant documents
  3. Augmentation: Prepend retrieved documents to input context
  4. Generation: Generate response conditioned on both user query and retrieved context
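Steps 1-3 above can be sketched end to end. The `embed` function here is a hypothetical stand-in (a bag-of-words count over a tiny fixed vocabulary); real systems use learned embedding models and vector databases, but the query-embed, rank-by-similarity, prepend-to-prompt flow is the same.

```python
import math

def embed(text):
    """Hypothetical embedding: bag-of-words over a toy vocabulary.
    A real system would call a learned embedding model here."""
    vocab = ["transformer", "attention", "paris", "capital", "france"]
    return [text.lower().split().count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "paris is the capital of france",
    "attention lets a transformer weigh distant tokens",
]

def retrieve(query, docs, k=1):
    """Steps 1-2: embed the query, rank documents by similarity."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

query = "what is the capital of france"
context = retrieve(query, documents)
# Step 3: augmentation -- prepend retrieved text to the prompt before
# generation (step 4, performed by the LLM).
prompt = f"Context: {context[0]}\n\nQuestion: {query}"
```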

Advantages

  • Factuality: Access to current information not in training data
  • Source transparency: Can show which documents informed the answer
  • Reduced hallucination: Grounding in external facts decreases false claims

Challenges

  • Retrieval quality: Poor retrieval degrades output
  • Context length: Retrieved documents consume token budget
  • Integration complexity: Requires embedding models, retrieval systems, document processing

Tools and Frameworks

Model Architecture

  • Transformers library (Hugging Face): Standard PyTorch implementations of models, simple API
  • JAX: High-performance numerical computing, used in large-scale training
  • PyTorch: Most common deep learning framework

Training

  • Megatron: NVIDIA framework for training massive models efficiently
  • DeepSpeed: Microsoft framework for distributed training with memory optimizations
  • Ray: Distributed computing for hyperparameter tuning and distributed training

Serving and Inference

  • vLLM: Efficient serving of LLMs with optimized attention and batching
  • TensorRT-LLM: NVIDIA’s optimization framework for inference
  • Ollama: Local model serving and management

Evaluation

  • MMLU: Multiple-choice questions across 57 subjects
  • HellaSwag: Commonsense reasoning benchmark
  • HumanEval: Code generation evaluation
  • TruthfulQA: Factuality assessment
  • Custom benchmarks: Task-specific evaluation sets

Current Research Frontiers

Mixture of Experts (MoE)

Instead of computing all parameters for every input, route each input to a sparse set of expert sub-networks.

  • Scaling efficiency: Can increase model size without proportional compute cost
  • Routing mechanisms: How to assign inputs to experts; learned routing is challenging
  • Load balancing: Preventing some experts from becoming overloaded
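The routing step can be sketched as a softmax over per-expert router scores followed by top-k selection and renormalization (a simplified view; production routers add noise, capacity limits, and auxiliary load-balancing losses):

```python
import math

def top_k_routing(router_logits, k=2):
    """Select the k highest-probability experts and renormalize their
    weights. Only the chosen experts run a forward pass, so compute
    stays roughly constant as the expert count grows."""
    m = max(router_logits)
    exps = [math.exp(l - m) for l in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i],
                    reverse=True)[:k]
    weight_sum = sum(probs[i] for i in chosen)
    return {i: probs[i] / weight_sum for i in chosen}

routes = top_k_routing([2.0, -1.0, 0.5, 1.5], k=2)
# Experts 0 and 3 are selected; their renormalized weights sum to 1.
```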

Multimodal Models

Extending LLMs to process and generate images, audio, video alongside text.

  • Vision-language models: CLIP-style contrastive training or instruction-tuned image-to-text
  • Audio and music: Emerging multimodal models handling speech and music
  • Unified representations: Learning shared embedding spaces across modalities

Long Context Windows

Modern models support increasingly long context (100K+ tokens, versus the 1-2K of early GPT-era models).

  • ALiBi and RoPE: Position encoding approaches that extrapolate better to longer lengths
  • Sparse attention: Reduce quadratic attention cost with structured sparsity
  • Recurrent transformers: Process long sequences without dense attention
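ALiBi's position-dependent bias is simple enough to show: instead of positional embeddings, a penalty proportional to the query-key distance is added to each attention score (here with a single illustrative slope; real models use a different fixed slope per head).

```python
def alibi_bias(seq_len, slope=0.5):
    """Causal ALiBi biases: bias[i][j] = -slope * (i - j) is added to
    token i's attention score for position j (j <= i). Closer tokens
    receive smaller penalties."""
    return [[-slope * (i - j) for j in range(i + 1)]
            for i in range(seq_len)]

bias = alibi_bias(4)
# For token 3: penalties grow linearly with distance, reaching 0 at
# its own position, which is what lets ALiBi extrapolate to lengths
# longer than those seen in training.
```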

Interpretability and Mechanistic Understanding

Understanding how LLMs actually compute and make decisions.

  • Feature visualization: What patterns do individual neurons/components detect?
  • Causal interventions: Ablating components to determine their function
  • Representation analysis: How is information encoded in activations?
  • Circuit analysis: How do components mechanistically interact?

Efficient Training and Inference

Reducing computational and energy costs.

  • Sparse training: Train with sparsity from start rather than pruning after
  • Bit-width reduction: Training with low-precision arithmetic
  • Adaptive computation: Use variable-length processing per token
  • Federated learning: Training on distributed data without centralization