Overview

A comprehensive technical reference on large language models (LLMs)—AI systems trained on vast amounts of text data to generate human-like language. LLMs represent a major breakthrough in natural language processing and have become foundational to modern AI systems.

Foundational Concepts

Tokens

Text is broken into tokens—subword units that can represent single characters, common substrings, or whole words depending on tokenization scheme. Most LLMs use subword tokenization (like Byte-Pair Encoding).

  • Vocabulary size: Typical LLMs have vocabularies of 50,000-200,000 tokens
  • Context window: Maximum number of tokens the model processes at once (varies; modern models range from 4K to 200K+ tokens)
  • Computational cost: Scales with token count; longer inputs and outputs increase computation
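To make the token concept concrete, here is a toy greedy longest-match tokenizer with a tiny hand-written vocabulary. This is purely illustrative: real LLM tokenizers learn their vocabularies from data (e.g., via BPE, described below) rather than using a fixed word list like this.

```python
# Toy tokenizer: greedily match the longest vocabulary entry at each
# position. The vocabulary here is hand-written for illustration only.
VOCAB = ["un", "break", "able", "b", "r", "e", "a", "k", "u", "n", "l"]

def tokenize(text, vocab):
    """Split text into subword tokens by greedy longest-match."""
    pieces_by_length = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for piece in pieces_by_length:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

print(tokenize("unbreakable", VOCAB))  # ['un', 'break', 'able']
```

Note how a single word becomes three tokens: token counts, not word counts, are what fill the context window and drive computational cost.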

Neural Networks and Deep Learning

LLMs are built on artificial neural networks—systems inspired by biological neurons arranged in layers.

  • Layers: Data flows through multiple computational layers, each learning progressively more complex patterns
  • Parameters: Weights and biases learned during training; LLM parameters range from billions to trillions
  • Activation functions: Non-linearities (ReLU, GELU) enabling networks to learn complex relationships
  • Backpropagation: The learning algorithm that adjusts parameters to minimize error

Probability and Next-Token Prediction

LLMs are fundamentally probabilistic: they predict the probability distribution over possible next tokens given preceding tokens.

  • Softmax layer: Converts model outputs into probability distributions over vocabulary
  • Sampling vs. Argmax: Can either sample from the distribution (introducing randomness) or select highest-probability token (deterministic)
  • Temperature: Controls randomness; higher temperature increases diversity, lower temperature makes output more deterministic
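The softmax, temperature, and sampling-vs-argmax ideas above can be sketched in a few lines (a minimal illustration, not how production inference engines are implemented):

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw model scores into a probability distribution.
    Dividing logits by the temperature first sharpens (T < 1) or
    flattens (T > 1) the resulting distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]       # hypothetical scores for a 3-token vocabulary
probs = softmax(logits)

# Argmax decoding: deterministic, always picks the most likely token.
greedy = probs.index(max(probs))

# Sampling: draws from the distribution, introducing randomness.
sampled = random.choices(range(len(probs)), weights=probs, k=1)[0]
```

Lowering the temperature concentrates probability on the top token, so sampled output approaches the deterministic argmax choice.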

Transformer Architecture

The transformer architecture, introduced in 2017 in "Attention Is All You Need," revolutionized NLP by enabling parallel processing of sequences and direct modeling of long-range dependencies.

Self-Attention Mechanism

The core innovation allowing tokens to directly attend to (weight) other tokens regardless of distance.

  • Queries, Keys, Values: Each token produces three vectors; similarity between queries and keys determines attention weights
  • Masked attention: During language generation, attention is masked to prevent attending to future tokens
  • Interpretability: Attention weights can sometimes reveal which parts of input the model focused on
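A minimal causal self-attention sketch over pre-computed query/key/value vectors. In a real model, Q, K, and V come from learned linear projections of token embeddings; here they are supplied directly to keep the mask and weighting logic visible.

```python
import math

def causal_attention(queries, keys, values):
    """Masked scaled dot-product attention: token t may only attend
    to positions s <= t (the causal mask)."""
    d = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores only against current and earlier keys: masking is
        # implemented simply by never scoring future positions.
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]   # attention weights sum to 1
        outputs.append([sum(w * values[s][j] for s, w in enumerate(weights))
                        for j in range(len(values[0]))])
    return outputs

q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx = causal_attention(q, k, v)
# The first token can only attend to itself, so ctx[0] == v[0].
```

The `weights` list for each token is exactly the attention pattern that interpretability work inspects.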

Multi-Head Attention

Instead of a single attention operation, multiple parallel attention heads are used, each learning different attention patterns.

  • Ensemble effect: Different heads attend to different aspects (e.g., one head to pronouns, another to verbs)
  • Computational scaling: Multiple heads increase computation but improve model expressiveness
  • Head specialization: Research shows different heads learn semantically meaningful patterns
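Mechanically, multi-head attention splits each token's vector into per-head slices, runs attention independently on each slice, and concatenates the results. A sketch of just the split/merge bookkeeping (the attention step itself is as in the previous section):

```python
def split_heads(tokens, num_heads):
    """Split each token's d_model-dim vector into num_heads slices of
    size d_model // num_heads; each slice is attended over separately."""
    d_model = len(tokens[0])
    assert d_model % num_heads == 0
    head_dim = d_model // num_heads
    return [[tok[h * head_dim:(h + 1) * head_dim] for tok in tokens]
            for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate each token's per-head outputs back to d_model dims."""
    return [sum((head[t] for head in heads), []) for t in range(len(heads[0]))]

seq = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
heads = split_heads(seq, num_heads=2)   # two heads, two dims each
# merge_heads(split_heads(x, h)) round-trips back to x.
```

Because the total dimensionality is unchanged, the added cost of multiple heads is modest relative to the expressiveness gained.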

Positional Encoding

Since transformers process all tokens in parallel (unlike sequential RNNs), they require explicit position information.

  • Absolute positional encodings: Fixed encodings based on token position
  • Relative positional encodings: Encode relative distances between tokens
  • Rotary embeddings (RoPE): Modern approach using rotation matrices
  • ALiBi (Attention with Linear Biases): Simpler approach adding position-dependent bias to attention
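The original absolute scheme, sinusoidal encoding from the 2017 transformer paper, can be written directly from its formula: even dimensions use sine, odd dimensions cosine, with geometrically increasing wavelengths.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Absolute positional encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

pe0 = sinusoidal_encoding(0, 8)
# Position 0 alternates sin(0)=0 and cos(0)=1 across dimensions.
```

These vectors are added to token embeddings before the first layer, giving every position a unique, smoothly varying signature.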

Feedforward Networks

After attention, each token passes through a feedforward network (typically two dense layers with activation).

  • Expansion and contraction: Usually expands to 4x hidden dimension then contracts back
  • Token-independent processing: Feedforward is applied identically to each token position
  • Computational bulk: Feedforward layers contain the majority of model parameters
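The expand-then-contract pattern can be sketched as two dense layers with a ReLU in between (real models typically use GELU or gated variants, and of course learned rather than random weights):

```python
import random

def feedforward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand to d_ff, apply ReLU, contract to d_model.
    Applied identically and independently at every token position."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    return [sum(hi * w for hi, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]

d_model, d_ff = 4, 16          # the common 4x expansion ratio
random.seed(0)                 # random weights, for illustration only
w1 = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
b1 = [0.0] * d_ff
w2 = [[random.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
b2 = [0.0] * d_model
y = feedforward([1.0, 2.0, -1.0, 0.5], w1, b1, w2, b2)
```

The two weight matrices together hold roughly 8 × d_model² parameters per layer, which is why feedforward blocks dominate the parameter count.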

Layer Normalization

Applied to stabilize training and improve convergence.

  • Pre-normalization vs. Post-normalization: Whether to normalize before or after attention/feedforward
  • RMSNorm: Simplified normalization used in modern models such as Llama and T5 (GPT-3 uses standard LayerNorm)
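RMSNorm is simple enough to show in full: unlike LayerNorm, it skips mean subtraction and the bias term, rescaling activations by their root mean square.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: x_i -> gain_i * x_i / RMS(x), where
    RMS(x) = sqrt(mean(x^2) + eps). No mean-centering, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

y = rms_norm([3.0, 4.0], [1.0, 1.0])
# After normalization with unit gain, the mean of squares is ~1.
```

Dropping the mean statistic saves a reduction pass and, empirically, costs little to no quality, which is why many recent models adopt it.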

Tokenization

Tokenization converts raw text into tokens and vice versa—a crucial but often overlooked component.

Byte-Pair Encoding (BPE)

Iteratively merges most frequent byte/character pairs into new tokens, building vocabulary bottom-up.

  • Simplicity: Elegant algorithm, easy to implement
  • Language-agnostic: Works on any text, including code and non-Latin scripts
  • Used by: GPT series, most modern LLMs
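The merge loop at the heart of BPE fits in a short function. This sketch learns merges from a toy three-word corpus; production tokenizers add byte-level fallbacks, pre-tokenization, and frequency weighting, but the core algorithm is the same.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent
    symbol pair across the corpus into a single new symbol."""
    corpus = [list(w) for w in words]     # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        for symbols in corpus:            # apply the merge everywhere
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == (a, b):
                    symbols[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

merges = learn_bpe(["low", "lower", "lowest"], 2)
# First 'l'+'o' merge, then 'lo'+'w': the shared stem "low" emerges.
```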

SentencePiece

A tokenization library (supporting BPE and unigram models) that operates on raw text, handling spaces and special characters directly.

  • Space handling: Treats spaces as tokens, enabling reconstruction without ambiguity
  • Used by: T5, recent language models

Limitations

Tokenization can create unexpected behaviors:

  • Representation bias: Frequent words become single tokens while rare words fragment into many pieces
  • Multilingual challenges: Unequal efficiency across languages
  • Arithmetic errors: Models sometimes struggle with arithmetic partly due to token-level representation

Training: Pre-training vs. Fine-tuning

Pre-training

Models train on massive, unlabeled text corpora (trillions of tokens) with a self-supervised objective: predict the next token.

  • Objective: Minimize cross-entropy loss between predicted and actual next token
  • Data scale: Unprecedented amounts (Common Crawl, Book Corpus, Wikipedia, etc.)
  • Duration: Weeks to months on clusters of thousands of GPUs/TPUs
  • Cost: Estimated in the tens to hundreds of millions of dollars for frontier models
  • Emergence: Learned representations contain rich linguistic and world knowledge
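The cross-entropy objective in the first bullet reduces, per position, to the negative log-probability the model assigned to the actual next token. A minimal sketch using a log-sum-exp for stability:

```python
import math

def next_token_loss(logits, target_index):
    """Cross-entropy for one next-token prediction:
    -log softmax(logits)[target_index]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]

# Confident, correct prediction -> low loss; wrong prediction -> high loss.
good = next_token_loss([5.0, 0.0, 0.0], target_index=0)
bad = next_token_loss([5.0, 0.0, 0.0], target_index=1)
```

Pre-training simply minimizes the average of this quantity over trillions of token positions.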

Fine-tuning

Pre-trained models adapt to specific tasks through training on smaller, task-specific labeled datasets.

  • Instruction fine-tuning: Training on examples of instructions and correct responses
  • Smaller datasets: Often tens of thousands of examples suffice
  • Rapid convergence: Fine-tuning typically takes days or hours
  • Low cost: Computationally cheap relative to pre-training

Reinforcement Learning from Human Feedback (RLHF)

An additional training phase using human judgments to improve model behavior.

  1. Reward model: Train a separate model to predict human preference between outputs
  2. Policy optimization: Use reinforcement learning to update the language model to maximize predicted rewards
  3. Alignment: Helps ensure model outputs align with human values and intentions
  4. Trade-offs: Can reduce factuality or diversity in pursuit of “alignment”
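Step 1, the reward model, is commonly trained with a Bradley-Terry style pairwise loss: maximize the probability that the human-preferred output scores higher. A sketch of that loss (the reward values here are placeholders; in practice they come from a learned model scoring full responses):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Lower when the preferred output's reward exceeds the rejected one's."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# A larger margin in favor of the preferred output means lower loss.
confident = preference_loss(2.0, 0.0)
uncertain = preference_loss(0.5, 0.0)
```

The trained reward model then supplies the scalar signal that the policy-optimization step (commonly PPO) maximizes.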

Scaling Laws

Empirical research reveals predictable relationships between model size, data size, and performance.

Parameters and Performance

Larger models (more parameters) consistently perform better across tasks, following power-law scaling.

  • Chinchilla scaling: For a fixed compute budget, parameters and training tokens should scale together, roughly 20 tokens per parameter
  • No plateau observed: Performance continues improving even at 100B+ parameters
  • Downstream scaling: Better pre-training translates to better downstream performance

Data and Performance

More training data improves performance, also following power laws.

  • Data efficiency: Larger models learn more per example
  • Irreversibility: Once trained on certain data, models cannot easily "unlearn" it; machine unlearning remains an open problem
  • Data diversity: Mixture of domains matters; diversity improves generalization

Compute-Optimal Training

Given fixed compute budget, optimal allocation balances:

  • Model size: How many parameters to train
  • Data size: How many tokens to train on
  • Training duration: How long to train (modern LLMs often see most of their data only once)

Modern consensus, following Chinchilla: scale parameters and training tokens in roughly equal proportion as compute grows.
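These rules of thumb can be turned into a back-of-the-envelope sizing calculation. The sketch below assumes two widely cited approximations: training compute ≈ 6 × N × D FLOPs, and the Chinchilla ratio of about 20 training tokens per parameter; actual optimal ratios vary with data quality and architecture.

```python
def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Given a compute budget C ~ 6 * N * D and the constraint
    D ~ tokens_per_param * N, solve for parameters N and tokens D."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A roughly Chinchilla-scale budget of ~5.8e23 FLOPs suggests
# ~70B parameters trained on ~1.4T tokens.
n, d = chinchilla_allocation(5.76e23)
```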

Model Compression

As LLMs become larger, compression techniques reduce size for deployment.

Quantization

Representing weights with fewer bits than float32.

  • INT8 quantization: 4x memory reduction with minimal performance loss
  • INT4 and below: Aggressive quantization with noticeable degradation
  • Calibration: Critical step measuring statistics for scale/zero-point selection
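A minimal symmetric INT8 scheme makes the calibration bullet concrete: the "calibration" here is just measuring the maximum absolute weight to choose the scale (real systems calibrate per-channel or per-group and often use activation statistics too).

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: scale = max|w| / 127, then round
    each weight to an integer in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return [qi * scale for qi in q]

w = [0.1, -0.5, 0.25, 1.0]
q, scale = quantize_int8(w)
recovered = dequantize(q, scale)
# Each recovered weight is within one quantization step of the original.
```

Storing one byte per weight instead of four gives the 4x memory reduction; the error is bounded by the step size, which is why well-calibrated INT8 loses little accuracy.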

Pruning

Removing less-important parameters or connections.

  • Unstructured pruning: Remove individual weights
  • Structured pruning: Remove entire channels, heads, or layers
  • Iterative pruning: Prune during or after training
  • Trade-off: Reduces size but requires retraining to maintain accuracy

Distillation

Train a smaller “student” model to imitate a larger “teacher” model.

  • Knowledge transfer: Student learns patterns from teacher’s outputs
  • Efficiency: Smaller models run faster and use less memory
  • Quality loss: Student typically performs worse than teacher
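A common distillation objective is the KL divergence between the teacher's and student's softened output distributions, so the student matches the teacher's full distribution rather than only its top prediction. A sketch over single-position logits:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.
    A temperature > 1 exposes the teacher's 'dark knowledge' about
    relative probabilities of non-top tokens."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

loss_same = distillation_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
loss_diff = distillation_loss([0.0, 1.0, 2.0], [2.0, 1.0, 0.0])
# Zero loss when the student already matches the teacher exactly.
```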

Alignment and Safety

As LLMs become more capable, alignment—ensuring they behave according to human intentions—becomes critical.

Challenges

  • Specification gaming: Models find unintended ways to optimize objectives
  • Distributional shift: Models behave differently on out-of-distribution inputs
  • Adversarial robustness: Carefully crafted inputs can trigger harmful behavior
  • Value disagreement: Different humans want different behaviors

Approaches

  • RLHF: Using human feedback to steer behavior (see above)
  • Constitutional AI: Training with explicit principles to follow
  • Interpretability: Understanding model internals to identify failure modes
  • Monitoring and auditing: Red-teaming and testing for failure cases

Retrieval-Augmented Generation (RAG)

Instead of relying solely on pre-training knowledge, augment LLMs with external information.

Mechanism

  1. Query: Convert user input to embedding
  2. Retrieval: Search external knowledge base for relevant documents
  3. Augmentation: Prepend retrieved documents to input context
  4. Generation: Generate response conditioned on both user query and retrieved context
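Steps 1-3 above can be sketched end to end. The `embed` function here is a hypothetical stand-in (a bag-of-words count over a tiny fixed vocabulary); real systems use learned embedding models and vector databases, but the query-embed, rank-by-similarity, prepend-to-prompt flow is the same.

```python
import math

def embed(text):
    """Hypothetical embedding: bag-of-words over a toy vocabulary.
    A real system would call a learned embedding model here."""
    vocab = ["transformer", "attention", "paris", "capital", "france"]
    return [text.lower().split().count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "paris is the capital of france",
    "attention lets a transformer weigh distant tokens",
]

def retrieve(query, docs, k=1):
    """Steps 1-2: embed the query, rank documents by similarity."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

query = "what is the capital of france"
context = retrieve(query, documents)
# Step 3: augmentation -- prepend retrieved text to the prompt before
# generation (step 4, performed by the LLM).
prompt = f"Context: {context[0]}\n\nQuestion: {query}"
```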

Advantages

  • Factuality: Access to current information not in training data
  • Source transparency: Can show which documents informed the answer
  • Reduced hallucination: Grounding in external facts decreases false claims

Challenges

  • Retrieval quality: Poor retrieval degrades output
  • Context length: Retrieved documents consume token budget
  • Integration complexity: Requires embedding models, retrieval systems, document processing

Tools and Frameworks

Model Architecture

  • Transformers library (Hugging Face): Standard PyTorch implementations of models, simple API
  • JAX: High-performance numerical computing, used in large-scale training
  • PyTorch: Most common deep learning framework

Training

  • Megatron: NVIDIA framework for training massive models efficiently
  • DeepSpeed: Microsoft framework for distributed training with memory optimizations
  • Ray: Distributed computing for hyperparameter tuning and distributed training

Serving and Inference

  • vLLM: Efficient serving of LLMs with optimized attention and batching
  • TensorRT-LLM: NVIDIA’s optimization framework for inference
  • Ollama: Local model serving and management

Evaluation

  • MMLU: Multiple-choice questions across 57 subjects
  • HellaSwag: Commonsense reasoning benchmark
  • HumanEval: Code generation evaluation
  • TruthfulQA: Factuality assessment
  • Custom benchmarks: Task-specific evaluation sets

Current Research Frontiers

Mixture of Experts (MoE)

Instead of computing all parameters for every input, route each input to a sparse set of expert sub-networks.

  • Scaling efficiency: Can increase model size without proportional compute cost
  • Routing mechanisms: How to assign inputs to experts; learned routing is challenging
  • Load balancing: Preventing some experts from becoming overloaded
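The routing step can be sketched as a softmax over per-expert router scores followed by top-k selection and renormalization (a simplified view; production routers add noise, capacity limits, and auxiliary load-balancing losses):

```python
import math

def top_k_routing(router_logits, k=2):
    """Select the k highest-probability experts and renormalize their
    weights. Only the chosen experts run a forward pass, so compute
    stays roughly constant as the expert count grows."""
    m = max(router_logits)
    exps = [math.exp(l - m) for l in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i],
                    reverse=True)[:k]
    weight_sum = sum(probs[i] for i in chosen)
    return {i: probs[i] / weight_sum for i in chosen}

routes = top_k_routing([2.0, -1.0, 0.5, 1.5], k=2)
# Experts 0 and 3 are selected; their renormalized weights sum to 1.
```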

Multimodal Models

Extending LLMs to process and generate images, audio, video alongside text.

  • Vision-language models: CLIP-style contrastive training or instruction-tuned image-to-text
  • Audio and music: Emerging multimodal models handling speech and music
  • Unified representations: Learning shared embedding spaces across modalities

Long Context Windows

Modern models support increasingly long context (100K+ tokens, versus the 1-2K of early GPT-era models).

  • ALiBi and RoPE: Position encoding approaches that extrapolate better to longer lengths
  • Sparse attention: Reduce quadratic attention cost with structured sparsity
  • Recurrent transformers: Process long sequences without dense attention
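ALiBi's position-dependent bias is simple enough to show: instead of positional embeddings, a penalty proportional to the query-key distance is added to each attention score (here with a single illustrative slope; real models use a different fixed slope per head).

```python
def alibi_bias(seq_len, slope=0.5):
    """Causal ALiBi biases: bias[i][j] = -slope * (i - j) is added to
    token i's attention score for position j (j <= i). Closer tokens
    receive smaller penalties."""
    return [[-slope * (i - j) for j in range(i + 1)]
            for i in range(seq_len)]

bias = alibi_bias(4)
# For token 3: penalties grow linearly with distance, reaching 0 at
# its own position, which is what lets ALiBi extrapolate to lengths
# longer than those seen in training.
```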

Interpretability and Mechanistic Understanding

Understanding how LLMs actually compute and make decisions.

  • Feature visualization: What patterns do individual neurons/components detect?
  • Causal interventions: Ablating components to determine their function
  • Representation analysis: How is information encoded in activations?
  • Circuit analysis: How do components mechanistically interact?

Efficient Training and Inference

Reducing computational and energy costs.

  • Sparse training: Train with sparsity from start rather than pruning after
  • Bit-width reduction: Training with low-precision arithmetic
  • Adaptive computation: Use variable-length processing per token
  • Federated learning: Training on distributed data without centralization