Notes on Probability — Peter J. Cameron

University lecture notes for MAS108 (Probability I) at Queen Mary, University of London, December 2000. A rigorous first-semester course building probability from Kolmogorov’s axioms through random variables, standard distributions, and joint distributions.

Key Ideas

1. Axiomatic Probability (Chapter 1)

Probability is built from three axioms (Kolmogorov, 1933):

  1. P(A) ≥ 0 for any event A
  2. P(S) = 1 where S is the entire sample space
  3. P(A ∪ B) = P(A) + P(B) when A and B are disjoint (additivity)

Everything else — complements, inclusion-exclusion, independence — is derived from these three. This axiomatic approach sidesteps the philosophical question “What is probability?” and instead says: here are the rules; let’s see what follows.

The inclusion-exclusion principle generalizes additivity to overlapping events: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
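Inclusion-exclusion can be checked mechanically on any small equiprobable sample space. A minimal sketch in Python (the die-roll events here are my own illustrative choice, not from the notes):

```python
# Inclusion-exclusion on an equiprobable sample space: a single die roll.
# A = "even", B = "at least 4".
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {4, 5, 6}

def P(event):
    """Probability of an event under the equiprobable measure on S."""
    return Fraction(len(event), len(S))

lhs = P(A | B)                      # P(A ∪ B)
rhs = P(A) + P(B) - P(A & B)        # P(A) + P(B) − P(A ∩ B)
assert lhs == rhs == Fraction(2, 3)
```

Using `Fraction` keeps every probability exact, so the identity holds with `==` rather than a floating-point tolerance.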

2. Sampling and Combinatorics (Chapter 1)

Four fundamental sampling modes:

|                     | Ordered    | Unordered        |
|---------------------|------------|------------------|
| With replacement    | nᵏ         | n+k−1 choose k   |
| Without replacement | n!/(n−k)!  | n choose k       |

The binomial coefficient n choose k = n! / (k!(n-k)!) is the workhorse of discrete probability. Cameron emphasizes that choosing the right sampling model is often the hardest part of a problem.
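The four counting formulas can be evaluated directly with the standard library; a quick sketch (the values n = 5, k = 3 are arbitrary):

```python
import math

n, k = 5, 3

counts = {
    ("ordered", "with replacement"):      n ** k,                   # n^k
    ("ordered", "without replacement"):   math.perm(n, k),          # n!/(n−k)!
    ("unordered", "with replacement"):    math.comb(n + k - 1, k),  # n+k−1 choose k
    ("unordered", "without replacement"): math.comb(n, k),          # n choose k
}
# For n = 5, k = 3: 125, 60, 35, 10 respectively.
```

`math.perm` and `math.comb` (Python ≥ 3.8) avoid the overflow and rounding issues of computing factorials by hand.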

3. Independence (Chapter 1)

Two events A and B are independent if P(A ∩ B) = P(A) · P(B). This is the definition — not “one has no influence on the other.” Cameron warns: independence is surprisingly hard to detect by inspection, and should only be assumed when explicitly given or when dealing with independent physical processes (separate coin tosses, die rolls).

Mutual independence of n events requires all 2ⁿ − n − 1 subset conditions to hold, not just pairwise independence.
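The gap between pairwise and mutual independence can be exhibited with the classic two-coin example (my choice of illustration, not Cameron's): each pair of events passes the product test, but the triple fails.

```python
from fractions import Fraction
from itertools import product

S = list(product("HT", repeat=2))       # four equally likely outcomes
A = {s for s in S if s[0] == "H"}       # first coin heads
B = {s for s in S if s[1] == "H"}       # second coin heads
C = {s for s in S if s[0] == s[1]}      # both coins match

def P(E):
    return Fraction(len(E), len(S))

# Pairwise independence holds...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)
# ...but the triple condition fails: P(A ∩ B ∩ C) = 1/4, product = 1/8.
assert P(A & B & C) != P(A) * P(B) * P(C)
```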

4. Conditional Probability and Bayes’ Theorem (Chapter 2)

Conditional probability: P(A | E) = P(A ∩ E) / P(E)

Theorem of Total Probability: If B₁, …, Bₙ partition S, then P(A) = Σ P(A | Bᵢ) · P(Bᵢ)

Bayes’ Theorem:

P(A | B) = P(B | A) · P(A) / P(B)

The clinical test example is the most striking illustration: a test that is 99% sensitive and 95% specific, applied to a disease with 0.1% prevalence, yields only a 1.94% chance the patient is actually a carrier given a positive result. This is because the vast majority of positives come from the 99.9% of non-carriers who have a 5% false positive rate. The base rate dominates.
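The clinical test numbers follow from one application of total probability and one of Bayes' Theorem; a sketch using the figures above:

```python
prevalence  = 0.001   # P(carrier) — 0.1% base rate
sensitivity = 0.99    # P(positive | carrier)
specificity = 0.95    # P(negative | non-carrier)

# Total probability: P(+) = P(+|carrier)P(carrier) + P(+|non-carrier)P(non-carrier)
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes: P(carrier | +) = P(+|carrier) · P(carrier) / P(+)
p_carrier_given_pos = sensitivity * prevalence / p_pos
# ≈ 0.0194, i.e. about 1.94%
```

Almost all of the denominator `p_pos` comes from the false-positive term, which is why the posterior stays so low despite the 99% sensitivity.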

“There is a very big difference between P(A | B) and P(B | A).”

This connects directly to base rate neglect — one of the most common cognitive errors in behavioral psychology.

Birthday Paradox: Using iterated conditional probability, Cameron shows that with 23+ people in a room, the chance of a shared birthday exceeds 50%. The calculation uses the chain rule: P(all different) = (1 − 1/365)(1 − 2/365)···(1 − (n−1)/365).
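The chain-rule product is a three-line loop; a sketch (assuming 365 equally likely birthdays, as the notes do):

```python
def p_shared_birthday(n: int) -> float:
    """P(at least two of n people share a birthday), 365 equally likely days."""
    p_all_different = 1.0
    for i in range(1, n):
        p_all_different *= 1 - i / 365   # chain rule: (1 − 1/365)(1 − 2/365)···
    return 1 - p_all_different

# The crossing point: 22 people fall short of 50%, 23 exceed it.
assert p_shared_birthday(22) < 0.5 < p_shared_birthday(23)
```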

5. Discrete Random Variables (Chapter 3)

A random variable is a function from sample space to real numbers. Key quantities:

  • Expected value: E(X) = Σ xᵢ · P(X = xᵢ) — the “long-run average”
  • Variance: Var(X) = E(X²) − [E(X)]² — measures spread
  • Key property: For independent X, Y: E(XY) = E(X)·E(Y) and Var(X+Y) = Var(X) + Var(Y)
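Both formulas can be computed exactly from a pmf; a sketch for a fair die (my example, not from the notes):

```python
from fractions import Fraction

# Fair six-sided die: the pmf assigns 1/6 to each face.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

E   = sum(x * p for x, p in pmf.items())        # E(X)  = Σ xᵢ · P(X = xᵢ)
E2  = sum(x**2 * p for x, p in pmf.items())     # E(X²)
Var = E2 - E**2                                 # Var(X) = E(X²) − [E(X)]²

assert E == Fraction(7, 2)      # 3.5
assert Var == Fraction(35, 12)
```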

Standard discrete distributions:

| Distribution           | Parameters | E(X)  | Var(X)                        | Use case                        |
|------------------------|------------|-------|-------------------------------|---------------------------------|
| Bernoulli(p)           | p          | p     | p(1−p)                        | Single trial, success/failure   |
| Binomial(n,p)          | n, p       | np    | np(1−p)                       | Count of successes in n trials  |
| Hypergeometric(n,M,N)  | n, M, N    | nM/N  | (nM/N)(1−M/N)(N−n)/(N−1)      | Sampling without replacement    |
| Geometric(p)           | p          | 1/p   | (1−p)/p²                      | Trials until first success      |
| Poisson(λ)             | λ          | λ     | λ                             | Rare events in a fixed interval |
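The tabulated means and variances can be sanity-checked by summing over the pmf; a sketch for the binomial case (parameters n = 10, p = 0.3 are arbitrary):

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
var  = sum(k**2 * binom_pmf(k, n, p) for k in range(n + 1)) - mean**2

assert abs(mean - n * p) < 1e-9           # matches np
assert abs(var - n * p * (1 - p)) < 1e-9  # matches np(1−p)
```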

6. Continuous Random Variables (Chapter 3)

Continuous r.v.s use a probability density function (PDF) instead of a mass function. P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx.

Standard continuous distributions:

| Distribution    | Parameters | E(X)    | Var(X)    |
|-----------------|------------|---------|-----------|
| Uniform(a,b)    | a, b       | (a+b)/2 | (b−a)²/12 |
| Normal N(μ,σ²)  | μ, σ²      | μ       | σ²        |
| Exponential(λ)  | λ          | 1/λ     | 1/λ²      |

The normal distribution is the most important: the bell curve f(x) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²)). Any normal can be standardized to N(0,1) via Z = (X − μ)/σ.
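Standardization is how normal probabilities are computed in practice: convert to Z, then look up the standard normal CDF Φ, which the standard library expresses via the error function. A sketch (the helper names `phi` and `normal_prob` are my own):

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF Φ(z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def normal_prob(a: float, b: float, mu: float, sigma: float) -> float:
    """P(a ≤ X ≤ b) for X ~ N(mu, sigma²), by standardizing Z = (X − mu)/sigma."""
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

# About 68.27% of the mass lies within one standard deviation of the mean.
assert abs(normal_prob(-1, 1, 0, 1) - 0.6827) < 1e-3
```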

7. Joint Distributions, Covariance, Correlation (Chapter 4)

  • Covariance: Cov(X,Y) = E(XY) − E(X)·E(Y) — measures linear association
  • Correlation: corr(X,Y) = Cov(X,Y) / √(Var(X)·Var(Y)) — normalized to [−1, 1]
  • Independent r.v.s have Cov = 0 (but Cov = 0 does not imply independence)
  • Conditional expectation: E(X | Y = y) — the expected value of X given knowledge of Y
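Covariance and correlation fall straight out of a joint pmf; a sketch on a small hypothetical joint distribution (the numbers are mine, chosen only so the marginals are easy to check):

```python
import math

# Joint pmf for (X, Y) on {0,1}×{0,1}; probabilities sum to 1.
joint = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}

EX   = sum(x * p for (x, y), p in joint.items())
EY   = sum(y * p for (x, y), p in joint.items())
EXY  = sum(x * y * p for (x, y), p in joint.items())
VarX = sum(x**2 * p for (x, y), p in joint.items()) - EX**2
VarY = sum(y**2 * p for (x, y), p in joint.items()) - EY**2

cov  = EXY - EX * EY                      # Cov(X,Y) = E(XY) − E(X)E(Y)
corr = cov / math.sqrt(VarX * VarY)       # normalized to [−1, 1]
# Here cov = 0.05 > 0, so X and Y are positively associated (and not independent).
```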

Historical Note

Cameron traces probability’s origins to the Fermat-Pascal correspondence (mid-17th century) — two mathematicians solving the “Problem of Points” (how to fairly divide stakes in an interrupted game of chance). This is exactly the historical lineage Munger cites when listing “elementary math of permutations and combinations” as one of his most important mental models.

Connections

  • probability-theory — The concept page synthesizing probability across sources
  • bayes-theorem — The most practically important theorem in the notes; connects to decision-making under uncertainty
  • mental-models — Munger explicitly lists Fermat/Pascal probability as a foundational model
  • behavioral-psychology — Base rate neglect (the clinical test example) is one of the key cognitive biases
  • anchoring-bias — Base rate neglect is a form of anchoring on the test result rather than the prior probability
  • judgment — Bayesian reasoning is the mathematical framework for good judgment under uncertainty
  • large-language-models — Modern LLMs implicitly encode probabilistic reasoning; probability distributions are fundamental to their architecture

Source Details

  • Author: Peter J. Cameron
  • Institution: Queen Mary, University of London
  • Course: MAS108, Probability I
  • Date: December 2000
  • Pages: 94
  • Raw file: raw/probability.pdf