VC Dimension and the Fundamental Theorem of Statistical Learning — from Scratch

117 min read Original article ↗

May 2026

This post answers a single question: when does learning from data actually work?

You train a model on samples, it performs well on those samples, and you hope it performs well on new data. When is that hope justified? The answer turns out to be a clean equivalence: a hypothesis class is learnable if and only if it has finite VC dimension. This is the Fundamental Theorem of Statistical Learning.

We'll build every piece from scratch: Markov's inequality, Hoeffding's lemma, concentration inequalities, the union bound, VC dimension, the Sauer–Shelah lemma, symmetrization, and the generalization bounds that tie it all together. Nothing is assumed beyond basic probability (expectations, independence, indicator random variables). We prove both directions of the equivalence: the upper bound (finite VC dimension implies learnability) and the lower bound (via information theory: total variation, KL divergence, Pinsker's inequality, and Le Cam's method).

This is Part 1 of two. Part 2 continues the story with Rademacher complexity — a probabilistic refinement of VC dimension that gives tighter, data-dependent bounds — and shows how all the pieces connect in one beautiful chain.

1. The Setup: What Does “Learning” Mean?

We work in the simplest interesting setting: binary classification.

Definition (The Learning Problem)

  • Instance space \(\mathcal{X}\): the set of all possible inputs (images, documents, feature vectors, etc.).
  • Label space \(\mathcal{Y} = \{0, 1\}\): binary labels.
  • Distribution \(\mathcal{D}\): an unknown probability distribution over \(\mathcal{X} \times \mathcal{Y}\). Nature draws \((x, y) \sim \mathcal{D}\).
  • Hypothesis class \(\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}\): a fixed set of classifiers \(h : \mathcal{X} \to \{0,1\}\) that the learner is allowed to pick from.
  • Training sample \(S = ((x_1, y_1), \ldots, (x_m, y_m))\): drawn i.i.d. from \(\mathcal{D}\).

Definition (True Risk and Empirical Risk) The true risk (or population loss) of a hypothesis \(h\) is $$ L_{\mathcal{D}}(h) = \Pr_{(x,y)\sim\mathcal{D}}[h(x) \neq y]. $$ The empirical risk (or training error) on a sample \(S\) of size \(m\) is $$ L_S(h) = \frac{1}{m}\sum_{i=1}^m \mathbf{1}[h(x_i) \neq y_i]. $$

Example (Spam Classification) Suppose \(\mathcal{X}\) = all possible emails and \(\mathcal{Y} = \{0, 1\}\) where 1 = spam. Nature has some distribution \(\mathcal{D}\) over (email, label) pairs — this captures both which emails show up and how they're labeled. We don't know \(\mathcal{D}\); we only see a training set of \(m\) labeled emails.

The true risk \(L_{\mathcal{D}}(h)\) is the probability that classifier \(h\) misclassifies a random future email. We never know this exactly — it involves all emails that could ever arrive.

The empirical risk \(L_S(h)\) is the fraction of training emails that \(h\) gets wrong. This we can compute.

The fundamental question: when does small \(L_S(h)\) guarantee small \(L_{\mathcal{D}}(h)\)?

The learner sees only \(S\) and wants to find \(h \in \mathcal{H}\) with small \(L_{\mathcal{D}}(h)\). The most natural strategy is Empirical Risk Minimization (ERM): pick any \(h \in \mathcal{H}\) that minimizes \(L_S(h)\). In plain English: choose the hypothesis that makes the fewest mistakes on the training data.

But ERM has an obvious failure mode: overfitting. If \(\mathcal{H}\) is expressive enough, some \(h \in \mathcal{H}\) will have \(L_S(h) = 0\) purely by memorizing the training data, while having terrible true risk. So we need to understand when ERM is trustworthy.

This leads to two questions that the rest of the post will answer:

  1. When is learning possible at all? Under what conditions on \(\mathcal{H}\) can any algorithm generalize?
  2. When does ERM specifically work? When does the training error track the true error simultaneously for every hypothesis?

These questions are formalized by two definitions: PAC learnability (question 1) and uniform convergence (question 2). Let's build up to each one.

What Should “Learnable” Mean?

We want a definition that captures: "given enough data, the learner will probably do well." There are three things to get right:

  • How well? The learner's error should be within \(\epsilon\) of the best possible error in \(\mathcal{H}\). We use \(\epsilon\) ("accuracy parameter") because we can't demand perfection — with finite data, there's always some approximation error.
  • How probably? Since \(S\) is random, the learner might get unlucky (e.g., all training emails are spam). We allow a failure probability \(\delta\) ("confidence parameter").
  • Against what? The guarantee must hold for every distribution \(\mathcal{D}\). We don't know how nature generates data, so we can't assume anything about it.

Definition (PAC Learning) A hypothesis class \(\mathcal{H}\) is PAC learnable (Probably Approximately Correct) if there exists a learning algorithm \(A\) and a function \(m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N}\) such that for every \(\epsilon, \delta \in (0,1)\) and every distribution \(\mathcal{D}\) over \(\mathcal{X} \times \{0,1\}\): if \(m \geq m_{\mathcal{H}}(\epsilon, \delta)\), then with probability at least \(1 - \delta\) over the draw of \(S \sim \mathcal{D}^m\), $$ L_{\mathcal{D}}(A(S)) \leq \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon. $$ The function \(m_{\mathcal{H}}\) is called the sample complexity.

Let's unpack this piece by piece:

  • \(A(S)\) is the hypothesis that the algorithm outputs after seeing training data \(S\).
  • \(\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)\) is the error of the best hypothesis in the class. Call this \(h^*\). We can't beat \(h^*\) (we're restricted to picking from \(\mathcal{H}\)), so we aim to be within \(\epsilon\) of it.
  • The "\(\text{for every } \mathcal{D}\)" part is crucial: the same algorithm must work regardless of how data is distributed. This is what makes PAC learning distribution-free.
  • \(m_{\mathcal{H}}(\epsilon, \delta)\) tells us how much data we need. Smaller \(\epsilon\) or \(\delta\) require more data.

This is the agnostic PAC model: we don't assume any \(h \in \mathcal{H}\) is perfect (i.e., we don't assume \(\min_h L_{\mathcal{D}}(h) = 0\)). Real-world problems are almost always agnostic — no simple model perfectly captures spam, disease, or fraud.

Example (PAC Learning in Practice) Suppose \(\mathcal{H}\) = linear classifiers in \(\mathbb{R}^2\) and we want:

  • \(\epsilon = 0.05\): the learner's error is within 5% of the best linear classifier.
  • \(\delta = 0.01\): this guarantee holds 99% of the time.

PAC learnability means there's a magic number \(m_{\mathcal{H}}(0.05, 0.01)\) such that if we collect at least that many labeled examples, any distribution over \(\mathbb{R}^2 \times \{0,1\}\) will produce a training set from which a good classifier can be found.

If no such \(m\) exists (i.e., for any sample size, some adversarial distribution defeats the learner), then \(\mathcal{H}\) is not PAC learnable.

Uniform Convergence: A Sufficient Condition for ERM

PAC learnability asks whether some algorithm works. But we'd like to know when the simplest algorithm — ERM — works. For ERM to succeed, we need something stronger: the training error must be a reliable estimate of the true error, not just for the hypothesis ERM picks, but for every hypothesis simultaneously.

Why "simultaneously"? Because ERM searches over \(\mathcal{H}\) to find the hypothesis with the lowest training error. If training error is only accurate for a few hypotheses but wildly off for others, ERM might pick one of the "wildly off" ones — a hypothesis that looks great on training data but is actually terrible.

Definition (Uniform Convergence) \(\mathcal{H}\) has the uniform convergence property if for every \(\epsilon, \delta > 0\), there exists \(m_0\) such that for all \(m \geq m_0\) and all distributions \(\mathcal{D}\), $$ \Pr_S\!\left[\sup_{h \in \mathcal{H}} |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right] < \delta. $$

The \(\sup\) over \(\mathcal{H}\) is doing the heavy lifting: it says the worst-case gap between training and true error, over all hypotheses, is small. Compare this to pointwise convergence, where we'd only require \(|L_S(h) - L_{\mathcal{D}}(h)| < \epsilon\) for each individual \(h\) separately.

Example (Uniform vs. Pointwise) Imagine \(\mathcal{H} = \{h_1, h_2\}\) with two hypotheses, and you have \(m = 100\) training points.

  • Pointwise: \(L_S(h_1)\) is close to \(L_{\mathcal{D}}(h_1)\), AND separately \(L_S(h_2)\) is close to \(L_{\mathcal{D}}(h_2)\). Each guarantee might hold with probability 0.99.
  • Uniform: BOTH gaps are small at the same time, with probability 0.99. This is stronger because bad luck on \(h_1\) and bad luck on \(h_2\) might coincide.

With two hypotheses, the difference is minor (union bound costs a factor of 2). But when \(\mathcal{H}\) is infinite, pointwise convergence is essentially free (law of large numbers) while uniform convergence may or may not hold — and it's uniform convergence that determines whether ERM works.

Why Uniform Convergence Makes ERM Work

The punchline is a short chain of three inequalities. But to see why each one is true, let's first walk through a concrete scenario, then state the general argument.

Concrete Scenario Suppose \(\mathcal{H} = \{h_1, h_2, h_3\}\) and the true risks (which we can't see) are:

  • \(L_{\mathcal{D}}(h_1) = 0.20\) — this is the best hypothesis (\(h^* = h_1\)).
  • \(L_{\mathcal{D}}(h_2) = 0.35\).
  • \(L_{\mathcal{D}}(h_3) = 0.50\).

Now suppose uniform convergence holds with \(\epsilon = 0.03\). This means every hypothesis's training error is within 0.03 of its true error. One possible training set might give:

  • \(L_S(h_1) = 0.22\) (true: 0.20, gap: 0.02 \(\leq\) 0.03 \(\checkmark\))
  • \(L_S(h_2) = 0.33\) (true: 0.35, gap: 0.02 \(\leq\) 0.03 \(\checkmark\))
  • \(L_S(h_3) = 0.48\) (true: 0.50, gap: 0.02 \(\leq\) 0.03 \(\checkmark\))

ERM picks \(h_1\) (lowest training error: 0.22). The true error of ERM's pick is \(L_{\mathcal{D}}(h_1) = 0.20\). The best possible true error is also 0.20. ERM nailed it.

But what if the training errors shift around? Uniform convergence allows gaps up to \(\epsilon = 0.03\), so another training set might give:

  • \(L_S(h_1) = 0.23\)
  • \(L_S(h_2) = 0.32\)
  • \(L_S(h_3) = 0.52\)

Now ERM picks \(h_1\) again (0.23 is still the lowest). True error of the pick is still 0.20. Fine.

Worst case: The gaps conspire maximally against us:

  • \(L_S(h_1) = 0.23\) (true 0.20, shifted UP by 0.03)
  • \(L_S(h_2) = 0.32\) (true 0.35, shifted DOWN by 0.03)
  • \(L_S(h_3) = 0.47\) (true 0.50, shifted DOWN by 0.03)

Now ERM picks \(h_1\) (0.23 vs 0.32 vs 0.47 — still the smallest). Even in this worst case, ERM picks the right answer. But could it pick the wrong one?

Near-worst case with a different class: Suppose \(L_{\mathcal{D}}(h_1) = 0.20\) and \(L_{\mathcal{D}}(h_2) = 0.22\). With \(\epsilon = 0.03\):

  • \(L_S(h_1) = 0.23\) (shifted up by 0.03)
  • \(L_S(h_2) = 0.19\) (shifted down by 0.03)

ERM picks \(h_2\) (training error 0.19 < 0.23). The true error of ERM's pick is 0.22, while optimal is 0.20. The gap is 0.02, which is less than \(2\epsilon = 0.06\). ERM picked the "wrong" hypothesis, but its true error is still close to optimal.

The example shows the pattern: uniform convergence can mislead ERM into picking a suboptimal hypothesis, but it can't mislead it badly. The worst that can happen is a \(2\epsilon\) gap. Here's why, step by step:

Uniform Convergence \(\Rightarrow\) ERM Works (The Three-Step Argument)

Assume uniform convergence: \(|L_S(h) - L_{\mathcal{D}}(h)| \leq \epsilon\) for every \(h \in \mathcal{H}\).

Let \(\hat{h}\) be the hypothesis ERM picks (minimizes training error), and let \(h^*\) be the truly best hypothesis (minimizes true error). We want to show \(L_{\mathcal{D}}(\hat{h})\) is close to \(L_{\mathcal{D}}(h^*)\).

Step 1. \(L_{\mathcal{D}}(\hat{h}) \leq L_S(\hat{h}) + \epsilon\).

Why: Uniform convergence says training error is within \(\epsilon\) of true error for every hypothesis, including \(\hat{h}\). So \(L_{\mathcal{D}}(\hat{h}) - L_S(\hat{h}) \leq \epsilon\), which rearranges to give Step 1. (The true error of ERM's pick can't be much worse than its training error.)

Step 2. \(L_S(\hat{h}) \leq L_S(h^*)\).

Why: This is just the definition of ERM. ERM picks \(\hat{h}\) to have the lowest training error among all hypotheses. So its training error is at most that of any other hypothesis, including \(h^*\). (This step has nothing to do with uniform convergence — it's purely about what ERM does.)

Step 3. \(L_S(h^*) \leq L_{\mathcal{D}}(h^*) + \epsilon\).

Why: Uniform convergence again, this time applied to \(h^*\). The training error of \(h^*\) is within \(\epsilon\) of its true error. (The training error of the best hypothesis can't be much higher than its true error.)

Chaining Steps 1–3:

$$ \underbrace{L_{\mathcal{D}}(\hat{h})}_{\text{true error of ERM's pick}} \leq \underbrace{L_S(\hat{h})}_{\text{training error of ERM's pick}} + \epsilon \leq \underbrace{L_S(h^*)}_{\text{training error of best}} + \epsilon \leq \underbrace{L_{\mathcal{D}}(h^*)}_{\text{true error of best}} + 2\epsilon. $$

Verifying with Numbers From the near-worst-case example above: \(L_{\mathcal{D}}(h_1) = 0.20\), \(L_{\mathcal{D}}(h_2) = 0.22\), \(\epsilon = 0.03\). ERM picked \(\hat{h} = h_2\) with \(L_S(h_2) = 0.19\).

  • Step 1: \(L_{\mathcal{D}}(h_2) = 0.22 \leq 0.19 + 0.03 = 0.22\). \(\checkmark\)
  • Step 2: \(L_S(h_2) = 0.19 \leq L_S(h_1) = 0.23\). \(\checkmark\) (ERM picked \(h_2\) because 0.19 < 0.23.)
  • Step 3: \(L_S(h_1) = 0.23 \leq 0.20 + 0.03 = 0.23\). \(\checkmark\)

Chain: \(0.22 \leq 0.19 + 0.03 \leq 0.23 + 0.03 \leq 0.20 + 0.06\). That is, \(0.22 \leq 0.26\).

ERM's true error (0.22) is within \(2\epsilon = 0.06\) of optimal (0.20). The actual gap is 0.02, well within the bound. The bound is loose here, but it's a worst-case guarantee that holds no matter how the errors shift around.

In short: uniform convergence prevents ERM from being fooled. The training error might be slightly optimistic about ERM's pick (Step 1) and slightly pessimistic about the true best (Step 3), but these two distortions can each be at most \(\epsilon\), so ERM lands within \(2\epsilon\) of optimal.

The remarkable fact — which we'll prove — is that uniform convergence, PAC learnability, and finite VC dimension are all equivalent. The same condition governs whether learning is possible, whether ERM works, and whether the hypothesis class has bounded combinatorial complexity.

2. Markov’s Inequality

Before we can prove anything about generalization, we need tools to bound probabilities. The most basic is Markov's inequality.

Theorem (Markov’s Inequality) Let \(X \geq 0\) be a non-negative random variable with finite expectation. For any \(a > 0\), $$ \Pr(X \geq a) \leq \frac{\mathbb{E}[X]}{a}. $$

Proof Starting from the definition of expectation: $$ \mathbb{E}[X] = \int_0^{\infty} x \, f(x)\, dx \geq \int_a^{\infty} x \, f(x)\, dx \geq a \int_a^{\infty} f(x)\, dx = a \cdot \Pr(X \geq a). $$ Dividing both sides by \(a\) gives the result. \(\square\)

Example 1 Suppose a company's average salary is $60,000. What can we say about the fraction of employees earning at least $200,000?

Markov says: \(\Pr(\text{Salary} \geq 200{,}000) \leq 60{,}000 / 200{,}000 = 0.3.\) At most 30% of employees earn $200K+. The bound is loose but completely distribution-free.

Example 2 Let \(X\) be the number of heads in 100 fair coin flips. Then \(\mathbb{E}[X] = 50\), so \(\Pr(X \geq 90) \leq 50/90 \approx 0.556\). The true probability is about \(10^{-9}\) — Markov is very loose here because it ignores the shape of the distribution. We'll tighten it shortly.

Markov's inequality is weak because it uses only the mean. Its power comes from applying it to transformed random variables — this is the Chernoff bound idea that drives everything that follows.

3. Hoeffding’s Lemma

To get exponentially tight bounds, we need to control the moment generating function (MGF) of a bounded, zero-mean random variable. That's exactly what Hoeffding's lemma does.

Definition (Moment Generating Function) For a random variable \(X\), the moment generating function is \(M_X(s) = \mathbb{E}[e^{sX}]\), defined for all \(s\) where the expectation exists.

The MGF is useful because \(e^{sx}\) is monotone increasing, so bounding \(\mathbb{E}[e^{sX}]\) gives tail bounds via Markov's inequality applied to the non-negative random variable \(e^{sX}\).

Theorem (Hoeffding’s Lemma) Let \(X\) be a random variable with \(\mathbb{E}[X] = 0\) and \(a \leq X \leq b\). Then for all \(s > 0\), $$ \mathbb{E}[e^{sX}] \leq \exp\!\left(\frac{s^2(b-a)^2}{8}\right). $$

Proof

Step 1: Convexity bound. Since \(e^{sx}\) is convex and \(x \in [a, b]\), the function lies below the chord connecting \((a, e^{sa})\) and \((b, e^{sb})\):

$$ e^{sx} \leq \frac{b - x}{b - a}\, e^{sa} + \frac{x - a}{b - a}\, e^{sb}. $$

Step 2: Take expectations. Using \(\mathbb{E}[X] = 0\):

$$ \mathbb{E}[e^{sX}] \leq \frac{b}{b-a}\, e^{sa} + \frac{-a}{b-a}\, e^{sb}. $$

Step 3: Simplify with substitution. Let \(p = \frac{-a}{b-a}\) (so \(1 - p = \frac{b}{b-a}\)) and \(u = s(b-a)\). Then:

$$ \mathbb{E}[e^{sX}] \leq (1 - p)\, e^{-pu} + p\, e^{(1-p)u} = e^{-pu}\bigl(1 - p + p\, e^{u}\bigr). $$

Define \(\varphi(u) = -pu + \ln(1 - p + pe^{u})\), so the bound is \(\mathbb{E}[e^{sX}] \leq e^{\varphi(u)}\).

Step 4: Taylor bound on \(\varphi\). We compute:

  • \(\varphi(0) = 0\),
  • \(\varphi'(u) = -p + \frac{pe^{u}}{1 - p + pe^{u}}\), so \(\varphi'(0) = -p + p = 0\),
  • \(\varphi''(u) = \frac{pe^u(1-p)}{(1 - p + pe^u)^2}\).

Writing \(t = \frac{pe^u}{1-p+pe^u} \in (0,1)\), we get \(\varphi''(u) = t(1-t) \leq \frac{1}{4}\).

By Taylor's theorem with the Lagrange remainder: \(\varphi(u) = \varphi(0) + \varphi'(0)\,u + \frac{1}{2}\varphi''(c)\,u^2\) for some \(c\). Since \(\varphi''(c) \leq 1/4\):

$$ \varphi(u) \leq \frac{u^2}{8} = \frac{s^2(b-a)^2}{8}. \quad \square $$

Example Let \(X\) be uniform on \(\{-1, +1\}\). Then \(\mathbb{E}[X] = 0\), \(a = -1\), \(b = 1\). The lemma gives \(\mathbb{E}[e^{sX}] \leq e^{s^2/2}\). We can verify: \(\mathbb{E}[e^{sX}] = \cosh(s) = 1 + s^2/2 + s^4/24 + \cdots \leq e^{s^2/2}\) (since \(\cosh(s) \leq e^{s^2/2}\) for all \(s\)).

4. Hoeffding’s Inequality

Now we combine Markov's inequality and Hoeffding's lemma to get exponential tail bounds for sums of independent bounded random variables.

Theorem (Hoeffding’s Inequality) Let \(X_1, \ldots, X_n\) be independent random variables with \(X_i \in [a_i, b_i]\). Let \(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\). Then for any \(t > 0\), $$ \Pr\!\left(\bar{X} - \mathbb{E}[\bar{X}] \geq t\right) \leq \exp\!\left(\frac{-2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right). $$ By symmetry (applied to \(-X_i\)), the same bound holds for the left tail, so: $$ \Pr\!\left(|\bar{X} - \mathbb{E}[\bar{X}]| \geq t\right) \leq 2\exp\!\left(\frac{-2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right). $$

Proof

Let \(S_n = \sum_{i=1}^n X_i\). We bound \(\Pr(S_n - \mathbb{E}[S_n] \geq nt)\).

Step 1: Chernoff trick. For any \(s > 0\):

$$ \Pr(S_n - \mathbb{E}[S_n] \geq nt) = \Pr\!\left(e^{s(S_n - \mathbb{E}[S_n])} \geq e^{snt}\right) \leq e^{-snt}\, \mathbb{E}\!\left[e^{s(S_n - \mathbb{E}[S_n])}\right] $$

where the last step is Markov's inequality applied to the non-negative random variable \(e^{s(S_n - \mathbb{E}[S_n])}\).

Step 2: Factor by independence. Since \(X_1, \ldots, X_n\) are independent:

$$ \mathbb{E}\!\left[e^{s(S_n - \mathbb{E}[S_n])}\right] = \prod_{i=1}^n \mathbb{E}\!\left[e^{s(X_i - \mathbb{E}[X_i])}\right]. $$

Step 3: Apply Hoeffding’s lemma. Each \(X_i - \mathbb{E}[X_i]\) is zero-mean and bounded in an interval of width \(b_i - a_i\), so:

$$ \mathbb{E}\!\left[e^{s(X_i - \mathbb{E}[X_i])}\right] \leq \exp\!\left(\frac{s^2(b_i - a_i)^2}{8}\right). $$

Step 4: Combine and optimize.

$$ \Pr(S_n - \mathbb{E}[S_n] \geq nt) \leq \exp\!\left(-snt + \frac{s^2}{8}\sum_{i=1}^n (b_i - a_i)^2\right). $$

The exponent is a quadratic in \(s\), minimized at \(s^* = \frac{4nt}{\sum (b_i - a_i)^2}\). Substituting:

$$ \Pr(\bar{X} - \mathbb{E}[\bar{X}] \geq t) \leq \exp\!\left(\frac{-2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right). \quad \square $$

Example (Coin Flips Revisited) Flip a fair coin \(n = 100\) times. Each \(X_i \in \{0,1\}\), so \(b_i - a_i = 1\) and \(\mathbb{E}[\bar{X}] = 0.5\). What is \(\Pr(\bar{X} \geq 0.9)\)?

Hoeffding: \(\Pr(\bar{X} - 0.5 \geq 0.4) \leq e^{-2 \cdot 100 \cdot 0.16} = e^{-32} \approx 1.3 \times 10^{-14}\).

Compare this to Markov's bound of \(\approx 0.556\) from earlier. Hoeffding is astronomically tighter because it uses independence and boundedness, not just the mean.

The Chernoff Method The proof pattern — exponentiate, apply Markov, factor by independence, optimize — is called the Chernoff method. It's the engine behind virtually every concentration inequality in learning theory.

5. A First Attempt: Finite Hypothesis Classes

Armed with Hoeffding's inequality, we can immediately prove that ERM works for finite hypothesis classes. This first result is a warmup for the harder infinite case.

Recall: Union Bound For any events \(A_1, \ldots, A_k\): $$ \Pr\!\left(\bigcup_{i=1}^k A_i\right) \leq \sum_{i=1}^k \Pr(A_i). $$ This follows directly from sub-additivity of probability measures.

Theorem (Finite Class Bound) Let \(|\mathcal{H}| < \infty\). For a sample \(S\) of size \(m\) drawn i.i.d. from \(\mathcal{D}\): $$ \Pr_S\!\left[\sup_{h \in \mathcal{H}} |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right] \leq 2|\mathcal{H}|\, e^{-2m\epsilon^2}. $$

Proof

Step 1: Fix a single \(h\). The empirical risk \(L_S(h) = \frac{1}{m}\sum_{i=1}^m \mathbf{1}[h(x_i) \neq y_i]\) is the average of \(m\) i.i.d. random variables, each in \([0, 1]\), with mean \(L_{\mathcal{D}}(h)\). By Hoeffding's two-sided inequality:

$$ \Pr\!\left[|L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right] \leq 2e^{-2m\epsilon^2}. $$

Step 2: Union bound over \(\mathcal{H}\).

$$ \Pr\!\left[\exists\, h \in \mathcal{H} : |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right] \leq \sum_{h \in \mathcal{H}} 2e^{-2m\epsilon^2} = 2|\mathcal{H}|\, e^{-2m\epsilon^2}. \quad \square $$

Setting the bound equal to \(\delta\) and solving for \(m\):

$$ m \geq \frac{\ln(2|\mathcal{H}|/\delta)}{2\epsilon^2} \implies \Pr\!\left[\sup_h |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right] < \delta. $$

Example Suppose \(\mathcal{H}\) has 1000 hypotheses and we want uniform convergence within \(\epsilon = 0.05\) with confidence \(\delta = 0.01\). We need: $$ m \geq \frac{\ln(2 \cdot 1000 / 0.01)}{2 \cdot 0.0025} = \frac{\ln(200{,}000)}{0.005} \approx \frac{12.2}{0.005} = 2{,}440 \text{ samples.} $$

6. Why Finite Classes Aren’t Enough

The finite class bound has a \(\ln|\mathcal{H}|\) term. If \(\mathcal{H}\) is infinite (e.g., all linear classifiers in \(\mathbb{R}^d\)), this blows up. But infinite classes clearly can be learnable — linear regression works.

The problem is that the union bound overcounts. When \(\mathcal{H}\) is infinite, many hypotheses behave identically on any finite sample. A linear classifier with weights \((1.0, 2.0)\) and one with weights \((1.0001, 2.0001)\) almost certainly agree on every training point. So the "effective size" of \(\mathcal{H}\) on \(m\) points is much smaller than \(|\mathcal{H}|\).

VC dimension captures this effective size.

7. VC Dimension

Definition (Restriction) Given a hypothesis class \(\mathcal{H}\) and a set \(C = \{x_1, \ldots, x_m\} \subseteq \mathcal{X}\), the restriction of \(\mathcal{H}\) to \(C\) is $$ \mathcal{H}_C = \{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\} \subseteq \{0,1\}^m. $$ This is the set of all labeling patterns that hypotheses in \(\mathcal{H}\) can produce on \(C\).

Definition (Shattering) \(\mathcal{H}\) shatters a set \(C\) if \(\mathcal{H}_C = \{0,1\}^{|C|}\). That is, for every possible labeling of the points in \(C\), there exists some \(h \in \mathcal{H}\) that realizes it. If \(|C| = m\), this means \(|\mathcal{H}_C| = 2^m\).

Definition (Growth Function) The growth function (or shattering coefficient) of \(\mathcal{H}\) is $$ \Pi_{\mathcal{H}}(m) = \max_{C \subseteq \mathcal{X},\, |C| = m} |\mathcal{H}_C|. $$ It measures the maximum number of distinct labeling patterns that \(\mathcal{H}\) can produce on any set of \(m\) points. Always \(\Pi_{\mathcal{H}}(m) \leq 2^m\).

Definition (VC Dimension) The Vapnik–Chervonenkis (VC) dimension of \(\mathcal{H}\) is $$ \operatorname{VCdim}(\mathcal{H}) = \max\{m : \Pi_{\mathcal{H}}(m) = 2^m\} = \max\{m : \exists\, C \text{ with } |C| = m \text{ that } \mathcal{H} \text{ shatters}\}. $$ If \(\mathcal{H}\) can shatter arbitrarily large sets, \(\operatorname{VCdim}(\mathcal{H}) = \infty\).

To prove \(\operatorname{VCdim}(\mathcal{H}) \geq d\), find one set of size \(d\) that is shattered. To prove \(\operatorname{VCdim}(\mathcal{H}) \leq d\), show that no set of size \(d+1\) is shattered.

Example 1: Threshold Classifiers on \(\mathbb{R}\)

Example (Thresholds) \(\mathcal{H} = \{h_a : h_a(x) = \mathbf{1}[x \geq a],\ a \in \mathbb{R}\}\).

\(\operatorname{VCdim} \geq 1\): Take \(C = \{5\}\). Setting \(a = 3\) gives label 1; setting \(a = 7\) gives label 0. Both labelings are realized. \(\checkmark\)

\(\operatorname{VCdim} \leq 1\): Take any \(C = \{x_1, x_2\}\) with \(x_1 < x_2\). The labeling \((1, 0)\) — where \(x_1\) is labeled 1 and \(x_2\) is labeled 0 — requires \(x_1 \geq a\) and \(x_2 < a\), which is impossible since \(x_1 < x_2\). So no two-element set is shattered.

Conclusion: \(\operatorname{VCdim}(\mathcal{H}) = 1\).

Example 2: Intervals on \(\mathbb{R}\)

Example (Intervals) \(\mathcal{H} = \{h_{a,b} : h_{a,b}(x) = \mathbf{1}[a \leq x \leq b],\ a \leq b\}\).

\(\operatorname{VCdim} \geq 2\): Take \(C = \{1, 3\}\).

  • Label \((0,0)\): choose \(a = 5, b = 6\) (interval misses both).
  • Label \((1,0)\): choose \(a = 0.5, b = 1.5\) (hits only 1).
  • Label \((0,1)\): choose \(a = 2.5, b = 3.5\) (hits only 3).
  • Label \((1,1)\): choose \(a = 0.5, b = 3.5\) (hits both).

All four labelings realized. \(\checkmark\)

\(\operatorname{VCdim} \leq 2\): Take any \(C = \{x_1, x_2, x_3\}\) with \(x_1 < x_2 < x_3\). The labeling \((1, 0, 1)\) requires \(x_1, x_3\) inside the interval but \(x_2\) outside it. Since intervals are contiguous, this is impossible.

Conclusion: \(\operatorname{VCdim}(\mathcal{H}) = 2\).

Example 3: Linear Classifiers in \(\mathbb{R}^d\)

A halfspace (or linear classifier) in \(\mathbb{R}^d\) is defined by a weight vector \(w \in \mathbb{R}^d\) and a bias \(b \in \mathbb{R}\):

$$ h_{w,b}(x) = \mathbf{1}[\langle w, x \rangle + b \geq 0]. $$

Geometrically, the equation \(\langle w, x \rangle + b = 0\) defines a hyperplane in \(\mathbb{R}^d\). The classifier labels everything on one side as 1 and everything on the other side as 0. The vector \(w\) is the normal to the hyperplane (it points toward the "1" side), and \(b\) shifts the hyperplane away from the origin.

The hypothesis class is \(\mathcal{H} = \{h_{w,b} : w \in \mathbb{R}^d,\, b \in \mathbb{R}\}\). We'll prove \(\operatorname{VCdim}(\mathcal{H}) = d + 1\).

Lower bound: \(\operatorname{VCdim} \geq d + 1\) (concrete example in \(\mathbb{R}^2\))

We need to exhibit \(d + 1\) points that halfspaces can shatter. Let's start with \(d = 2\) to build intuition, then generalize.

The points. Take the three points \(p_0 = (0, 0)\) (origin), \(p_1 = (1, 0)\) (= \(e_1\)), \(p_2 = (0, 1)\) (= \(e_2\)).

The claim. For every labeling \((y_0, y_1, y_2) \in \{0,1\}^3\), there exists a line (hyperplane in \(\mathbb{R}^2\)) that puts the 1-labeled points on one side and the 0-labeled points on the other. That's \(2^3 = 8\) labelings to check.

The construction. Given a desired labeling \((y_0, y_1, y_2)\), set:

$$ b = y_0 - \tfrac{1}{2}, \qquad w_1 = y_1 - y_0, \qquad w_2 = y_2 - y_0. $$

Why does this work? Let's verify by computing \(\langle w, p_i \rangle + b\) for each point:

  • At the origin \(p_0 = (0,0)\): \(\langle w, (0,0) \rangle + b = 0 + b = y_0 - \frac{1}{2}\). This is \(\geq 0\) when \(y_0 = 1\) (gives \(+\frac{1}{2}\)) and \(< 0\) when \(y_0 = 0\) (gives \(-\frac{1}{2}\)). So \(p_0\) gets label \(y_0\). \(\checkmark\)
  • At \(p_1 = (1, 0)\): \(\langle w, (1,0) \rangle + b = w_1 + b = (y_1 - y_0) + (y_0 - \frac{1}{2}) = y_1 - \frac{1}{2}\). This is \(\geq 0\) iff \(y_1 = 1\). So \(p_1\) gets label \(y_1\). \(\checkmark\)
  • At \(p_2 = (0, 1)\): \(\langle w, (0,1) \rangle + b = w_2 + b = (y_2 - y_0) + (y_0 - \frac{1}{2}) = y_2 - \frac{1}{2}\). This is \(\geq 0\) iff \(y_2 = 1\). So \(p_2\) gets label \(y_2\). \(\checkmark\)

The algebra has a clean structure: for basis vector \(e_i\), the inner product \(\langle w, e_i \rangle\) picks out \(w_i\), and the \(y_0\) terms cancel, leaving \(y_i - \frac{1}{2}\). Let's verify all 8 labelings explicitly:

Labeling \((y_0, y_1, y_2)\) \(w = (w_1, w_2)\) \(b\) Scores: \(p_0,\ p_1,\ p_2\) Labels
(0, 0, 0)(0, 0)\(-\frac{1}{2}\)\(-\frac{1}{2},\ -\frac{1}{2},\ -\frac{1}{2}\)0, 0, 0 \(\checkmark\)
(1, 0, 0)(\(-1, -1\))\(+\frac{1}{2}\)\(+\frac{1}{2},\ -\frac{1}{2},\ -\frac{1}{2}\)1, 0, 0 \(\checkmark\)
(0, 1, 0)(1, 0)\(-\frac{1}{2}\)\(-\frac{1}{2},\ +\frac{1}{2},\ -\frac{1}{2}\)0, 1, 0 \(\checkmark\)
(0, 0, 1)(0, 1)\(-\frac{1}{2}\)\(-\frac{1}{2},\ -\frac{1}{2},\ +\frac{1}{2}\)0, 0, 1 \(\checkmark\)
(1, 1, 0)(0, \(-1\))\(+\frac{1}{2}\)\(+\frac{1}{2},\ +\frac{1}{2},\ -\frac{1}{2}\)1, 1, 0 \(\checkmark\)
(1, 0, 1)(\(-1\), 0)\(+\frac{1}{2}\)\(+\frac{1}{2},\ -\frac{1}{2},\ +\frac{1}{2}\)1, 0, 1 \(\checkmark\)
(0, 1, 1)(1, 1)\(-\frac{1}{2}\)\(-\frac{1}{2},\ +\frac{1}{2},\ +\frac{1}{2}\)0, 1, 1 \(\checkmark\)
(1, 1, 1)(0, 0)\(+\frac{1}{2}\)\(+\frac{1}{2},\ +\frac{1}{2},\ +\frac{1}{2}\)1, 1, 1 \(\checkmark\)

All 8 labelings are realized. These three points are shattered by halfspaces in \(\mathbb{R}^2\).

Lower bound: the general construction

The same idea works in \(\mathbb{R}^d\). Take the \(d + 1\) points \(\{0, e_1, e_2, \ldots, e_d\}\) where \(e_i\) is the \(i\)-th standard basis vector (all zeros except a 1 in position \(i\)) and \(0\) is the origin. For any labeling \((y_0, y_1, \ldots, y_d) \in \{0,1\}^{d+1}\), set:

$$ b = y_0 - \tfrac{1}{2}, \qquad w_i = y_i - y_0 \quad \text{for } i = 1, \ldots, d. $$

The same calculation as the \(\mathbb{R}^2\) case goes through in every dimension:

  • At the origin: \(\langle w, 0 \rangle + b = b = y_0 - \frac{1}{2}\), which is \(\geq 0\) iff \(y_0 = 1\).
  • At \(e_i\): \(\langle w, e_i \rangle + b = w_i + b = (y_i - y_0) + (y_0 - \frac{1}{2}) = y_i - \frac{1}{2}\), which is \(\geq 0\) iff \(y_i = 1\).

Every labeling is realized, so \(\operatorname{VCdim}(\mathcal{H}) \geq d + 1\).

Upper bound: \(\operatorname{VCdim} \leq d + 1\) (via Radon’s Theorem)

We need to show that no set of \(d + 2\) points in \(\mathbb{R}^d\) can be shattered by halfspaces. The key tool is Radon's theorem, a fundamental result in convex geometry that we'll now state, prove, and illustrate.

Theorem (Radon, 1921) Any set of \(d + 2\) points in \(\mathbb{R}^d\) can be partitioned into two non-empty subsets whose convex hulls intersect.

Intuition in \(\mathbb{R}^1\). Take any 3 points on the real line, say \(p_1 = 1,\ p_2 = 4,\ p_3 = 7\). The "middle" point \(p_2 = 4\) lies inside the segment \([1, 7]\) connecting the other two. So the partition \(I = \{p_2\},\ J = \{p_1, p_3\}\) has \(\operatorname{conv}(I) = \{4\}\) and \(\operatorname{conv}(J) = [1, 7]\), and \(4 \in [1, 7]\). The convex hulls intersect. This always works in \(\mathbb{R}^1\) for any 3 points — the median is always inside the interval formed by the other two.

In higher dimensions, the partition isn't always as obvious, but the theorem guarantees one always exists. We'll see a concrete picture in \(\mathbb{R}^2\) after the proof, when we work through a full numerical example.

Proof of Radon’s Theorem

The goal. We want to split \(d+2\) points into two groups \(I\) and \(J\) and find a point \(q\) that is simultaneously in the convex hull of both groups. That means we need:

$$ q = \sum_{i \in I} \alpha_i\, p_i = \sum_{j \in J} \beta_j\, p_j $$

where the \(\alpha_i\) are non-negative and sum to 1 (making the left side a convex combination of the \(I\)-points) and the \(\beta_j\) are non-negative and sum to 1 (making the right side a convex combination of the \(J\)-points).

The trick is to rewrite this condition in a form that linear algebra can hand us for free.

Step 1: Reformulate as a single equation. Move everything to one side:

$$ \sum_{i \in I} \alpha_i\, p_i - \sum_{j \in J} \beta_j\, p_j = 0. $$

Define new variables: \(\lambda_i = \alpha_i\) for \(i \in I\) (positive) and \(\lambda_j = -\beta_j\) for \(j \in J\) (negative). Then the equation becomes:

$$ \sum_{k=1}^{d+2} \lambda_k\, p_k = 0. \qquad \text{(Equation 1)} $$

What about the constraint that the \(\alpha\)'s sum to 1 and the \(\beta\)'s sum to 1? That means \(\sum_{i \in I} \alpha_i - \sum_{j \in J} \beta_j = 1 - 1 = 0\), which in terms of the \(\lambda\)'s is:

$$ \sum_{k=1}^{d+2} \lambda_k = 0. \qquad \text{(Equation 2)} $$

So the two equations together are exactly the condition for overlapping convex hulls, repackaged into a form where all the unknowns \((\lambda_1, \ldots, \lambda_{d+2})\) appear in one system. The positive \(\lambda\)'s correspond to one group, the negative \(\lambda\)'s to the other.

But we don't yet know the partition \((I, J)\) — that's what we're trying to find! The insight: solve the linear system first, then read off the partition from the signs of the solution.

Step 2: The system always has a non-trivial solution. Let's count equations vs. unknowns:

  • Equation 1 (\(\sum \lambda_k p_k = 0\)): Since each \(p_k \in \mathbb{R}^d\), this is really \(d\) scalar equations, one per coordinate. In \(\mathbb{R}^2\) for example, it splits into an \(x\)-equation and a \(y\)-equation.
  • Equation 2 (\(\sum \lambda_k = 0\)): This is 1 scalar equation.

Total: \(d + 1\) equations in \(d + 2\) unknowns. Since there are more unknowns than equations, the system is underdetermined: linear algebra guarantees at least one non-trivial solution exists. (Formally: the null space of a \((d+1) \times (d+2)\) matrix has dimension \(\geq 1\).)

Step 3: Read off the partition. Fix a non-trivial solution \((\lambda_1, \ldots, \lambda_{d+2})\). Equation 2 says \(\sum \lambda_k = 0\), and not all \(\lambda_k\) are zero, so there must be at least one positive and at least one negative \(\lambda_k\). (If they were all \(\geq 0\) and summed to zero, they'd all be zero — contradicting non-triviality.) Define:

$$ I = \{k : \lambda_k > 0\}, \qquad J = \{k : \lambda_k \leq 0\}. $$

Both \(I\) and \(J\) are non-empty. Let \(\Lambda = \sum_{i \in I} \lambda_i\). Since \(\sum_k \lambda_k = 0\), we get \(\Lambda = -\sum_{j \in J} \lambda_j > 0\).

Step 4: Extract the intersection point. From Equation 1 (\(\sum \lambda_k p_k = 0\)), move the negative terms to the other side:

$$ \sum_{i \in I} \lambda_i\, p_i = -\sum_{j \in J} \lambda_j\, p_j = \sum_{j \in J} |\lambda_j|\, p_j. $$

Divide both sides by \(\Lambda\):

$$ \underbrace{\sum_{i \in I} \frac{\lambda_i}{\Lambda}\, p_i}_{\text{convex combination of } \{p_i\}_{i \in I}} = \underbrace{\sum_{j \in J} \frac{|\lambda_j|}{\Lambda}\, p_j}_{\text{convex combination of } \{p_j\}_{j \in J}}. $$

Why are these convex combinations?

  • Left side: Each \(\lambda_i / \Lambda\) is positive (since \(i \in I\) means \(\lambda_i > 0\)), and they sum to \(\Lambda / \Lambda = 1\). So this is a weighted average of the \(I\)-points, which lies in \(\operatorname{conv}(\{p_i : i \in I\})\).
  • Right side: Each \(|\lambda_j| / \Lambda\) is non-negative, and they sum to \((\sum_{j \in J} |\lambda_j|) / \Lambda = \Lambda / \Lambda = 1\). So this is a weighted average of the \(J\)-points, which lies in \(\operatorname{conv}(\{p_j : j \in J\})\).

Both sides are equal, so the common point lies in both convex hulls. \(\square\)

x y 0 1.5 3 0 1.5 3 (1.5, 1.5) p₁ (0,0) p₂ (3,3) p₃ (3,0) p₄ (0,3) I = {p₁, p₂} — conv(I) J = {p₃, p₄} — conv(J)
Radon's theorem in \(\mathbb{R}^2\): 4 points are partitioned into I = {p₁, p₂} and J = {p₃, p₄}. The two segments (convex hulls) cross at (1.5, 1.5).

Radon’s Theorem: Worked Example in \(\mathbb{R}^2\)

Take the 4 points \(p_1 = (0,0),\ p_2 = (3,3),\ p_3 = (3,0),\ p_4 = (0,3)\) from the figure above. We need \(\lambda_1 p_1 + \lambda_2 p_2 + \lambda_3 p_3 + \lambda_4 p_4 = 0\) and \(\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 0\).

Writing out the equations:

  • \(x\)-coordinates: \(0\lambda_1 + 3\lambda_2 + 3\lambda_3 + 0\lambda_4 = 0 \implies \lambda_2 + \lambda_3 = 0\).
  • \(y\)-coordinates: \(0\lambda_1 + 3\lambda_2 + 0\lambda_3 + 3\lambda_4 = 0 \implies \lambda_2 + \lambda_4 = 0\).
  • Sum: \(\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 0\).

From the first two: \(\lambda_3 = -\lambda_2\) and \(\lambda_4 = -\lambda_2\). Substituting into the sum: \(\lambda_1 + \lambda_2 - \lambda_2 - \lambda_2 = 0 \implies \lambda_1 = \lambda_2\).

Choose \(\lambda_2 = 1\): then \(\lambda_1 = 1,\ \lambda_3 = -1,\ \lambda_4 = -1\).

Partition: \(I = \{1, 2\}\) (positive \(\lambda\)'s), \(J = \{3, 4\}\) (negative \(\lambda\)'s). \(\Lambda = 1 + 1 = 2\).

Intersection point:

$$ \frac{1}{2}p_1 + \frac{1}{2}p_2 = \frac{1}{2}(0,0) + \frac{1}{2}(3,3) = (1.5, 1.5). $$ $$ \frac{1}{2}p_3 + \frac{1}{2}p_4 = \frac{1}{2}(3,0) + \frac{1}{2}(0,3) = (1.5, 1.5). \quad \checkmark $$

The midpoint of the diagonal \(p_1 p_2\) equals the midpoint of the diagonal \(p_3 p_4\). Both convex hulls pass through \((1.5, 1.5)\).

Applying Radon’s Theorem to Halfspaces

Now we use Radon's theorem to show that no set of \(d+2\) points can be shattered by halfspaces.

Proof that \(\operatorname{VCdim}(\text{halfspaces in } \mathbb{R}^d) \leq d + 1\)

Take any set of \(d+2\) points \(\{p_1, \ldots, p_{d+2}\} \subseteq \mathbb{R}^d\). By Radon's theorem, there is a partition into non-empty sets \(I\) and \(J\) with a point \(q \in \operatorname{conv}(\{p_i\}_{i \in I}) \cap \operatorname{conv}(\{p_j\}_{j \in J})\).

Consider the labeling that assigns 1 to all points in \(I\) and 0 to all points in \(J\). We'll show no halfspace can realize this labeling.

Suppose for contradiction that some halfspace \(h_{w,b}\) realizes it. Then:

  • For all \(i \in I\): \(\langle w, p_i \rangle + b \geq 0\) (labeled 1).
  • For all \(j \in J\): \(\langle w, p_j \rangle + b < 0\) (labeled 0).

Since \(q \in \operatorname{conv}(\{p_i\}_{i \in I})\), we can write \(q = \sum_{i \in I} \alpha_i p_i\) with \(\alpha_i \geq 0\) and \(\sum \alpha_i = 1\). Then:

$$ \langle w, q \rangle + b = \sum_{i \in I} \alpha_i \underbrace{(\langle w, p_i \rangle + b)}_{\geq\, 0} \geq 0. $$

Since \(q \in \operatorname{conv}(\{p_j\}_{j \in J})\), we can also write \(q = \sum_{j \in J} \beta_j p_j\) with \(\beta_j \geq 0\) and \(\sum \beta_j = 1\). Then:

$$ \langle w, q \rangle + b = \sum_{j \in J} \beta_j \underbrace{(\langle w, p_j \rangle + b)}_{<\, 0} < 0. $$

So \(\langle w, q \rangle + b\) is both \(\geq 0\) and \(< 0\). Contradiction.

Therefore the labeling "1 on \(I\), 0 on \(J\)" cannot be realized by any halfspace, so these \(d+2\) points are not shattered. Since the points were arbitrary, no set of \(d+2\) points can be shattered. \(\square\)

Concrete Failure in \(\mathbb{R}^2\)

With the 4 points from our Radon example, the labeling \(p_1 = 1, p_2 = 1, p_3 = 0, p_4 = 0\) asks for a line that puts \((0,0)\) and \((3,3)\) on one side and \((3,0)\) and \((0,3)\) on the other. But the segment from \((0,0)\) to \((3,3)\) and the segment from \((3,0)\) to \((0,3)\) cross at \((1.5, 1.5)\). Any line separating the two pairs would have to separate a point from itself. Impossible.

Conclusion: \(\operatorname{VCdim}(\text{halfspaces in } \mathbb{R}^d) = d + 1\).

Example 4: All Functions

Example (Infinite VC Dimension) Let \(\mathcal{X} = \mathbb{R}\) and \(\mathcal{H} = \{0,1\}^{\mathbb{R}}\) (all functions from \(\mathbb{R}\) to \(\{0,1\}\)). Then for any set \(C\) of any size, every labeling is realized by some function. So \(\operatorname{VCdim}(\mathcal{H}) = \infty\).

As we'll see, this class is not PAC learnable — a hypothesis class that can represent anything can learn nothing.

8. The Sauer–Shelah Lemma

The VC dimension tells us the largest set that can be shattered. The Sauer–Shelah lemma says something much stronger: once the VC dimension is finite, the growth function is polynomial, not exponential. This is the key combinatorial fact that makes learning possible.

Theorem (Sauer–Shelah Lemma) Let \(\operatorname{VCdim}(\mathcal{H}) = d\). Then for all \(m\): $$ \Pi_{\mathcal{H}}(m) \leq \Phi(m, d) \quad \text{where} \quad \Phi(m, d) = \sum_{i=0}^{d} \binom{m}{i}. $$

This formula \(\Phi(m, d)\) appears out of nowhere. Before proving the lemma, let's understand where it comes from and why it's the natural candidate.

Where does \(\Phi(m, d)\) come from?

The question: we have \(m\) points and a hypothesis class with VC dimension \(d\). We know that \(\mathcal{H}\) can shatter sets of size \(\leq d\) but cannot shatter any set of size \(d + 1\). How many distinct labeling patterns can \(\mathcal{H}\) produce on \(m\) points?

The upper-bound intuition: think about it from the perspective of subsets. A labeling pattern on \(m\) points is a binary string like \((0, 1, 1, 0, 1, 0, \ldots)\). For a subset \(T \subseteq S\) of the \(m\) points, we say \(\mathcal{H}\) shatters \(T\) if \(\mathcal{H}\) can produce every possible labeling of \(T\) (no matter what labels the other points get). The VC dimension constraint says: no subset of size \(> d\) is shattered.

Now, the maximum number of labelings happens when \(\mathcal{H}\) shatters as many subsets as it can — specifically, every subset of size \(\leq d\) (it's allowed to shatter those) but no subset of size \(d + 1\) (it's forbidden from shattering those).

There's a beautiful combinatorial result (proved in the appendix of Sauer's original paper) that connects "which subsets are shattered" to "how many labeling patterns exist": the number of distinct patterns on \(m\) points is at most the number of subsets that are shattered. The intuition is that each shattered subset \(T\) contributes one "degree of freedom" — the ability to independently choose labels for the points in \(T\) — and each degree of freedom corresponds to one additional labeling pattern.

So if \(\mathcal{H}\) can shatter every subset of size \(\leq d\) but nothing larger, how many subsets is that? It's exactly the number of subsets of \(\{1, \ldots, m\}\) with size \(0, 1, 2, \ldots, d\):

$$ \text{subsets of size } 0 + \text{subsets of size } 1 + \cdots + \text{subsets of size } d = \binom{m}{0} + \binom{m}{1} + \cdots + \binom{m}{d} = \Phi(m, d). $$

Each term \(\binom{m}{i}\) counts the number of ways to choose \(i\) points out of \(m\). So:

  • \(\binom{m}{0} = 1\): the empty subset (always "shattered" vacuously).
  • \(\binom{m}{1} = m\): each individual point can be shattered (given both labels).
  • \(\binom{m}{2} = \frac{m(m-1)}{2}\): each pair of points can be shattered (given all 4 labelings).
  • \(\vdots\)
  • \(\binom{m}{d}\): each subset of exactly \(d\) points can be shattered.

Beyond size \(d\), the VC dimension constraint kicks in and no more subsets can be shattered. So \(\Phi(m, d)\) is the maximum number of "independently controllable" subsets — and hence the maximum number of distinct labelings.

Example: Intervals (\(d = 2\)) on \(m = 5\) Points With 5 points on the real line and the interval hypothesis class (VC dim = 2):

  • Subsets of size 0: \(\binom{5}{0} = 1\) (the empty labeling).
  • Subsets of size 1: \(\binom{5}{1} = 5\) (each individual point can be labeled 1 while the rest are 0).
  • Subsets of size 2: \(\binom{5}{2} = 10\) (each pair of adjacent or non-adjacent points can form an interval — actually, intervals can only shatter contiguous pairs, but the bound counts all subsets of size \(\leq d\) as a worst-case upper bound).

Total: \(\Phi(5, 2) = 1 + 5 + 10 = 16\). And indeed, we verified earlier that intervals produce exactly 16 distinct patterns on 5 ordered points.

Numerical Checks at Scale

  • \(d = 1, m = 10\): \(\Phi(10,1) = 1 + 10 = 11\), versus \(2^{10} = 1024\).
  • \(d = 3, m = 10\): \(\Phi(10,3) = 1 + 10 + 45 + 120 = 176\), versus \(2^{10} = 1024\).
  • \(d = 3, m = 100\): \(\Phi(100,3) = 1 + 100 + 4950 + 161700 = 166{,}751\), versus \(2^{100} \approx 10^{30}\).

For large \(m\), the growth function is \(O(m^d)\) instead of \(2^m\). This is a dramatic reduction: polynomial instead of exponential.

Now let's prove that \(\Phi(m, d)\) is indeed an upper bound on the growth function.

Proof of the Sauer–Shelah Lemma

We prove: for any finite set \(S\) with \(|S| = m\), \(|\mathcal{H}_S| \leq \Phi(m, d)\).

Induction on \(m\).

Base case \(m = 1\): \(|\mathcal{H}_S| \leq 2 = 1 + 1 = \Phi(1, d)\) for any \(d \geq 1\). If \(d = 0\), then \(\mathcal{H}\) cannot shatter any single point, so each point gets at most one label, giving \(|\mathcal{H}_S| \leq 1 = \Phi(1, 0)\). \(\checkmark\)

Inductive step. Fix a set \(S\) with \(|S| = m \geq 2\). Pick any point \(x \in S\) and let \(S' = S \setminus \{x\}\).

Partition the restriction patterns in \(\mathcal{H}_S\) according to how they behave on \(x\):

  • Let \(\mathcal{H}_{S'}\) be the restriction of \(\mathcal{H}\) to \(S'\) (ignoring \(x\)). Each pattern in \(\mathcal{H}_{S'}\) is a sequence in \(\{0,1\}^{m-1}\).
  • Some patterns on \(S'\) appear with both labels on \(x\) (both 0 and 1). Call the set of such patterns \(\mathcal{H}^{\pm}\).
  • The remaining patterns on \(S'\) appear with only one label on \(x\).

Counting: Each pattern in \(\mathcal{H}_{S'} \setminus \mathcal{H}^{\pm}\) contributes 1 to \(|\mathcal{H}_S|\) (one label on \(x\)), and each pattern in \(\mathcal{H}^{\pm}\) contributes 2. So:

$$ |\mathcal{H}_S| = |\mathcal{H}_{S'}| + |\mathcal{H}^{\pm}|. $$

Key claim: \(\operatorname{VCdim}(\mathcal{H}^{\pm}) \leq d - 1\).

Why? Suppose \(\mathcal{H}^{\pm}\) shatters some set \(T \subseteq S'\). Then for every labeling of \(T\), there is a pattern in \(\mathcal{H}^{\pm}\) realizing it. By definition of \(\mathcal{H}^{\pm}\), each such pattern extends to both labels on \(x\). So for every labeling of \(T \cup \{x\}\), there is a hypothesis in \(\mathcal{H}\) realizing it. This means \(\mathcal{H}\) shatters \(T \cup \{x\}\), so \(|T| + 1 \leq d\), i.e., \(|T| \leq d - 1\).

Applying the inductive hypothesis:

  • \(|\mathcal{H}_{S'}| \leq \Phi(m-1, d)\) (restriction to \(m-1\) points, VC dim at most \(d\)).
  • \(|\mathcal{H}^{\pm}| \leq \Phi(m-1, d-1)\) (restriction to \(m-1\) points, VC dim at most \(d-1\)).

Finishing the induction: we need \(\Phi(m-1,d) + \Phi(m-1,d-1) = \Phi(m,d)\).

This is a "Pascal-style" identity for the function \(\Phi\). It's analogous to (and proved using) the classical Pascal's rule for binomial coefficients. Let's recall that first:

Pascal’s Rule (Reminder) For any \(n \geq 1\) and \(1 \leq k \leq n\): $$ \binom{n-1}{k} + \binom{n-1}{k-1} = \binom{n}{k}. $$ Why: Consider \(n\) objects and count the subsets of size \(k\). Fix one object. Either it's not in the subset (choose \(k\) from the remaining \(n-1\): \(\binom{n-1}{k}\) ways) or it is (choose the other \(k-1\) from \(n-1\): \(\binom{n-1}{k-1}\) ways). These two cases are exhaustive and disjoint, so they sum to \(\binom{n}{k}\).

Now we show the identity step by step. First, expand both \(\Phi\) terms using their definitions:

$$ \Phi(m-1,d) + \Phi(m-1,d-1) = \underbrace{\sum_{i=0}^{d}\binom{m-1}{i}}_{\text{first sum}} + \underbrace{\sum_{i=0}^{d-1}\binom{m-1}{i}}_{\text{second sum}}. $$

The idea: we want to pair up terms from the two sums so that each pair collapses via Pascal's rule into a single \(\binom{m}{\cdot}\). The sums don't have the same number of terms (the first goes up to \(d\), the second up to \(d-1\)), so we'll peel off one extra term from the first sum and pair the rest.

Step 1: Peel off the \(i = 0\) term from the first sum.

$$ = \binom{m-1}{0} + \sum_{i=1}^{d}\binom{m-1}{i} + \sum_{i=0}^{d-1}\binom{m-1}{i}. $$

Step 2: Reindex the second sum. Substitute \(j = i + 1\), so when \(i\) goes from \(0\) to \(d-1\), \(j\) goes from \(1\) to \(d\), and \(\binom{m-1}{i} = \binom{m-1}{j-1}\):

$$ = \binom{m-1}{0} + \sum_{i=1}^{d}\binom{m-1}{i} + \sum_{j=1}^{d}\binom{m-1}{j-1}. $$

Now both sums run from \(1\) to \(d\). Rename \(j\) back to \(i\) in the second sum (it's just a dummy variable) and combine:

$$ = \binom{m-1}{0} + \sum_{i=1}^{d}\left[\binom{m-1}{i} + \binom{m-1}{i-1}\right]. $$

Step 3: Apply Pascal’s rule to each pair. For \(i \geq 1\): \(\binom{m-1}{i} + \binom{m-1}{i-1} = \binom{m}{i}\). So:

$$ = \underbrace{\binom{m-1}{0}}_{= 1 = \binom{m}{0}} + \sum_{i=1}^{d}\binom{m}{i} = \sum_{i=0}^{d}\binom{m}{i} = \Phi(m, d). $$

(In the first step we used \(\binom{m-1}{0} = 1 = \binom{m}{0}\).)

Numerical Check Let \(m = 5, d = 2\):

  • \(\Phi(4, 2) = \binom{4}{0} + \binom{4}{1} + \binom{4}{2} = 1 + 4 + 6 = 11\).
  • \(\Phi(4, 1) = \binom{4}{0} + \binom{4}{1} = 1 + 4 = 5\).
  • \(\Phi(4, 2) + \Phi(4, 1) = 11 + 5 = 16\).
  • \(\Phi(5, 2) = \binom{5}{0} + \binom{5}{1} + \binom{5}{2} = 1 + 5 + 10 = 16\). \(\checkmark\)

So \(|\mathcal{H}_S| \leq \Phi(m-1, d) + \Phi(m-1, d-1) = \Phi(m, d)\). \(\square\)

The Polynomial Corollary

For sample complexity results, we need a cleaner upper bound on \(\Phi(m, d)\):

Corollary For \(m \geq d \geq 1\): $$ \Phi(m, d) = \sum_{i=0}^{d}\binom{m}{i} \leq \left(\frac{em}{d}\right)^d. $$

Proof

Step 1. Consider the binomial expansion \((1 + t)^m = \sum_{i=0}^{m} \binom{m}{i} t^i\) for \(t > 0\). Since all terms are positive:

$$ \sum_{i=0}^{d}\binom{m}{i} t^i \leq (1 + t)^m. $$

Step 2. For \(i \leq d\) and \(0 < t \leq 1\), we have \(t^i \geq t^d\). So:

$$ t^d \sum_{i=0}^{d}\binom{m}{i} \leq \sum_{i=0}^{d}\binom{m}{i} t^i \leq (1+t)^m. $$

Step 3. Dividing by \(t^d\):

$$ \sum_{i=0}^{d}\binom{m}{i} \leq \frac{(1+t)^m}{t^d}. $$

Step 4. Set \(t = d/m\) (which is \(\leq 1\) since \(m \geq d\)):

$$ \sum_{i=0}^{d}\binom{m}{i} \leq \frac{(1 + d/m)^m}{(d/m)^d} = \frac{(1 + d/m)^m \cdot m^d}{d^d}. $$

Step 5. The standard inequality \((1 + x/n)^n \leq e^x\) gives \((1 + d/m)^m \leq e^d\). So:

$$ \sum_{i=0}^{d}\binom{m}{i} \leq \frac{e^d \cdot m^d}{d^d} = \left(\frac{em}{d}\right)^d. \quad \square $$

What Sauer–Shelah Really Says The growth function transitions sharply: while \(m \leq d\), we can have \(\Pi_{\mathcal{H}}(m) = 2^m\) (exponential). The moment \(m > d\), the growth function is forced to be at most polynomial: \(O(m^d)\). There is no middle ground. This phase transition is what makes finite VC dimension so powerful — it converts a combinatorial fact (can't shatter large sets) into a quantitative bound (polynomial growth) that feeds directly into generalization bounds.

9. From VC Dimension to Generalization: Symmetrization

Taking Stock: What We Have and What We Need

Let's pause and see where we are. We have two powerful tools:

  1. Hoeffding's inequality (Section 4): for a single, fixed hypothesis \(h\), the training error \(L_S(h)\) concentrates around the true error \(L_{\mathcal{D}}(h)\) exponentially fast.
  2. Sauer–Shelah lemma (Section 8): if \(\operatorname{VCdim}(\mathcal{H}) = d\), then on any \(m\) points, \(\mathcal{H}\) produces at most \((em/d)^d\) distinct labeling patterns — polynomial, not exponential.

And we have one goal:

Goal: Bound \(\displaystyle\Pr_S\!\left[\sup_{h \in \mathcal{H}} |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right]\) — the probability that training error is a bad estimate of true error for some hypothesis.

The naive attempt would be: apply Hoeffding to each hypothesis individually, then take a union bound over all \(h \in \mathcal{H}\). For a finite class this works beautifully (Section 5). But for an infinite class, even with the Sauer–Shelah polynomial bound on the growth function, there's a subtle problem.

The problem: Sauer–Shelah tells us that on any fixed set of \(m\) points, there are at most \(\Pi_{\mathcal{H}}(m)\) distinct behaviors. But we want to union-bound over the behaviors on a random sample \(S\) — and the sample is the thing we're taking probabilities over. We can't simultaneously fix the sample (to apply Sauer–Shelah) and randomize over it (to apply Hoeffding). This is a chicken-and-egg problem.

There's a second problem: the quantity we're bounding involves \(L_{\mathcal{D}}(h)\), the true risk, which depends on the unknown distribution \(\mathcal{D}\). We can't condition on it, compute it, or even evaluate it for specific hypotheses. It's a ghost.

The Key Idea: Replace the Ghost with a Ghost You Can See

The symmetrization trick, introduced by Vapnik and Chervonenkis in their landmark 1971 paper, solves both problems in one stroke. The idea is strikingly simple:

The Ghost Sample Trick (Vapnik & Chervonenkis, 1971)

Imagine you had a second, independent training set \(S' = ((x'_1, y'_1), \ldots, (x'_m, y'_m))\) drawn from the same distribution \(\mathcal{D}\). You don't actually have this sample — it's a thought experiment. But since \(S'\) is drawn from \(\mathcal{D}\), its empirical risk \(L_{S'}(h)\) is an unbiased estimate of the true risk \(L_{\mathcal{D}}(h)\).

So instead of comparing \(L_S(h)\) to the unknowable \(L_{\mathcal{D}}(h)\), compare it to the concrete \(L_{S'}(h)\). If \(L_S(h)\) is far from \(L_{\mathcal{D}}(h)\), then with decent probability, \(L_S(h)\) is also far from \(L_{S'}(h)\) — because \(L_{S'}(h)\) is close to \(L_{\mathcal{D}}(h)\).

Why does this help? Because once we're comparing \(L_S(h)\) to \(L_{S'}(h)\) (two concrete samples), we can:

  1. Fix the combined sample \(C = S \cup S'\) (a pool of \(2m\) data points) and condition on it.
  2. Apply Sauer–Shelah to \(C\): there are at most \(\Pi_{\mathcal{H}}(2m)\) distinct behavior patterns on these \(2m\) fixed points.
  3. Randomize only over the split: conditioned on \(C\), the only randomness is which \(m\) points go to \(S\) and which go to \(S'\). This is like a random permutation, and we can apply Hoeffding to each of the \(\Pi_{\mathcal{H}}(2m)\) patterns.

This resolves the chicken-and-egg problem: fix the points (enabling Sauer–Shelah), randomize only over the partition (enabling Hoeffding). Brilliant.

The Symmetrization Lemma

Let's make this precise. The first step is showing that the ghost sample \(S'\) is a good stand-in for \(L_{\mathcal{D}}\):

Lemma (Symmetrization) For \(m \geq 2/\epsilon^2\): $$ \Pr_S\!\left[\sup_{h \in \mathcal{H}} \left(L_{\mathcal{D}}(h) - L_S(h)\right) > \epsilon\right] \leq 2\, \Pr_{S, S'}\!\left[\sup_{h \in \mathcal{H}} \left(L_{S'}(h) - L_S(h)\right) > \frac{\epsilon}{2}\right]. $$

In words: if some hypothesis has a big gap between true and training error (left side), then with decent probability, a ghost sample will also reveal a big gap (right side). The \(\epsilon\) shrinks to \(\epsilon/2\) and we pay a factor of 2 — small prices for eliminating the unknown \(\mathcal{D}\).

Proof

Setup. Suppose the bad event occurs: \(\sup_h (L_{\mathcal{D}}(h) - L_S(h)) > \epsilon\). This means there exists a hypothesis \(h^*\) with \(L_{\mathcal{D}}(h^*) > L_S(h^*) + \epsilon\). The true error of \(h^*\) is much bigger than its training error — \(h^*\) is overfitting.

Step 1: The ghost sample probably agrees with the truth. Draw \(S'\) independently from \(\mathcal{D}\). Since \(L_{S'}(h^*)\) is the average of \(m\) i.i.d. Bernoulli random variables with mean \(L_{\mathcal{D}}(h^*)\), Hoeffding's inequality gives:

$$ \Pr_{S'}\!\left[L_{S'}(h^*) < L_{\mathcal{D}}(h^*) - \frac{\epsilon}{2}\right] \leq e^{-2m(\epsilon/2)^2} = e^{-m\epsilon^2/2}. $$

For \(m \geq 2/\epsilon^2\), the exponent is \(\leq -1\), so this probability is \(\leq e^{-1} < 1/2\). Therefore:

$$ \Pr_{S'}\!\left[L_{S'}(h^*) \geq L_{\mathcal{D}}(h^*) - \frac{\epsilon}{2}\right] \geq \frac{1}{2}. $$

In plain English: the ghost sample's error for \(h^*\) is within \(\epsilon/2\) of the true error, with probability at least \(1/2\).

Step 2: Chain the inequalities. When the ghost concentrates (which happens with probability \(\geq 1/2\)):

$$ L_{S'}(h^*) \geq L_{\mathcal{D}}(h^*) - \frac{\epsilon}{2} > \left(L_S(h^*) + \epsilon\right) - \frac{\epsilon}{2} = L_S(h^*) + \frac{\epsilon}{2}. $$

So \(L_{S'}(h^*) - L_S(h^*) > \epsilon/2\), which means \(\sup_h (L_{S'}(h) - L_S(h)) > \epsilon/2\).

Step 3: Combine the two events into one probability.

Let's name the two events to keep things clear:

  • Event \(A\) (the "bad event," depends on \(S\)): \(\sup_h (L_{\mathcal{D}}(h) - L_S(h)) > \epsilon\).
  • Event \(B\) (the "ghost event," depends on \(S\) and \(S'\)): \(\sup_h (L_{S'}(h) - L_S(h)) > \epsilon/2\).

Steps 1–2 proved: whenever \(A\) happens, \(B\) happens with probability \(\geq 1/2\) over \(S'\). Written as a conditional probability:

$$ \Pr_{S'}[B \mid A] \geq \frac{1}{2}. $$

Now we use the definition of conditional probability. Recall that \(\Pr[B \mid A] = \Pr[A \text{ and } B] / \Pr[A]\), so:

$$ \Pr[A \text{ and } B] = \Pr[A] \cdot \Pr[B \mid A]. $$

Since \(A \text{ and } B\) implies \(B\) (if both happen, then certainly \(B\) happens):

$$ \Pr[B] \geq \Pr[A \text{ and } B] = \Pr[A] \cdot \Pr[B \mid A] \geq \Pr[A] \cdot \frac{1}{2}. $$

Rearranging:

$$ \Pr[A] \leq 2 \cdot \Pr[B]. $$

Substituting back the event names:

$$ \underbrace{\Pr_S\!\left[\sup_h (L_{\mathcal{D}}(h) - L_S(h)) > \epsilon\right]}_{\text{what we want to bound}} \leq 2 \cdot \underbrace{\Pr_{S,S'}\!\left[\sup_h (L_{S'}(h) - L_S(h)) > \frac{\epsilon}{2}\right]}_{\text{no longer involves } L_{\mathcal{D}}}. \quad \square $$

Numerical Sanity Check

Suppose the bad event \(A\) has probability \(p = 0.10\) (10% of training sets lead to overfitting). Steps 1–2 say: among those unlucky 10% of training sets, at least half of the time a ghost sample would also reveal the problem. So the ghost event \(B\) has probability \(\geq 0.10 \times 0.50 = 0.05\).

Turning this around: \(p \leq 2 \times \Pr[B]\). If we can show that \(\Pr[B] \leq 0.03\), then \(p \leq 0.06\). We've bounded the unknowable event \(A\) using the analyzable event \(B\).

Analogy: The Exam Trick

Imagine a student (hypothesis \(h^*\)) who scores 95% on their practice test (\(L_S = 0.05\)) but actually knows only 70% of the material (\(L_{\mathcal{D}} = 0.30\)). They got lucky: the practice test happened to avoid their weak spots.

Now give them a second, independent practice test (\(S'\)). Since the second test is also drawn from the full material, the student will probably score around 70% again (\(L_{S'} \approx 0.30\)). The gap between their two scores (\(0.30 - 0.05 = 0.25\)) reveals the overfitting, even though we never knew the true difficulty.

That's symmetrization: replace the unknowable "true exam" with a second random exam, and the cheater is caught.

Why “Symmetrization”?

The name comes from what happens next. Once we have two samples \(S\) and \(S'\), the quantity \(L_{S'}(h) - L_S(h)\) is a symmetric function of the combined data: swapping any data point between \(S\) and \(S'\) doesn't change the distribution of the difference. This symmetry is what lets us condition on the combined sample \(C = S \cup S'\) and reduce the problem to analyzing random partitions — which is where the growth function enters.

The next section makes this precise, combining symmetrization with Sauer–Shelah to get the full VC generalization bound.

10. The VC Generalization Bound

What We’re About to Prove and Why It Matters

This is the payoff of everything we've built. We're about to prove the single most important result in classical learning theory: if a hypothesis class has finite VC dimension, then training error reliably approximates true error. Specifically, we'll bound the probability that any hypothesis in the class has a large gap between its training and true error.

Why is this the central result? Because it answers the original question from Section 1: "when does ERM work?" The bound says: collect enough samples (proportional to VC dimension), and training error becomes a trustworthy estimate of true error — simultaneously for every hypothesis — regardless of what distribution generated the data. That's uniform convergence, and we showed in Section 1 that uniform convergence makes ERM work.

Recall what we've assembled to get here:

  • Hoeffding (Section 4): concentrates a single hypothesis's training error around its true error.
  • Sauer–Shelah (Section 8): limits the number of distinct behaviors on \(m\) points to \(\leq (em/d)^d\).
  • Symmetrization (Section 9): replaces the unknowable \(L_{\mathcal{D}}\) with a second sample \(L_{S'}\), and the unknowable distribution with a random partition of fixed points.

Now we chain them together. Here's the proof roadmap before we dive in:

Proof Roadmap (4 Steps)

  1. Symmetrize: Replace \(\Pr[\text{gap with } L_{\mathcal{D}}]\) by \(\Pr[\text{gap between two samples}]\). Cost: factor of 2 and \(\epsilon \to \epsilon/2\).
  2. Fix the pool: Condition on the combined \(2m\) data points. Now the only randomness is which half goes to \(S\) vs \(S'\).
  3. Count behaviors: On these \(2m\) fixed points, Sauer–Shelah says at most \(\Pi_{\mathcal{H}}(2m)\) hypotheses behave differently. So we've reduced an infinite class to a finite one.
  4. Union bound + Hoeffding: For each of the \(\leq \Pi_{\mathcal{H}}(2m)\) distinct behaviors, apply Hoeffding to the random partition. Union bound over all behaviors.

Theorem (VC Generalization Bound) Let \(\operatorname{VCdim}(\mathcal{H}) = d < \infty\). Then: $$ \Pr_S\!\left[\sup_{h \in \mathcal{H}} |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right] \leq 4\, \Pi_{\mathcal{H}}(2m)\, e^{-m\epsilon^2/8} \leq 4\left(\frac{2em}{d}\right)^d e^{-m\epsilon^2/8}. $$

Let's read this before proving it. The left side is the probability that uniform convergence fails — that some hypothesis has training error far from true error. The right side has two competing terms:

  • \(4(2em/d)^d\) — a polynomial in \(m\) that grows with more data (more points means more possible behavior patterns to union-bound over).
  • \(e^{-m\epsilon^2/8}\) — an exponential in \(m\) that shrinks with more data (each individual hypothesis concentrates better).

The exponential decay crushes the polynomial growth. For large enough \(m\), the right side \(\to 0\). That's the whole game: Sauer–Shelah guarantees that the number of things to union-bound over grows only polynomially, and Hoeffding provides exponential concentration for each one. Polynomial times exponentially-small equals vanishingly-small.

Proof

Step 1: Symmetrize. By the symmetrization lemma from Section 9 (applied to both the upper and lower tails, which adds another factor of 2):

$$ \Pr_S\!\left[\sup_h |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right] \leq 2\, \Pr_{S,S'}\!\left[\sup_h |L_S(h) - L_{S'}(h)| > \frac{\epsilon}{2}\right]. $$

We've traded the unknowable \(L_{\mathcal{D}}\) for the concrete \(L_{S'}\). Now both sides of the gap are empirical quantities computed on actual data points.

Step 2: Fix the pool and randomize over the partition. Let \(C = (z_1, \ldots, z_{2m})\) be the combined pool of all \(2m\) data points from \(S\) and \(S'\) together. We condition on \(C\) — that is, we treat these \(2m\) points as fixed and known.

What's still random? Only the split: which \(m\) of the \(2m\) points end up in \(S\) and which end up in \(S'\). This is a uniformly random partition of a fixed pool into two equal halves.

For a fixed hypothesis \(h\), once we know \(C\), we know exactly which points \(h\) gets right and which it gets wrong. Specifically, define the binary vector:

$$ v_h = (\mathbf{1}[h(x_1) \neq y_1],\ \mathbf{1}[h(x_2) \neq y_2],\ \ldots,\ \mathbf{1}[h(x_{2m}) \neq y_{2m}]) \in \{0,1\}^{2m}. $$

This vector records the right/wrong pattern of \(h\) on all \(2m\) points. The difference \(L_S(h) - L_{S'}(h)\) then depends only on \(v_h\) and on which points go to \(S\) vs \(S'\):

$$ L_S(h) - L_{S'}(h) = \frac{\text{(mistakes by } h \text{ on } S\text{)}}{m} - \frac{\text{(mistakes by } h \text{ on } S'\text{)}}{m}. $$

If the random partition sends more of \(h\)'s mistakes to \(S\) than to \(S'\), this difference is positive; if it sends more to \(S'\), it's negative. On average (over random partitions), it's zero.

Step 3: Count distinct behaviors (this is where Sauer–Shelah enters). Here's the key observation: two hypotheses \(h_1\) and \(h_2\) with the same binary vector \(v_{h_1} = v_{h_2}\) will have exactly the same value of \(L_S(h) - L_{S'}(h)\) for every partition. They are indistinguishable on these \(2m\) points.

So the number of distinct behaviors we need to worry about is not \(|\mathcal{H}|\) (which may be infinite), but the number of distinct binary vectors \(v_h\), which is \(|\mathcal{H}_{C_x}|\) (the restriction of \(\mathcal{H}\) to the input points of \(C\)). By the definition of the growth function:

$$ |\mathcal{H}_{C_x}| \leq \Pi_{\mathcal{H}}(2m). $$

By Sauer–Shelah: \(\Pi_{\mathcal{H}}(2m) \leq (2em/d)^d\). We've collapsed an infinite class into at most \((2em/d)^d\) distinct behaviors — a finite number that grows only polynomially in \(m\).

Step 4: Hoeffding for each behavior, then union bound. Fix one behavior pattern \(v \in \{0,1\}^{2m}\). The difference \(L_S(h) - L_{S'}(h)\) is now a function of the random partition only. Think of it this way: we have \(2m\) coins (the entries of \(v\)), and we randomly deal \(m\) of them to pile \(S\) and the other \(m\) to pile \(S'\). The difference in pile averages is the average of \(2m\) terms, each bounded by \(1/m\), with the randomness coming from the assignment.

By Hoeffding's inequality applied to this random partition:

$$ \Pr_{\text{partition}}\!\left[|L_S(h) - L_{S'}(h)| > \frac{\epsilon}{2}\ \Big|\ C\right] \leq 2\exp\!\left(\frac{-2(\epsilon/2)^2}{\sum_{k=1}^{2m}(1/m)^2}\right) = 2\exp\!\left(\frac{-\epsilon^2/2}{2m/m^2}\right) = 2e^{-m\epsilon^2/4}. $$

Wait, where did \(e^{-m\epsilon^2/8}\) in the theorem come from instead of \(e^{-m\epsilon^2/4}\)? The exact constant depends on how carefully you handle the partition (the \(\sigma_k\) are not fully independent due to the constraint that exactly \(m\) go to each side). A more careful analysis gives \(2e^{-m\epsilon^2/8}\). The important point is that it's exponential in \(m\).

Now union bound over all \(\leq \Pi_{\mathcal{H}}(2m)\) distinct behavior patterns:

$$ \Pr_{\text{partition}}\!\left[\sup_h |L_S(h) - L_{S'}(h)| > \frac{\epsilon}{2}\ \Big|\ C\right] \leq 2\, \Pi_{\mathcal{H}}(2m)\, e^{-m\epsilon^2/8}. $$

This holds for every fixed pool \(C\), so it also holds unconditionally (averaging over \(C\) can only preserve or tighten the bound). Combining with the factor of 2 from Step 1:

$$ \Pr_S\!\left[\sup_h |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right] \leq 2 \times 2\, \Pi_{\mathcal{H}}(2m)\, e^{-m\epsilon^2/8} = 4\, \Pi_{\mathcal{H}}(2m)\, e^{-m\epsilon^2/8}. $$

Applying Sauer–Shelah: \(\Pi_{\mathcal{H}}(2m) \leq (2em/d)^d\). \(\square\)

What Does This Bound Actually Tell You?

Setting the bound \(\leq \delta\) and solving for \(m\) gives the sample complexity:

$$ m = O\!\left(\frac{d \ln(m/d) + \ln(1/\delta)}{\epsilon^2}\right) $$

Absorbing the mild logarithmic dependence on \(m\): \(\displaystyle m = O\!\left(\frac{d + \ln(1/\delta)}{\epsilon^2}\right)\) suffices for uniform convergence.

Let's unpack what this formula says in plain English:

  • \(d\) (VC dimension): More complex hypothesis classes need more data. This is the "price of expressiveness" — a class that can represent more patterns needs more evidence to distinguish real patterns from coincidences.
  • \(1/\epsilon^2\): Tighter accuracy demands quadratically more data. Cutting the error tolerance in half requires 4x the data.
  • \(\ln(1/\delta)\): Higher confidence is cheap — only logarithmically more data. Going from 95% confidence to 99.9% confidence barely changes the sample size.
  • No dependence on \(|\mathcal{H}|\): The size of the class — even if uncountably infinite — doesn't appear. This is the miracle of VC theory.

Example: How Many Emails Do You Need?

You're building a spam classifier using linear classifiers in \(\mathbb{R}^{50}\) (50 features extracted from each email). VC dimension: \(d = 51\).

Scenario 1: You want the training error to be within \(\epsilon = 0.05\) (5%) of the true error, with 95% confidence (\(\delta = 0.05\)).

$$ m \approx \frac{51 + \ln(20)}{2 \times 0.0025} = \frac{51 + 3}{0.005} = \frac{54}{0.005} = 10{,}800 \text{ emails.} $$

With about 11,000 labeled emails, you can trust that every linear classifier's training error is within 5% of its true error. ERM will find one that's within 10% of optimal (the \(2\epsilon\) from the ERM argument in Section 1).

Scenario 2: You tighten to \(\epsilon = 0.01\) (1% accuracy).

$$ m \approx \frac{54}{2 \times 0.0001} = \frac{54}{0.0002} = 270{,}000 \text{ emails.} $$

5x tighter accuracy costs 25x more data (the \(1/\epsilon^2\) scaling). This is why practitioners often settle for "good enough" accuracy rather than demanding perfection.

Scenario 3: You boost confidence to \(\delta = 0.001\) (99.9%) while keeping \(\epsilon = 0.05\).

$$ m \approx \frac{51 + \ln(1000)}{0.005} = \frac{51 + 6.9}{0.005} = \frac{57.9}{0.005} = 11{,}580 \text{ emails.} $$

Going from 95% to 99.9% confidence only added ~800 emails. Confidence is cheap. Accuracy is expensive.

The Big Picture

The VC generalization bound resolves the fundamental tension in machine learning:

  • Expressive classes (high VC dimension) can fit complex patterns but need lots of data to distinguish signal from noise.
  • Simple classes (low VC dimension) generalize from little data but might miss real patterns.

The bound quantifies this tradeoff exactly: sample complexity scales linearly with VC dimension. And the fact that it depends on VC dimension (a finite number) rather than \(|\mathcal{H}|\) (potentially infinite) is what makes learning with infinite hypothesis classes possible at all. Linear classifiers in \(\mathbb{R}^{100}\) have VC dimension 101 — so despite there being uncountably many of them, about 10,000 samples suffice for generalization at 5% accuracy.

11. The No-Free-Lunch Theorem

What We Need and Why

In Section 10 we proved the forward direction: finite VC dimension \(\Rightarrow\) learnable. But is the converse true? If a hypothesis class has infinite VC dimension, does that necessarily mean it can't be learned?

The answer is yes, and this is the No-Free-Lunch theorem. It says that for any learning algorithm and any sample size, if \(\operatorname{VCdim}(\mathcal{H}) = \infty\), an adversary can always find a distribution that makes the algorithm fail badly. The intuition: an infinitely expressive class can fit any pattern, so an adversary can always hide a "trap" in the unseen data that no algorithm can anticipate.

Theorem (No Free Lunch) Let \(\mathcal{H}\) be a hypothesis class with \(\operatorname{VCdim}(\mathcal{H}) = \infty\). Then for any learning algorithm \(A\) and any sample size \(m\), there exists a distribution \(\mathcal{D}\) such that \(\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) = 0\) but $$ \mathbb{E}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(A(S))] \geq \frac{1}{4}. $$

Read this carefully: the adversary gets to choose \(\mathcal{D}\) after seeing which algorithm \(A\) the learner uses and how much data \(m\) the learner gets. Yet even with this information, the best the adversary can guarantee is that some \(h \in \mathcal{H}\) achieves zero error (so the problem is "realizable" — there is a perfect answer), but the learner's expected error is still at least \(1/4\). The learner isn't just slightly wrong — it's wrong on a quarter of the distribution, no matter what it does.

Proof: The Adversary's Strategy

Proof

The proof constructs a specific adversarial distribution that defeats any learner. There are three key ideas: (1) use shattering to guarantee a perfect hypothesis exists, (2) hide the true labeling by choosing it at random, (3) show that unseen points are unpredictable.

Step 1: Choose the domain.

Since \(\operatorname{VCdim}(\mathcal{H}) = \infty\), for any number \(n\) we can find a set of size \(n\) that \(\mathcal{H}\) shatters. The learner will draw \(m\) training points, so pick \(n = 2m\) — twice the training set size. Let \(C = \{x_1, x_2, \ldots, x_{2m}\}\) be a set of \(2m\) points that \(\mathcal{H}\) shatters.

Why \(2m\)? We need enough points that, even after the learner sees \(m\) draws, a substantial fraction of \(C\) remains unseen. Choosing \(2m\) ensures (as we'll compute below) that roughly 60% of the points are never seen in the training sample.

Step 2: Choose a random labeling.

The adversary picks a labeling \(f : C \to \{0,1\}\) uniformly at random: each \(f(x_i)\) is an independent fair coin flip, with \(\Pr[f(x_i) = 0] = \Pr[f(x_i) = 1] = 1/2\). There are \(2^{2m}\) possible labelings, and we pick one uniformly.

Now define the distribution \(\mathcal{D}\) to be the uniform distribution on the labeled points \(\{(x_1, f(x_1)),\, (x_2, f(x_2)),\, \ldots,\, (x_{2m}, f(x_{2m}))\}\). Each draw from \(\mathcal{D}\) picks one of the \(2m\) labeled pairs uniformly at random.

Why does a perfect hypothesis exist? Because \(\mathcal{H}\) shatters \(C\), every possible labeling of \(C\) is realized by some \(h \in \mathcal{H}\). In particular, the random labeling \(f\) is realized by some \(h^* \in \mathcal{H}\) with \(h^*(x_i) = f(x_i)\) for all \(i\). This \(h^*\) gets every point right, so \(L_{\mathcal{D}}(h^*) = 0\).

Step 3: Count how many points the learner misses.

The learner draws a training sample \(S\) of \(m\) points, each drawn independently and uniformly from the \(2m\) points in \(C\) (with replacement, since that's how i.i.d. sampling works). Some points will appear multiple times; others won't appear at all. How many distinct points does the learner actually see?

Focus on a single point \(x_j \in C\). On each draw, the probability of picking \(x_j\) is \(1/(2m)\). The probability of not picking \(x_j\) on a single draw is \(1 - 1/(2m)\). Since the \(m\) draws are independent:

$$ \Pr[x_j \text{ never drawn}] = \left(1 - \frac{1}{2m}\right)^m. $$

This is a classic "birthday problem" style calculation. To evaluate this, recall the fundamental limit:

The Key Approximation For large \(n\), \(\left(1 - \frac{1}{n}\right)^n \approx e^{-1}\). More precisely, \(\left(1 - \frac{1}{n}\right)^n\) increases monotonically to \(e^{-1}\) as \(n \to \infty\).

This is essentially the definition of \(e\): \(e = \lim_{n \to \infty}\left(1 + 1/n\right)^n\), so \(e^{-1} = \lim_{n \to \infty}\left(1 - 1/n\right)^n\).

In our case, \(n = 2m\) and the exponent is \(m = n/2\), so:

$$ \left(1 - \frac{1}{2m}\right)^m = \left[\left(1 - \frac{1}{2m}\right)^{2m}\right]^{1/2} \approx \left[e^{-1}\right]^{1/2} = e^{-1/2} = \frac{1}{\sqrt{e}} \approx 0.6065. $$

So each individual point has about a 60.65% chance of being missed entirely!

The expected number of unseen points is (by linearity of expectation, summing over all \(2m\) points):

$$ \mathbb{E}[\text{unseen}] = 2m \cdot \left(1 - \frac{1}{2m}\right)^m \approx \frac{2m}{\sqrt{e}} \approx 1.213m. $$

And the expected number of seen (distinct) points is:

$$ \mathbb{E}[\text{seen}] = 2m - \mathbb{E}[\text{unseen}] = 2m\left[1 - \left(1 - \frac{1}{2m}\right)^m\right] \approx 2m\!\left(1 - \frac{1}{\sqrt{e}}\right) \approx 0.787m. $$

Numerical Check (\(m = 10\))

We have \(2m = 20\) points and draw \(m = 10\) times with replacement.

  • Probability a specific point is never drawn: \((1 - 1/20)^{10} = (19/20)^{10} = 0.95^{10} \approx 0.5987\).
  • Expected unseen: \(20 \times 0.5987 \approx 11.97\) points.
  • Expected seen: \(20 - 11.97 \approx 8.03\) distinct points out of 20.

So even with 10 draws from 20 points, the learner typically sees only about 8 distinct points and misses about 12. The approximation \(20/\sqrt{e} \approx 12.13\) is close to the exact 11.97. \(\checkmark\)

Step 4: Unseen points are unpredictable.

This is the heart of the argument. Consider a point \(x_j\) that the learner never sees in the training sample \(S\). What does the learner know about \(f(x_j)\)?

Absolutely nothing. Here's why: \(f(x_j)\) was chosen by an independent fair coin flip, and the learner's training data contains no information about this coin flip. The training sample tells the learner about \(f(x_i)\) for points that were drawn, but \(f(x_j)\) is independent of all those values (each label was an independent coin flip). So from the learner's perspective, \(f(x_j)\) is equally likely to be 0 or 1, regardless of what the learner saw in training.

This means that for any prediction the learner makes on \(x_j\) — whether it guesses 0 or 1 or flips its own coin — the probability of being wrong is exactly \(1/2\).

Step 5: Compute the expected error.

The learner's true error \(L_{\mathcal{D}}(A(S))\) is the probability of misclassifying a random point from \(\mathcal{D}\). Since \(\mathcal{D}\) is uniform on all \(2m\) points:

$$ L_{\mathcal{D}}(A(S)) = \frac{1}{2m}\sum_{i=1}^{2m} \mathbf{1}[A(S)(x_i) \neq f(x_i)]. $$

Split this into seen and unseen points:

$$ L_{\mathcal{D}}(A(S)) = \frac{1}{2m}\!\left(\sum_{x_i \in \text{seen}} \mathbf{1}[A(S)(x_i) \neq f(x_i)] \;+\; \sum_{x_j \in \text{unseen}} \mathbf{1}[A(S)(x_j) \neq f(x_j)]\right). $$

The first sum (seen points) is \(\geq 0\) — the learner might get them right, might not, but we don't need this term. The second sum is what gives us the lower bound. On each unseen point, the expected error is \(1/2\), so:

$$ \mathbb{E}[L_{\mathcal{D}}(A(S))] \;\geq\; \frac{1}{2m} \cdot \mathbb{E}[\text{unseen}] \cdot \frac{1}{2} \;=\; \frac{1}{2m} \cdot \frac{2m}{\sqrt{e}} \cdot \frac{1}{2} \;=\; \frac{1}{2\sqrt{e}} \;\approx\; 0.303. $$

Since \(\frac{1}{2\sqrt{e}} \approx 0.303 > 0.25 = \frac{1}{4}\), we conclude:

$$ \mathbb{E}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(A(S))] \;\geq\; \frac{1}{2\sqrt{e}} \;\geq\; \frac{1}{4}. \quad \square $$

Concrete Scenario

Suppose \(\mathcal{H}\) is the class of all binary functions on \(\mathbb{R}\) (infinite VC dimension), and a learner gets \(m = 50\) training points.

  • The adversary picks \(2m = 100\) points that \(\mathcal{H}\) shatters and a random labeling.
  • The learner draws 50 points from these 100 (with replacement), seeing about \(100(1 - 0.95^{50}) \approx 100 \times 0.923 = 92.3\) — wait, let's compute correctly: \((1 - 1/100)^{50} = 0.99^{50} \approx 0.605\). So about \(100 \times 0.605 \approx 60.5\) points are unseen, and about 39.5 are seen.
  • On the ~60.5 unseen points, the learner is guessing: expected error rate on those is 1/2.
  • Overall expected error \(\geq 60.5/100 \times 1/2 \approx 0.303\), which exceeds \(1/4\).

The learner has seen less than 40% of the domain and is blind on the rest. No algorithm can beat random guessing on the unseen portion, because the labels there are independent coin flips that the training data reveals nothing about.

Why This Matters

The No-Free-Lunch theorem isn't saying "some algorithms are bad." It's saying all algorithms are bad when the hypothesis class is too expressive. The adversary's construction works against any algorithm \(A\) — neural networks, SVMs, decision trees, even algorithms that haven't been invented yet.

The reason is information-theoretic: with \(m\) samples from a domain of size \(2m\), the learner inevitably misses about 60% of the points. When the labeling is random and the unseen labels are independent of the seen ones, no amount of cleverness helps on the unseen portion.

Combined with Section 10 (finite VC dim \(\Rightarrow\) learnable), this gives us a complete characterization: a class is learnable if and only if its VC dimension is finite.

12. The Fundamental Theorem of Statistical Learning

Taking Stock

We've now proven two directions:

  • Finite VC dim \(\Rightarrow\) learnable (Section 10): the VC generalization bound shows that uniform convergence holds, which makes ERM work.
  • Infinite VC dim \(\Rightarrow\) not learnable (Section 11): the No-Free-Lunch theorem shows that no algorithm can learn.

Together, these give a complete characterization: a hypothesis class is learnable if and only if it has finite VC dimension. But to state this precisely, we need to be clear about what "learnable" means — because there are actually two notions, and they both end up being equivalent.

PAC Learning vs. Agnostic PAC Learning: A Reminder

Back in Section 1, we defined PAC learnability in its agnostic form, which is the stronger version. Let's now clearly distinguish the two notions, since the theorem says they're equivalent.

Definition (PAC Learning — Realizable Case) \(\mathcal{H}\) is PAC learnable if there exists an algorithm \(A\) and sample complexity \(m_{\mathcal{H}}(\epsilon, \delta)\) such that: for every distribution \(\mathcal{D}\) for which some \(h^* \in \mathcal{H}\) achieves zero error (i.e., \(L_{\mathcal{D}}(h^*) = 0\)), if \(m \geq m_{\mathcal{H}}(\epsilon, \delta)\), then with probability \(\geq 1 - \delta\): $$ L_{\mathcal{D}}(A(S)) \leq \epsilon. $$ The assumption \(\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) = 0\) is called realizability: the true labeling rule is in our hypothesis class.

Definition (Agnostic PAC Learning) \(\mathcal{H}\) is agnostically PAC learnable if there exists an algorithm \(A\) and sample complexity \(m_{\mathcal{H}}(\epsilon, \delta)\) such that: for every distribution \(\mathcal{D}\) (no realizability assumption), if \(m \geq m_{\mathcal{H}}(\epsilon, \delta)\), then with probability \(\geq 1 - \delta\): $$ L_{\mathcal{D}}(A(S)) \leq \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon. $$ No hypothesis in \(\mathcal{H}\) needs to be perfect. The learner just needs to get within \(\epsilon\) of the best available hypothesis.

The Difference at a Glance

  • PAC (realizable): "Assuming the answer is in \(\mathcal{H}\), find one that's close to perfect." The guarantee is absolute: error \(\leq \epsilon\).
  • Agnostic PAC: "Find one that's close to the best in \(\mathcal{H}\), even if the best isn't perfect." The guarantee is relative: error \(\leq \operatorname{opt} + \epsilon\), where \(\operatorname{opt} = \min_h L_{\mathcal{D}}(h)\).

Agnostic PAC is strictly harder: it must work for every distribution, including ones where no \(h \in \mathcal{H}\) fits the data well. PAC learning only needs to work when the problem is "nice" (realizable). So agnostic PAC learnable \(\Rightarrow\) PAC learnable, but the converse isn't obvious at all.

Example: If you're classifying emails with linear classifiers but emails aren't linearly separable, the PAC guarantee says nothing (realizability fails). The agnostic guarantee still promises you'll get within \(\epsilon\) of the best linear classifier — even though that best classifier has nonzero error.

The remarkable content of the Fundamental Theorem is that these two notions are equivalent: if a class is PAC learnable (the easy version), it's also agnostically PAC learnable (the hard version). You don't get to be "learnable only when the problem is nice." Either the class is learnable in all situations, or it's not learnable at all.

The Theorem

The Fundamental Theorem of Statistical Learning Let \(\mathcal{H}\) be a hypothesis class of functions from \(\mathcal{X}\) to \(\{0,1\}\). The following are equivalent:

  1. \(\mathcal{H}\) has the uniform convergence property (Section 1: training error tracks true error for all \(h\) simultaneously).
  2. \(\mathcal{H}\) is agnostically PAC learnable (by any ERM algorithm).
  3. \(\mathcal{H}\) is PAC learnable (in the realizable sense).
  4. \(\operatorname{VCdim}(\mathcal{H}) < \infty\).

The Proof Chain

We prove this by establishing a cycle of implications: (4) \(\Rightarrow\) (1) \(\Rightarrow\) (2) \(\Rightarrow\) (3) \(\Rightarrow\) (4). Each arrow has already been proven or is straightforward.

Proof

(4) \(\Rightarrow\) (1): Finite VC dim \(\Rightarrow\) Uniform Convergence.

This is the VC generalization bound from Section 10. We proved it through the chain: finite VC dim \(\xrightarrow{\text{Sauer–Shelah (Sec. 8)}}\) polynomial growth function \(\xrightarrow{\text{symmetrization (Sec. 9) + Hoeffding (Sec. 4)}}\) uniform convergence. Specifically, Section 10 showed:

$$ \Pr\!\left[\sup_{h \in \mathcal{H}} |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right] \leq 4\left(\frac{2em}{d}\right)^d e^{-m\epsilon^2/8}, $$

which goes to zero for any fixed \(\epsilon\) as \(m \to \infty\).

(1) \(\Rightarrow\) (2): Uniform Convergence \(\Rightarrow\) Agnostically PAC Learnable via ERM.

This is the three-step argument from Section 1. If \(|L_S(h) - L_{\mathcal{D}}(h)| \leq \epsilon\) for all \(h\), then ERM's output \(\hat{h}\) satisfies \(L_{\mathcal{D}}(\hat{h}) \leq \min_h L_{\mathcal{D}}(h) + 2\epsilon\). (Step 1: true error of ERM's pick is close to its training error. Step 2: ERM picks the lowest training error. Step 3: training error of the best \(h\) is close to its true error. Chain them.) Agnostic PAC learnability follows with the ERM algorithm.

(2) \(\Rightarrow\) (3): Agnostic PAC \(\Rightarrow\) PAC.

This is immediate from the definitions. Agnostic PAC learnability guarantees \(L_{\mathcal{D}}(A(S)) \leq \min_h L_{\mathcal{D}}(h) + \epsilon\) for every distribution. PAC learnability only requires this for distributions where \(\min_h L_{\mathcal{D}}(h) = 0\), which is a subset of all distributions. A stronger guarantee implies a weaker one.

(3) \(\Rightarrow\) (4): PAC Learnable \(\Rightarrow\) Finite VC dim.

This is the contrapositive of the No-Free-Lunch theorem (Section 11). We proved: if \(\operatorname{VCdim}(\mathcal{H}) = \infty\), then the class is not even PAC learnable (the adversary can defeat any algorithm even under realizability). Equivalently: if \(\mathcal{H}\) is PAC learnable, then \(\operatorname{VCdim}(\mathcal{H}) < \infty\).

The cycle (4) \(\Rightarrow\) (1) \(\Rightarrow\) (2) \(\Rightarrow\) (3) \(\Rightarrow\) (4) is complete. \(\square\)

Where Does the Sample Complexity Formula Come From?

The theorem also pins down how many samples are needed:

Sample Complexity Bounds Let \(d = \operatorname{VCdim}(\mathcal{H})\). The sample complexity \(m_{\mathcal{H}}(\epsilon, \delta)\) for agnostic PAC learning satisfies: $$ C_1 \frac{d + \ln(1/\delta)}{\epsilon^2} \;\leq\; m_{\mathcal{H}}(\epsilon, \delta) \;\leq\; C_2 \frac{d\ln(1/\epsilon) + \ln(1/\delta)}{\epsilon^2} $$ for universal constants \(C_1, C_2 > 0\).

This double inequality says two things: how many samples suffice (upper bound), and how many are necessary (lower bound). Let's trace each one back to what we've proven.

The upper bound (how many suffice) comes from the VC generalization bound in Section 10. We showed:

$$ \Pr\!\left[\sup_h |L_S(h) - L_{\mathcal{D}}(h)| > \epsilon\right] \leq 4\left(\frac{2em}{d}\right)^d e^{-m\epsilon^2/8}. $$

Setting this \(\leq \delta\) and solving for \(m\): the exponential \(e^{-m\epsilon^2/8}\) eventually dominates the polynomial \((2em/d)^d\), so the right-hand side drops below \(\delta\) when \(m\) is large enough. The polynomial contributes a \(d\ln(m)\) term in the exponent (since \(\ln[(2em/d)^d] = d\ln(2em/d)\)), and the \(\ln(m)\) can be absorbed into a \(\ln(1/\epsilon)\) factor by a standard argument (since we need \(m \sim 1/\epsilon^2\), we have \(\ln m \sim \ln(1/\epsilon)\)). Together with the \(\ln(1/\delta)\) from setting the tail probability to \(\delta\), this gives:

$$ m \leq C_2 \frac{d\ln(1/\epsilon) + \ln(1/\delta)}{\epsilon^2}. $$

This is essentially what Section 10 established (with the simplified form \(m = O((d + \ln(1/\delta))/\epsilon^2)\) after absorbing the logarithmic \(\ln(1/\epsilon)\) into the constant).

The lower bound (how many are necessary) comes from information theory. The core idea is beautiful: learning is fundamentally a hypothesis testing problem. The learner sees noisy data and must figure out which of many possible "worlds" (distributions) generated it. Information theory tells us exactly how hard this identification is.

To prove this, we need to build three tools from scratch: total variation distance (how "different" two distributions look), KL divergence (how much evidence each sample provides), and Pinsker's inequality (connecting them). These are foundational tools that appear everywhere in statistics, information theory, and machine learning — building them here pays dividends far beyond this proof.

Tool 1: Total Variation Distance

When are two probability distributions "close"? We need a way to measure this. The most natural notion: two distributions are close if no event has very different probability under them.

Definition (Total Variation Distance) For two probability distributions \(P\) and \(Q\) on the same space \(\Omega\), the total variation distance is: $$ \operatorname{TV}(P, Q) = \sup_{A \subseteq \Omega} |P(A) - Q(A)| = \frac{1}{2}\sum_{\omega \in \Omega} |P(\omega) - Q(\omega)|. $$ The two expressions are equal (for discrete distributions), and \(\operatorname{TV}(P,Q) \in [0, 1]\).

Where does this come from? Total variation was formalized in measure theory, but the idea is ancient: it's the largest difference in probability that any event can exhibit between the two distributions. If \(\operatorname{TV}(P, Q) = 0\), the distributions are identical. If \(\operatorname{TV}(P, Q) = 1\), they are completely distinguishable (supported on disjoint sets).

Example (Two Coins)

Let \(P = \operatorname{Bernoulli}(0.6)\) and \(Q = \operatorname{Bernoulli}(0.4)\). Then:

$$ \operatorname{TV}(P, Q) = \frac{1}{2}(|0.6 - 0.4| + |0.4 - 0.6|) = \frac{1}{2}(0.2 + 0.2) = 0.2. $$

The most distinguishing event is \(A = \{1\}\) (heads): \(P(A) = 0.6\), \(Q(A) = 0.4\), gap \(= 0.2\). \(\checkmark\)

Example (Fair vs. Biased Die)

Let \(P\) = fair die (each face has probability \(1/6\)) and \(Q\) = loaded die with \(Q(6) = 1/2\) and \(Q(i) = 1/10\) for \(i = 1, \ldots, 5\). Then:

$$ \operatorname{TV}(P, Q) = \frac{1}{2}\!\left(5 \cdot |1/6 - 1/10| + |1/6 - 1/2|\right) = \frac{1}{2}\!\left(5 \cdot \frac{1}{15} + \frac{1}{3}\right) = \frac{1}{2}\!\left(\frac{1}{3} + \frac{1}{3}\right) = \frac{1}{3}. $$

A single roll gives you a 1/3 chance of noticing the difference — not great for telling the dice apart.

The key property for us: total variation directly controls how well any algorithm can distinguish \(P\) from \(Q\).

Lemma (Testing Bound) Let \(P\) and \(Q\) be two distributions. Any algorithm that sees one sample and tries to decide whether it came from \(P\) or \(Q\) has error probability at least: $$ \Pr[\text{error}] \geq \frac{1 - \operatorname{TV}(P, Q)}{2}. $$

Proof

The algorithm is a function \(\psi : \Omega \to \{P, Q\}\) (it sees a sample \(\omega\) and guesses which distribution generated it). The error probability (averaged over the two hypotheses being equally likely) is:

$$ \Pr[\text{error}] = \frac{1}{2}P(\psi(\omega) = Q) + \frac{1}{2}Q(\psi(\omega) = P). $$

Let \(A = \{\omega : \psi(\omega) = P\}\) be the "accept \(P\)" region. Then:

$$ \Pr[\text{error}] = \frac{1}{2}(1 - P(A)) + \frac{1}{2}Q(A) = \frac{1}{2} + \frac{1}{2}(Q(A) - P(A)) \geq \frac{1}{2} - \frac{1}{2}\sup_A |P(A) - Q(A)| = \frac{1 - \operatorname{TV}(P,Q)}{2}. \;\square $$

In words: if two distributions are close in total variation, no test can reliably tell them apart. If \(\operatorname{TV} = 0.1\), any test has error \(\geq 45\%\) — barely better than random guessing.

Tool 2: KL Divergence

Total variation (Tool 1) tells us the final answer: "how well can the best test distinguish \(P\) from \(Q\)?" But it doesn't tell us how to get there, or how the answer changes as we collect more data. For that we need a different measure — one that tracks how much evidence each individual sample provides. That measure is KL divergence.

To build intuition for what "evidence" means and why KL divergence is the right formula, let's think through a concrete scenario step by step.

Motivating Example: The Detective and Two Suspects

A coin is either 60% heads (suspect \(P\)) or 40% heads (suspect \(Q\)). You flip it once and see heads. How much evidence does this give you?

The natural measure is the likelihood ratio: how much more likely is heads under \(P\) than under \(Q\)?

$$ \frac{P(\text{heads})}{Q(\text{heads})} = \frac{0.6}{0.4} = 1.5. $$

Heads is 1.5× more likely under \(P\). That's some evidence for \(P\), but not overwhelming.

Now suppose you flip again and see tails. The likelihood ratio for tails is:

$$ \frac{P(\text{tails})}{Q(\text{tails})} = \frac{0.4}{0.6} \approx 0.667. $$

Tails is less likely under \(P\) — it's evidence against \(P\). The combined likelihood ratio for (heads, tails) is the product: \(1.5 \times 0.667 = 1.0\). The two flips cancelled out!

The problem with raw likelihood ratios: they multiply. After \(m\) flips, you'd have a product of \(m\) terms, which is algebraically unwieldy (products are hard to analyze, don't simplify with linearity of expectation, and can overflow or underflow numerically).

The fix: take the logarithm. The log-likelihood ratio for a single outcome \(\omega\) is:

$$ \ell(\omega) = \ln \frac{P(\omega)}{Q(\omega)}. $$

For heads: \(\ell(\text{H}) = \ln(1.5) = 0.405\). For tails: \(\ell(\text{T}) = \ln(0.667) = -0.405\).

Now the combined evidence for (heads, tails) is the sum: \(0.405 + (-0.405) = 0\). Logarithms turn products into sums — and sums are what probability theory handles best (linearity of expectation, law of large numbers, central limit theorem all work with sums).

So the log-likelihood ratio \(\ln(P(\omega)/Q(\omega))\) is the "evidence from one observation." But different outcomes give different amounts of evidence (heads favors \(P\), tails favors \(Q\)). What is the average evidence per observation, when the data really does come from \(P\)?

That average is exactly the KL divergence:

Definition (KL Divergence) For distributions \(P\) and \(Q\) on the same space \(\Omega\), with \(P(\omega) > 0 \Rightarrow Q(\omega) > 0\), the Kullback–Leibler divergence is: $$ \operatorname{KL}(P \| Q) = \sum_{\omega} P(\omega) \ln \frac{P(\omega)}{Q(\omega)} = \mathbb{E}_{\omega \sim P}\!\left[\ln \frac{P(\omega)}{Q(\omega)}\right]. $$

Let's unpack each piece of this formula:

  • \(P(\omega)/Q(\omega)\) is the likelihood ratio — "how much more likely is outcome \(\omega\) under \(P\) than under \(Q\)?" If this is 3, the outcome is 3× more likely under \(P\).
  • \(\ln(P(\omega)/Q(\omega))\) is the log-likelihood ratio — the evidence that outcome \(\omega\) provides in favor of \(P\) over \(Q\). Positive means evidence for \(P\); negative means evidence for \(Q\); zero means the outcome is equally likely under both. We use the natural logarithm (base \(e\)), so evidence is measured in nats (if we used log base 2, the unit would be "bits").
  • \(\sum P(\omega) \cdot [\ldots]\) is the expected value under \(P\) — we weight each outcome's evidence by how often it actually occurs (under the true distribution \(P\)).

So \(\operatorname{KL}(P \| Q)\) is the average evidence per sample in favor of \(P\), when the data really comes from \(P\). It answers: "if \(P\) is the truth, how quickly do the data accumulate evidence that points toward \(P\) and away from \(Q\)?"

Where does the name come from? Solomon Kullback and Richard Leibler introduced this quantity in 1951, building on Shannon's 1948 information theory. It appears naturally in hypothesis testing (Neyman–Pearson), data compression (source coding), and Bayesian inference (posterior updates). It's not a metric — \(\operatorname{KL}(P \| Q) \neq \operatorname{KL}(Q \| P)\) in general — because it asks an asymmetric question: "how much evidence do samples from \(P\) provide against \(Q\)?"

The first essential property: KL divergence is always non-negative. This makes intuitive sense: on average, data from \(P\) should favor \(P\) over any alternative \(Q\), never the reverse.

Gibbs' Inequality: \(\operatorname{KL}(P \| Q) \geq 0\), with equality iff \(P = Q\).

Proof

We use the concavity of the logarithm. Since \(\ln\) is concave, Jensen's inequality gives:

$$ -\operatorname{KL}(P \| Q) = \sum_{\omega} P(\omega) \ln \frac{Q(\omega)}{P(\omega)} \leq \ln\!\left(\sum_{\omega} P(\omega) \cdot \frac{Q(\omega)}{P(\omega)}\right) = \ln\!\left(\sum_{\omega} Q(\omega)\right) = \ln(1) = 0. $$

So \(\operatorname{KL}(P \| Q) \geq 0\). Equality holds iff \(Q(\omega)/P(\omega)\) is constant for all \(\omega\) with \(P(\omega) > 0\), which forces \(P = Q\). \(\square\)

In words: data from \(P\) never, on average, provides evidence in favor of any other distribution \(Q\). The truth always wins in expectation. (Individual samples can provide misleading evidence — a tails from the 60% coin makes \(Q\) look more likely — but on average over many samples, the truth accumulates.)

Example (Bernoulli KL Divergence — In Detail)

For \(P = \operatorname{Ber}(0.6)\) and \(Q = \operatorname{Ber}(0.4)\):

$$ \operatorname{KL}(P \| Q) = \underbrace{P(\text{H})}_{0.6} \cdot \underbrace{\ln \frac{P(\text{H})}{Q(\text{H})}}_{\ln(0.6/0.4)} + \underbrace{P(\text{T})}_{0.4} \cdot \underbrace{\ln \frac{P(\text{T})}{Q(\text{T})}}_{\ln(0.4/0.6)}. $$

Let's compute each term and understand what it means:

  • Heads term: evidence from heads = \(\ln(0.6/0.4) = \ln(1.5) = +0.405\) nats (positive: heads favors \(P\)). This happens with probability 0.6 under \(P\). Contribution: \(0.6 \times 0.405 = 0.243\).
  • Tails term: evidence from tails = \(\ln(0.4/0.6) = \ln(0.667) = -0.405\) nats (negative: tails favors \(Q\)). This happens with probability 0.4 under \(P\). Contribution: \(0.4 \times (-0.405) = -0.162\).

$$ \operatorname{KL}(P \| Q) = 0.243 + (-0.162) = 0.081 \text{ nats}. $$

What does 0.081 nats mean concretely? Each flip of the 60% coin gives you, on average, 0.081 nats of evidence that you're looking at the 60% coin rather than the 40% coin. This is small because the two coins almost cancel: heads gives +0.405 nats toward \(P\), but tails gives -0.405 nats toward \(Q\), and since heads only happens a bit more often (60% vs 40%), the net is a tiny +0.081.

To build up to ~1 nat of total evidence (a rough threshold where you can start to reliably distinguish the coins), you'd need about \(1/0.081 \approx 12\) flips. With only 3 or 4 flips, you have ~0.25–0.32 nats total — barely any signal through the noise.

Example (When KL is Large vs. Small)

Extreme case: \(P = \operatorname{Ber}(0.99)\), \(Q = \operatorname{Ber}(0.01)\).

$$ \operatorname{KL}(P \| Q) = 0.99 \ln\frac{0.99}{0.01} + 0.01\ln\frac{0.01}{0.99} = 0.99(4.595) + 0.01(-4.595) = 4.509 \text{ nats}. $$

One flip gives 4.5 nats of evidence! If you see heads (which happens 99% of the time under \(P\)), the log-likelihood ratio is \(\ln(99) = 4.6\), which is overwhelming evidence for \(P\). Even a single flip is enough to tell these coins apart.

Subtle case: \(P = \operatorname{Ber}(0.51)\), \(Q = \operatorname{Ber}(0.49)\). Then \(\operatorname{KL} \approx 0.0008\) nats. You'd need \(\sim 1/0.0008 = 1{,}250\) flips to accumulate 1 nat. Distinguishing a 51% coin from a 49% coin is genuinely hard.

We'll need the Bernoulli KL divergence for small \(\epsilon\). The following bound is key:

Lemma (Small-Bias Bernoulli KL) For \(|\epsilon| < 1/2\): $$ \operatorname{KL}\!\left(\operatorname{Ber}(1/2 + \epsilon) \,\big\|\, \operatorname{Ber}(1/2 - \epsilon)\right) \leq 8\epsilon^2. $$

Proof

Let \(p = 1/2 + \epsilon\) and \(q = 1/2 - \epsilon\). We need to show \(p\ln(p/q) + (1-p)\ln((1-p)/(1-q)) \leq 8\epsilon^2\).

Note that \(p/q = (1/2 + \epsilon)/(1/2 - \epsilon) = (1 + 2\epsilon)/(1 - 2\epsilon)\), and \((1-p)/(1-q) = (1/2 - \epsilon)/(1/2 + \epsilon) = (1 - 2\epsilon)/(1 + 2\epsilon)\). Let \(t = 2\epsilon\), so \(|t| < 1\). Then:

$$ \operatorname{KL} = \frac{1+t}{2}\ln\frac{1+t}{1-t} + \frac{1-t}{2}\ln\frac{1-t}{1+t}. $$

Since the second term is the negative of \(\frac{1-t}{2}\ln\frac{1+t}{1-t}\), this simplifies to:

$$ \operatorname{KL} = \frac{(1+t) - (1-t)}{2}\ln\frac{1+t}{1-t} = t \cdot \ln\frac{1+t}{1-t}. $$

Now use the Taylor series: \(\ln\frac{1+t}{1-t} = \ln(1+t) - \ln(1-t) = 2t + \frac{2t^3}{3} + \frac{2t^5}{5} + \cdots \leq 2t + 2t^3 + 2t^5 + \cdots = \frac{2t}{1-t^2}\). For \(|t| \leq 1/\sqrt{2}\) (i.e., \(|\epsilon| \leq 1/(2\sqrt{2}) \approx 0.354\)), we have \(1/(1-t^2) \leq 2\), so:

$$ \operatorname{KL} \leq t \cdot \frac{2t}{1-t^2} = \frac{2t^2}{1-t^2} \leq 4t^2 = 16\epsilon^2. $$

A tighter analysis using \(\ln\frac{1+t}{1-t} \leq 2t/(1-t^2) \leq 2t + 4t^3\) for \(|t| \leq 0.9\) gives \(\operatorname{KL} \leq 2t^2 + 4t^4 \leq 2t^2(1 + 2t^2)\). For \(|\epsilon| < 1/2\), \(t^2 = 4\epsilon^2 < 1\), so \(\operatorname{KL} \leq 2t^2 \cdot 2 = 4t^2 = 16\epsilon^2\). The constant 8 follows from the sharper bound \(\operatorname{KL} \leq \frac{2t^2}{1-t^2} \leq 8\epsilon^2\) which holds for \(|\epsilon| \leq 1/4\) (since then \(t^2 \leq 1/4\) and \(1/(1-t^2) \leq 4/3\), giving \(\operatorname{KL} \leq 2t^2 \cdot 4/3 = 8t^2/3 \leq 8\epsilon^2\) since \(t = 2\epsilon\)).

For our purposes, the key message is: \(\operatorname{KL} = \Theta(\epsilon^2)\) — the KL divergence between two coins that differ by \(\epsilon\) is proportional to \(\epsilon^2\). \(\square\)

Numerical Verification

For \(\epsilon = 0.1\): exact \(\operatorname{KL} = 0.081\), bound gives \(8 \times 0.01 = 0.08\). Very tight! (The bound is actually slightly conservative here.)

For \(\epsilon = 0.05\): exact \(\operatorname{KL} \approx 0.020\), bound gives \(8 \times 0.0025 = 0.02\). Nearly exact. \(\checkmark\)

The second essential property: KL divergence is additive for independent samples.

Product Property (Tensorization) If \(S = (Z_1, \ldots, Z_m)\) consists of \(m\) i.i.d. draws from \(P\) (respectively \(Q\)), then: $$ \operatorname{KL}(P^m \| Q^m) = m \cdot \operatorname{KL}(P \| Q). $$

Proof

For a product distribution, \(P^m(z_1, \ldots, z_m) = \prod_{i=1}^m P(z_i)\). So:

$$ \operatorname{KL}(P^m \| Q^m) = \mathbb{E}_{P^m}\!\left[\ln \frac{\prod P(Z_i)}{\prod Q(Z_i)}\right] = \mathbb{E}_{P^m}\!\left[\sum_{i=1}^m \ln \frac{P(Z_i)}{Q(Z_i)}\right] = \sum_{i=1}^m \mathbb{E}_P\!\left[\ln \frac{P(Z_i)}{Q(Z_i)}\right] = m \cdot \operatorname{KL}(P \| Q). $$

The key step: logarithm converts products to sums, and expectation is linear. \(\square\)

In words: each independent sample contributes exactly \(\operatorname{KL}(P \| Q)\) nats of evidence (its expected log-likelihood ratio). After \(m\) independent samples, the total evidence is the sum: \(m \cdot \operatorname{KL}(P \| Q)\) nats. This is precisely why we took logarithms in the first place — evidence from independent observations adds up. To reliably distinguish \(P\) from \(Q\), the total evidence needs to be on the order of 1 nat (roughly speaking, the log-likelihood ratio needs to be large enough to overcome the noise). So the number of samples needed scales as \(m \sim 1/\operatorname{KL}(P \| Q)\).

Example (Coin Flipping — Evidence Accumulation)

Returning to our 60%-vs-40% coin: \(\operatorname{KL}(P \| Q) = 0.081\) nats per flip. After \(m\) flips, the total evidence is \(0.081m\) nats.

  • \(m = 3\) flips: total \(= 0.24\) nats. Still mostly noise — you'd guess wrong about 40% of the time.
  • \(m = 12\) flips: total \(= 0.97\) nats. Roughly 1 nat — now you have a meaningful signal. Pinsker's inequality (next tool) will formalize this: the total variation between the two 12-sample distributions is bounded away from zero, so a good test can start to reliably distinguish them.
  • \(m = 50\) flips: total \(= 4.05\) nats. Strong evidence — you'd almost never guess wrong.

For coins with bias \(1/2 \pm \epsilon\): \(\operatorname{KL} \approx 8\epsilon^2\) per flip, so accumulating 1 nat takes \(m \approx 1/(8\epsilon^2)\) flips. With \(\epsilon = 0.01\) (51% vs 49%): \(m \approx 1{,}250\) flips. With \(\epsilon = 0.001\) (50.1% vs 49.9%): \(m \approx 125{,}000\). The difficulty scales as \(1/\epsilon^2\) — a preview of the sample complexity lower bound we're building toward.

Tool 3: Pinsker’s Inequality

We have two ways to measure how different distributions are: total variation (directly controls testing error) and KL divergence (easy to compute, additive over samples). Pinsker's inequality links them. It was proved by Mark Pinsker in 1964 in the context of information theory, and it's one of the most-used inequalities in theoretical statistics and machine learning.

Pinsker’s Inequality $$ \operatorname{TV}(P, Q) \leq \sqrt{\frac{\operatorname{KL}(P \| Q)}{2}}. $$

Why is this useful? It lets us upper-bound total variation (which controls testing error) using KL divergence (which is easy to compute for product distributions). Together with the tensorization property: if \(\operatorname{KL}(P \| Q)\) is small per sample, then even after \(m\) samples, the total variation \(\operatorname{TV}(P^m, Q^m) \leq \sqrt{m \cdot \operatorname{KL}(P\|Q)/2}\), and if this is still small, no test can distinguish them.

Proof

The proof has two steps: reduce to binary distributions (via data processing), then prove the binary case using calculus.

Step 1: Reduce to two outcomes. Let \(A = \{\omega : P(\omega) > Q(\omega)\}\). Define binary distributions \(\tilde{P} = (P(A), 1 - P(A))\) and \(\tilde{Q} = (Q(A), 1 - Q(A))\). Then \(\operatorname{TV}(P, Q) = P(A) - Q(A) = \operatorname{TV}(\tilde{P}, \tilde{Q})\).

The data processing inequality for KL divergence says that coarsening a distribution (grouping outcomes into sets) can only decrease KL: \(\operatorname{KL}(\tilde{P} \| \tilde{Q}) \leq \operatorname{KL}(P \| Q)\). (Intuitively: aggregating outcomes throws away information. Formally: this follows from the log-sum inequality, \(\sum a_i \ln(a_i/b_i) \geq (\sum a_i)\ln(\sum a_i / \sum b_i)\), applied to the groups \(A\) and \(A^c\).)

So if we prove Pinsker for the binary pair \((\tilde{P}, \tilde{Q})\), the general case follows:

$$ \operatorname{TV}(P, Q) = \operatorname{TV}(\tilde{P}, \tilde{Q}) \leq \sqrt{\operatorname{KL}(\tilde{P}\|\tilde{Q})/2} \leq \sqrt{\operatorname{KL}(P\|Q)/2}. $$

Step 2: Prove the binary case. We need: for \(P = \operatorname{Ber}(p)\) and \(Q = \operatorname{Ber}(q)\) with \(p, q \in (0,1)\):

$$ \operatorname{KL}(P\|Q) = p\ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q} \;\geq\; 2(p - q)^2. $$

Fix \(p \in (0,1)\) and define:

$$ r(q) = \operatorname{KL}(\operatorname{Ber}(p) \| \operatorname{Ber}(q)) - 2(p - q)^2 = p\ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q} - 2(p-q)^2. $$

We want to show \(r(q) \geq 0\) for all \(q \in (0,1)\). The strategy: find all critical points of \(r\), check that \(r \geq 0\) at each one, and use boundary behavior to conclude \(r \geq 0\) everywhere.

Critical points. Compute:

$$ r'(q) = -\frac{p}{q} + \frac{1-p}{1-q} + 4(p-q). $$

At \(q = p\): \(r'(p) = -1 + 1 + 0 = 0\). And \(r(p) = 0 - 0 = 0\). So \(q = p\) is a critical point with \(r(p) = 0\).

Is \(q = p\) a minimum? Compute:

$$ r''(q) = \frac{p}{q^2} + \frac{1-p}{(1-q)^2} - 4. $$

At \(q = p\): \(r''(p) = \frac{1}{p} + \frac{1}{1-p} - 4 = \frac{1}{p(1-p)} - 4\).

Since \(p(1-p) \leq 1/4\) for all \(p \in (0,1)\) (by AM-GM: \(p + (1-p) = 1\), so \(p(1-p) \leq (1/2)^2 = 1/4\)), we have \(1/(p(1-p)) \geq 4\), so \(r''(p) \geq 0\). The critical point \(q = p\) is a local minimum. \(\checkmark\)

Are there other critical points? Setting \(r'(q) = 0\) and clearing fractions:

$$ -\frac{p}{q} + \frac{1-p}{1-q} + 4(p-q) = 0 $$ $$ \frac{p(1-q) - (1-p)q}{q(1-q)} = 4(q - p) $$ $$ \frac{p - q}{q(1-q)} = 4(p - q). $$

If \(p \neq q\), we can divide both sides by \((p - q)\):

$$ \frac{1}{q(1-q)} = 4, \qquad \text{so } q(1-q) = \frac{1}{4}, \qquad \text{hence } q = \frac{1}{2}. $$

So the only critical points of \(r\) on \((0,1)\) are \(q = p\) and \(q = 1/2\).

Check \(r(1/2) \geq 0\).

$$ r(1/2) = p\ln(2p) + (1-p)\ln(2(1-p)) - 2\!\left(p - \tfrac{1}{2}\right)^{\!2}. $$

The first part simplifies: \(p\ln(2p) + (1-p)\ln(2(1-p)) = \ln 2 + p\ln p + (1-p)\ln(1-p) = \ln 2 - H(p)\), where \(H(p) = -p\ln p - (1-p)\ln(1-p)\) is the binary entropy. This is exactly \(\operatorname{KL}(\operatorname{Ber}(p) \| \operatorname{Ber}(1/2))\) — the divergence from the fair coin. So:

$$ r(1/2) = \operatorname{KL}(\operatorname{Ber}(p) \| \operatorname{Ber}(1/2)) - 2\!\left(p - \tfrac{1}{2}\right)^{\!2}. $$

But this is Pinsker's inequality applied to \(Q = \operatorname{Ber}(1/2)\) — is this circular? No: we can verify this specific case directly. Define \(s(p) = r(1/2)\) as a function of \(p\). Then \(s(1/2) = 0\), \(s'(p) = \ln(p/(1-p)) - 4(p - 1/2)\), and \(s'(1/2) = 0\). The second derivative is \(s''(p) = 1/(p(1-p)) - 4 \geq 0\) (same AM-GM argument). So \(p = 1/2\) is a minimum of \(s\) with \(s(1/2) = 0\). Since \(s(p) \to +\infty\) as \(p \to 0^+\) or \(p \to 1^-\), and the only other critical point would again satisfy \(p(1-p) = 1/4\), i.e., \(p = 1/2\), there are no other critical points. So \(s(p) = r(1/2) \geq 0\) for all \(p\). \(\checkmark\)

Conclusion. The function \(r(q)\) has exactly two critical points: \(q = p\) (where \(r = 0\)) and \(q = 1/2\) (where \(r \geq 0\)). Since \(r(q) \to +\infty\) as \(q \to 0^+\) or \(q \to 1^-\) (KL divergence blows up at the boundaries), and the only critical points both have \(r \geq 0\), the function cannot dip below zero anywhere. Therefore \(r(q) \geq 0\) for all \(q \in (0,1)\).

Restating: \(\operatorname{KL}(\operatorname{Ber}(p) \| \operatorname{Ber}(q)) \geq 2(p - q)^2 = 2\operatorname{TV}(P,Q)^2\). Combined with Step 1:

$$ \operatorname{TV}(P, Q) \leq \sqrt{\frac{\operatorname{KL}(P \| Q)}{2}}. \quad \square $$

Numerical Check

Let \(p = 0.7\), \(q = 0.3\):

  • \(\operatorname{KL} = 0.7\ln(7/3) + 0.3\ln(3/7) = 0.7(0.847) + 0.3(-0.847) = 0.339\).
  • \(2(p-q)^2 = 2(0.16) = 0.32\). Check: \(0.339 \geq 0.32\). \(\checkmark\)
  • Pinsker: \(\operatorname{TV} = 0.4 \leq \sqrt{0.339/2} = \sqrt{0.170} = 0.412\). \(\checkmark\)

Verify the critical point at \(q = 1/2\): \(r'(0.5) = -0.7/0.5 + 0.3/0.5 + 4(0.2) = -1.4 + 0.6 + 0.8 = 0\). \(\checkmark\)

\(r(0.5) = 0.7\ln(1.4) + 0.3\ln(0.6) - 2(0.04) = 0.236 - 0.153 - 0.08 = 0.003 \geq 0\). \(\checkmark\)

Numerical Check

\(P = \operatorname{Ber}(0.6)\), \(Q = \operatorname{Ber}(0.4)\). We computed \(\operatorname{TV} = 0.2\) and \(\operatorname{KL} = 0.081\).

Pinsker says: \(\operatorname{TV} \leq \sqrt{0.081/2} = \sqrt{0.0405} \approx 0.201\). Actual TV = 0.2. Very tight! \(\checkmark\)

Tool 4: Le Cam’s Two-Point Method

We now have all the ingredients to measure how hard it is to tell two distributions apart: total variation says how well the best test can do (Tool 1), KL divergence says how much evidence each sample provides (Tool 2), and Pinsker connects them (Tool 3). But we haven't yet connected "telling distributions apart" to "learning." That's what Le Cam's method does.

The core idea is disarmingly simple. Imagine you're a doctor diagnosing patients. There are two possible diseases, A and B, that produce nearly identical symptoms. For each disease, there's a different correct treatment. You observe the patient's symptoms (your "data") and must choose a treatment (your "output"). If the diseases produce data that looks almost the same, then no matter how clever you are, you'll prescribe the wrong treatment a significant fraction of the time — because you simply can't tell which disease you're looking at.

In the learning context: Nature picks one of two possible distributions \(P\) or \(Q\) (the "disease"). Each distribution has a different best classifier — call them \(h_P\) and \(h_Q\) (the "correct treatment"). The learner sees \(m\) samples from the chosen distribution (the "symptoms") and must output a classifier. If the \(m\)-sample distributions \(P^m\) and \(Q^m\) look nearly identical, the learner can't tell which world it's in, and will output the wrong classifier a lot of the time.

Let's make this precise with a concrete example first.

Example (The Fundamental Coin Problem)

There are two possible coins in a box. Coin A has bias \(0.6\) (heads probability). Coin B has bias \(0.4\). Someone picks a coin uniformly at random, and you see \(m\) flips from it. You must guess which coin it is.

If you see \(m = 1\) flip: even the best strategy (guess A if heads, B if tails) has error probability \(\frac{1}{2} \cdot 0.4 + \frac{1}{2} \cdot 0.4 = 0.4\). That's almost as bad as random guessing (error = 0.5).

Why 0.4? With probability 1/2, the coin is A. You see heads with probability 0.6 (correct guess) or tails with probability 0.4 (wrong guess). With probability 1/2, the coin is B. You see tails with probability 0.6 (correct guess) or heads with probability 0.4 (wrong guess). Total error = \(\frac{1}{2}(0.4) + \frac{1}{2}(0.4) = 0.4\).

We showed earlier that \(\operatorname{TV}(\operatorname{Ber}(0.6), \operatorname{Ber}(0.4)) = 0.2\), so the Testing Bound gives: error \(\geq (1 - 0.2)/2 = 0.4\). Exact match! \(\checkmark\)

Now think of this as a learning problem: under coin A, the best "classifier" (prediction rule) is "always predict heads" (\(h_A\)); under coin B, it's "always predict tails" (\(h_B\)). The learner sees data and must pick the right prediction rule. Getting it wrong means excess error of \(2 \times 0.1 = 0.2\). Le Cam formalizes this: any learner fails on at least one of the two coins with probability at least \((1 - \operatorname{TV})/2\).

Lucien Le Cam formalized this argument in 1973 as a general-purpose tool for proving lower bounds. Here is the precise statement:

Le Cam’s Two-Point Method

Suppose:

  • There are two possible "worlds": distribution \(P\) (world 1) and distribution \(Q\) (world 2).
  • In each world, there is a correct answer: \(h_P\) in world 1, \(h_Q\) in world 2, with \(h_P \neq h_Q\).
  • A learner sees \(m\) i.i.d. samples from whichever world was chosen, and outputs a guess \(\hat{h}\).

Then:

$$ \max\!\left(\Pr_{S \sim P^m}[\hat{h}(S) \neq h_P],\; \Pr_{S \sim Q^m}[\hat{h}(S) \neq h_Q]\right) \geq \frac{1 - \operatorname{TV}(P^m, Q^m)}{2}. $$

In words: no matter what the learner does, it gets the wrong answer on at least one of the two worlds with probability at least \((1 - \operatorname{TV}(P^m, Q^m))/2\).

Proof

The proof reduces learning to testing, then applies the Testing Bound from Tool 1.

Step 1: Turn the learner into a test. The learner is a function \(\hat{h}(S)\) that takes a dataset \(S\) and outputs a classifier. Since \(h_P \neq h_Q\), the learner's output implicitly declares which world it thinks it's in:

  • If \(\hat{h}(S) = h_P\), it's effectively declaring "I'm in world 1 (distribution \(P\))."
  • If \(\hat{h}(S) \neq h_P\), it's effectively declaring "I'm in world 2 (distribution \(Q\))."

This turns the learner into a hypothesis test between \(P^m\) and \(Q^m\).

Step 2: Apply the Testing Bound. Suppose Nature picks world 1 or world 2 with equal probability \(1/2\). The learner's average error (averaged over which world was chosen) is:

$$ \frac{1}{2}\Pr_{P^m}[\hat{h} \neq h_P] + \frac{1}{2}\Pr_{Q^m}[\hat{h} \neq h_Q]. $$

But this is exactly the error of the induced hypothesis test. By the Testing Bound (Tool 1):

$$ \frac{1}{2}\Pr_{P^m}[\hat{h} \neq h_P] + \frac{1}{2}\Pr_{Q^m}[\hat{h} \neq h_Q] \geq \frac{1 - \operatorname{TV}(P^m, Q^m)}{2}. $$

Step 3: From average to max. For any two numbers \(a\) and \(b\): \(\max(a, b) \geq \frac{a+b}{2}\). So:

$$ \max\!\left(\Pr_{P^m}[\hat{h} \neq h_P],\; \Pr_{Q^m}[\hat{h} \neq h_Q]\right) \geq \frac{1}{2}\!\left(\Pr_{P^m}[\hat{h} \neq h_P] + \Pr_{Q^m}[\hat{h} \neq h_Q]\right) \geq \frac{1 - \operatorname{TV}(P^m, Q^m)}{2}. \;\square $$

Why is this powerful? It converts a learning problem (which seems complicated — infinitely many possible algorithms) into a testing problem (which information theory handles completely). The learner could be a neural network, a decision tree, or an oracle — it doesn't matter. Any learner induces a test, and no test can beat the TV bound.

How the Four Tools Chain Together

Before we dive into the lower bound proof, let's see the full chain of reasoning. Each tool feeds into the next, forming a pipeline that converts "the learning problem has structure X" into "you need at least Y samples":

The Complete Chain

  1. Construct two hard worlds — find distributions \(P\) and \(Q\) that require different classifiers but are close together.
  2. KL divergence (Tool 2) — compute how much evidence each sample provides: \(\operatorname{KL}(P \| Q)\). For coins differing by \(\epsilon\), this is \(\leq 8\epsilon^2\).
  3. Tensorization (Tool 2) — after \(m\) i.i.d. samples, the total evidence is: \(\operatorname{KL}(P^m \| Q^m) = m \cdot \operatorname{KL}(P \| Q)\).
  4. Pinsker (Tool 3) — convert evidence into distinguishability: \(\operatorname{TV}(P^m, Q^m) \leq \sqrt{m \cdot \operatorname{KL}(P\|Q)/2}\).
  5. Le Cam (Tool 4) — convert distinguishability into learning error: any learner fails on at least one world with probability \(\geq (1 - \operatorname{TV})/2\).
  6. Conclusion — if \(m\) is too small, the total evidence \(m \cdot \operatorname{KL}\) is small, so TV is small, so the error is large. No algorithm can learn.

Let's trace through the chain with concrete numbers to see it work end-to-end:

Example: The Complete Chain in Action

Can you tell if a coin has bias \(0.5 + \epsilon\) or \(0.5 - \epsilon\) from \(m\) flips? The correct predictions differ (always guess heads vs. always guess tails), so this is a learning problem with two possible worlds.

With \(\epsilon = 0.05\) and \(m = 20\) flips:

  1. Two worlds: \(P = \operatorname{Ber}(0.55)\), \(Q = \operatorname{Ber}(0.45)\).
  2. KL per sample (Tool 2): \(\operatorname{KL}(P \| Q) \leq 8(0.05)^2 = 0.02\) nats.
  3. Total KL (Tensorization): \(\operatorname{KL}(P^{20} \| Q^{20}) = 20 \times 0.02 = 0.4\) nats.
  4. TV (Pinsker): \(\operatorname{TV}(P^{20}, Q^{20}) \leq \sqrt{0.4/2} = \sqrt{0.2} \approx 0.447\).
  5. Error (Le Cam): any learner fails with probability \(\geq (1 - 0.447)/2 = 0.277\). That's a 28% failure rate — way too high for reliable learning.

How many flips do we need? For the error to drop to, say, \(\leq 0.05\), we need \((1 - \operatorname{TV})/2 \leq 0.05\), i.e., \(\operatorname{TV} \geq 0.9\). But Pinsker only gives an upper bound on TV, so this direction doesn't directly help — however, the contrapositive is what we use: if \(m\) is small enough that TV \(\leq\) some constant, then the error is bounded away from zero, proving learning is impossible with that many samples.

For TV to be at most \(1/2\): \(\sqrt{m \cdot 0.02/2} \leq 1/2\), so \(m \leq 25\). With \(m \leq 25\) flips, TV \(\leq 1/2\), so error \(\geq (1 - 1/2)/2 = 1/4\). Fewer than ~25 flips means at least 25% error rate. No algorithm can do better.

Proving the Lower Bound: \(m \geq C_1(d + \ln(1/\delta))/\epsilon^2\)

We now assemble the tools. The strategy is a multi-dimensional version of the coin example above: instead of one coin to identify, the learner must identify \(d\) coins simultaneously (one for each shattered point). This is a simplified version of Assouad's method (1983).

The two parts of the lower bound have distinct flavors:

  • Part A (\(d/\epsilon^2\) term): Identifying \(d\) independent noisy bits requires \(\Omega(d/\epsilon^2)\) samples. This uses the full Le Cam + Pinsker chain, applied \(d\) times (once per bit).
  • Part B (\(\ln(1/\delta)/\epsilon^2\) term): Even for a single bit, boosting success from constant probability to \(1 - \delta\) requires \(\Omega(\ln(1/\delta)/\epsilon^2)\) samples. This is a direct Hoeffding argument.

Theorem (Sample Complexity Lower Bound) If \(\operatorname{VCdim}(\mathcal{H}) = d \geq 1\), then for \(\epsilon\) small enough and any learning algorithm: $$ m_{\mathcal{H}}(\epsilon, \delta) \;\geq\; C_1 \frac{d + \ln(1/\delta)}{\epsilon^2} $$ where \(C_1 > 0\) is an absolute constant.

We prove the two terms separately. Part A proves the \(d/\epsilon^2\) term — the main content. Part B proves the \(\ln(1/\delta)/\epsilon^2\) term. We'll use a running concrete example with \(d = 3\) alongside the general proof so every step has a tangible instantiation.

Proof — Part A: \(m \geq \Omega(d/\epsilon^2)\)

The plan in one sentence: We design a "hard exam" where the learner must figure out \(d\) unknown bits from noisy data. We show that each bit requires \(\sim 1/\epsilon^2\) samples to identify, so the total cost is \(\sim d/\epsilon^2\).

Step 1: Find \(d\) shattered points.

Recall from Section 8 what "VC dimension \(d\)" means: there exist \(d\) points that \(\mathcal{H}\) shatters — meaning for every possible labeling of these \(d\) points with 0s and 1s, there's some hypothesis in \(\mathcal{H}\) that achieves that labeling. All \(2^d\) labelings are realizable.

Call these shattered points \(x_1, x_2, \ldots, x_d\).

Running Example (\(d = 3\))

Suppose \(\mathcal{H}\) is a class of classifiers with VC dimension 3, and it shatters three points: \(x_1, x_2, x_3\). This means for all 8 possible labelings — (0,0,0), (0,0,1), (0,1,0), …, (1,1,1) — there is some \(h \in \mathcal{H}\) that produces that labeling on these three points. For instance, \(\mathcal{H}\) might be the set of linear classifiers in \(\mathbb{R}^2\) (which has VC dimension 3), and \(x_1, x_2, x_3\) might be three non-collinear points in the plane.

Step 2: Design \(2^d\) "worlds" that the learner must distinguish.

Nature is going to pick a secret binary string \(b = (b_1, b_2, \ldots, b_d)\) — think of it as a secret answer key with \(d\) entries. Each entry is 0 or 1, so there are \(2^d\) possible answer keys. The learner's job is to figure out \(b\) from noisy data.

For each possible answer key \(b\), define a distribution \(\mathcal{D}_b\) that generates labeled data as follows:

  1. Pick a random point: Choose one of the \(d\) shattered points uniformly at random. Each point \(x_i\) is chosen with probability \(1/d\).
  2. Generate a noisy label: At point \(x_i\), the "correct" answer is \(b_i\) (the \(i\)-th entry of the secret string). But the label is noisy: with probability \(1/2 + \epsilon\), the label matches the correct answer (\(Y = b_i\)), and with probability \(1/2 - \epsilon\), it's the wrong answer (\(Y = 1 - b_i\)).

Each draw from \(\mathcal{D}_b\) produces a pair \((x_i, Y)\): a random point and a noisy label. The learner collects \(m\) such pairs and must figure out the secret string \(b\).

Running Example (\(d = 3\), \(\epsilon = 0.05\))

Suppose the secret string is \(b = (1, 0, 1)\). Then the distribution \(\mathcal{D}_{(1,0,1)}\) works like this:

  • With probability 1/3, you see point \(x_1\). The label is \(Y = 1\) (the correct answer for position 1) with probability \(0.55\), and \(Y = 0\) with probability \(0.45\).
  • With probability 1/3, you see point \(x_2\). The label is \(Y = 0\) (the correct answer for position 2) with probability \(0.55\), and \(Y = 1\) with probability \(0.45\).
  • With probability 1/3, you see point \(x_3\). The label is \(Y = 1\) (the correct answer for position 3) with probability \(0.55\), and \(Y = 0\) with probability \(0.45\).

A possible training set of \(m = 9\) samples might look like:

\((x_2, 0),\; (x_1, 1),\; (x_3, 0),\; (x_1, 1),\; (x_2, 1),\; (x_3, 1),\; (x_2, 0),\; (x_1, 0),\; (x_3, 1)\).

For point \(x_1\), the learner sees labels (1, 1, 0) — two 1s and one 0. Since the secret answer for \(x_1\) is 1, and 1 appears with probability 0.55, this is consistent. But with only 3 observations of \(x_1\), it's hard to be sure whether the bias is 0.55 (secret = 1) or 0.45 (secret = 0).

Step 3: The best classifier for each world, and why getting it wrong is costly.

Under distribution \(\mathcal{D}_b\), the best possible classifier is the one that predicts the more likely label at each point: predict \(b_i\) at point \(x_i\). Call this classifier \(h_b\). Since \(\mathcal{H}\) shatters \(\{x_1, \ldots, x_d\}\), the labeling \(h_b\) (which assigns \(b_i\) to \(x_i\)) is achievable by some hypothesis in \(\mathcal{H}\) — that's exactly what shattering means.

What is the error rate of \(h_b\)? At each point \(x_i\), it predicts the correct answer \(b_i\), and the label disagrees with probability \(1/2 - \epsilon\) (the noise rate). Since each point is equally likely:

$$ L_{\mathcal{D}_b}(h_b) = \frac{1}{2} - \epsilon. $$

Now suppose the learner outputs a classifier \(\hat{h}\) that gets \(k\) of the \(d\) positions wrong (i.e., \(\hat{h}(x_i) \neq b_i\) for \(k\) of the points). At a point where \(\hat{h}\) predicts correctly, its error rate is \(1/2 - \epsilon\) (same as optimal). At a point where \(\hat{h}\) predicts incorrectly, it's guessing the less likely label, so its error rate is \(1/2 + \epsilon\). The overall error is:

$$ L_{\mathcal{D}_b}(\hat{h}) = \frac{1}{d}\Big[\underbrace{(d - k)(1/2 - \epsilon)}_{\text{correct positions}} + \underbrace{k(1/2 + \epsilon)}_{\text{wrong positions}}\Big] = \left(\frac{1}{2} - \epsilon\right) + \frac{2k\epsilon}{d}. $$

The excess error compared to the best classifier is:

$$ L_{\mathcal{D}_b}(\hat{h}) - L_{\mathcal{D}_b}(h_b) = \frac{2k\epsilon}{d}. $$

For the learner to achieve excess error at most \(\epsilon\), we need \(2k\epsilon/d \leq \epsilon\), i.e., \(k \leq d/2\). The learner must correctly identify at least half of the \(d\) secret bits.

Running Example (\(d = 3\))

With secret \(b = (1, 0, 1)\) and \(\epsilon = 0.05\):

  • The optimal classifier \(h_b\) predicts (1, 0, 1) with error rate \(0.45\).
  • A classifier that gets 1 of 3 positions wrong (say it predicts (1, 0, 0) instead of (1, 0, 1)) has error \(0.45 + 2(1)(0.05)/3 = 0.45 + 0.033 = 0.483\). Excess error = 0.033, which is \(\leq \epsilon = 0.05\). OK.
  • A classifier that gets 2 of 3 positions wrong (say it predicts (0, 0, 0)) has error \(0.45 + 2(2)(0.05)/3 = 0.45 + 0.067 = 0.517\). Excess error = 0.067, which is \(> \epsilon = 0.05\). Not OK.

So with \(d = 3\), the learner can afford to get at most 1 position wrong (since \(d/2 = 1.5\), so \(k \leq 1\)). It must correctly identify at least 2 of the 3 secret bits.

Step 4: Why identifying each bit is a coin-flipping problem.

Focus on a single position, say position \(i\). Out of \(m\) total training samples, about \(m/d\) of them land on point \(x_i\) (since each sample picks one of the \(d\) points uniformly at random). The labels of these samples are independent coin flips:

  • If \(b_i = 1\) (the secret bit is 1): each label at \(x_i\) is 1 with probability \(1/2 + \epsilon\) and 0 with probability \(1/2 - \epsilon\). This is a coin with bias \(1/2 + \epsilon\).
  • If \(b_i = 0\) (the secret bit is 0): each label at \(x_i\) is 0 with probability \(1/2 + \epsilon\) and 1 with probability \(1/2 - \epsilon\). This is a coin with bias \(1/2 - \epsilon\) (for showing 1).

So to figure out bit \(i\), the learner has \(\sim m/d\) coin flips and must determine whether the coin has bias \(1/2 + \epsilon\) or \(1/2 - \epsilon\). This is exactly the coin-testing problem from the "Complete Chain in Action" example above.

Running Example (\(d = 3\), \(m = 9\))

With \(m = 9\) samples and \(d = 3\) points, about \(9/3 = 3\) samples land on each point. For point \(x_1\) with secret bit \(b_1 = 1\), the learner sees ~3 labels, each independently 1 with probability 0.55 and 0 with probability 0.45.

With only 3 coin flips of a 55%-vs-45% coin, distinguishing it from a 45%-vs-55% coin is very hard. You might see (1, 1, 0) — is that 55% heads or 45% heads? Both are very plausible.

Step 5: Measure how much evidence the data provides (KL divergence).

Now we apply the tools from earlier. Consider two specific worlds that differ only in position \(i\): secret strings \(b\) and \(b'\) that are identical except \(b_i \neq b'_i\). (For instance, \(b = (1,0,1)\) and \(b' = (0,0,1)\), differing only in position 1.)

When the learner sees a sample that lands on some point \(x_j\) with \(j \neq i\), both worlds produce exactly the same distribution of labels (since \(b_j = b'_j\)). That sample provides zero evidence about whether bit \(i\) is 0 or 1.

Only the samples that land on point \(x_i\) provide evidence. Each such sample is a coin flip from either \(\operatorname{Ber}(1/2 + \epsilon)\) (if \(b_i = 1\)) or \(\operatorname{Ber}(1/2 - \epsilon)\) (if \(b_i = 0\)). From Tool 2 (the Small-Bias Bernoulli KL lemma), each such flip provides at most \(8\epsilon^2\) nats of evidence.

How many samples land on \(x_i\)? About \(m/d\) (since each of the \(m\) samples picks point \(x_i\) with probability \(1/d\)). By the tensorization property of KL divergence (Tool 2: evidence from independent samples adds up), the total evidence for bit \(i\) after seeing all \(m\) samples is:

$$ \operatorname{KL}(\mathcal{D}_b^m \| \mathcal{D}_{b'}^m) \leq \frac{m}{d} \cdot 8\epsilon^2 = \frac{8m\epsilon^2}{d} \text{ nats}. $$

Running Example (\(d = 3\), \(m = 9\), \(\epsilon = 0.05\))

Evidence for each bit: \(\frac{8 \times 9 \times 0.0025}{3} = \frac{0.18}{3} = 0.06\) nats.

That's very little evidence. Recall from Tool 2 that you need roughly 1 nat of total evidence to reliably distinguish two distributions. With only 0.06 nats, the learner has almost no chance of correctly identifying this bit.

Step 6: Convert evidence to distinguishability (Pinsker).

From Tool 3 (Pinsker's inequality): total variation \(\leq \sqrt{\operatorname{KL}/2}\). Applying this to the full \(m\)-sample distributions for two worlds differing in bit \(i\):

$$ \operatorname{TV}(\mathcal{D}_b^m, \mathcal{D}_{b'}^m) \leq \sqrt{\frac{8m\epsilon^2/d}{2}} = \sqrt{\frac{4m\epsilon^2}{d}} = \frac{2\epsilon\sqrt{m}}{\sqrt{d}}. $$

Recall what total variation means (Tool 1): it's the largest possible difference in probability of any event between the two distributions. If TV is 0.1, then no event — no pattern in the data, no matter how cleverly chosen — has probability differing by more than 0.1 between the two worlds. The two worlds produce almost indistinguishable data.

Running Example (\(d = 3\), \(m = 9\), \(\epsilon = 0.05\)) $$ \operatorname{TV} \leq \frac{2 \times 0.05 \times 3}{\sqrt{3}} = \frac{0.3}{1.73} = 0.173. $$

The two worlds (differing in one bit) produce datasets that are at most 17.3% different in probability for any event. Very hard to tell apart.

Step 7: Convert distinguishability to error (Le Cam).

From Tool 4 (Le Cam's method): any algorithm that tries to determine bit \(i\) — i.e., to determine which of two worlds \(\mathcal{D}_b\) or \(\mathcal{D}_{b'}\) generated the data — has error probability at least \((1 - \operatorname{TV})/2\). Plugging in our TV bound:

$$ \Pr[\text{wrong about bit } i] \geq \frac{1 - \frac{2\epsilon\sqrt{m}}{\sqrt{d}}}{2}. $$

This holds for any algorithm — neural networks, decision trees, anything. It's an information-theoretic limit: the data simply doesn't contain enough evidence.

Running Example (\(d = 3\), \(m = 9\), \(\epsilon = 0.05\)) $$ \Pr[\text{wrong about bit } i] \geq \frac{1 - 0.173}{2} = \frac{0.827}{2} = 0.414. $$

The learner has at least a 41.4% chance of getting any given bit wrong. With 3 bits to identify and a 41.4% error rate per bit, the learner will make a lot of mistakes.

Step 8: Compute the learner's expected excess error.

Imagine Nature picks the secret string \(b\) uniformly at random from all \(2^d\) possibilities. We'll compute the learner's expected excess error, averaged over (a) Nature's random choice of \(b\) and (b) the random training data \(S\).

Let \(W\) = number of positions where the learner's classifier disagrees with the optimal classifier \(h_b\). From Step 7, for each position \(i\), the learner gets bit \(i\) wrong with probability at least \(p_0 = (1 - 2\epsilon\sqrt{m/d})/2\). By linearity of expectation (the expected value of a sum equals the sum of expected values, even when the summands are not independent):

$$ \mathbb{E}_{b,S}[W] \geq d \cdot p_0 = d \cdot \frac{1 - \frac{2\epsilon\sqrt{m}}{\sqrt{d}}}{2}. $$

From Step 3, the excess error equals \(2\epsilon W / d\). So the expected excess error is:

$$ \mathbb{E}_{b,S}[\text{excess error}] = \frac{2\epsilon}{d} \cdot \mathbb{E}[W] \geq \frac{2\epsilon}{d} \cdot d \cdot p_0 = 2\epsilon \cdot p_0 = \epsilon\!\left(1 - \frac{2\epsilon\sqrt{m}}{\sqrt{d}}\right). $$

For \(m < d/(16\epsilon^2)\): the quantity \(2\epsilon\sqrt{m/d} < 2\epsilon \cdot 1/(4\epsilon) = 1/2\), so:

$$ \mathbb{E}_{b,S}[\text{excess error}] > \epsilon \cdot (1 - 1/2) = \frac{\epsilon}{2}. $$

What does this mean? Averaged over Nature's random choice of secret and the random training data, the learner's excess error exceeds \(\epsilon/2\). No matter how clever the algorithm is, the average performance is bad.

Running Example (\(d = 3\), \(m = 9\), \(\epsilon = 0.05\))

\(2\epsilon\sqrt{m/d} = 0.1 \cdot \sqrt{3} = 0.173\). So \(p_0 = (1 - 0.173)/2 = 0.414\).

Expected excess error \(\geq 2 \times 0.05 \times 0.414 = 0.041\). Compare to the accuracy target \(\epsilon = 0.05\): the learner's average excess error is 82% of the target. That's terrible — on average, the learner nearly exhausts its entire error budget.

Step 9: Find a specific hard distribution.

Step 8 showed that the average excess error over random \(b\) exceeds \(\epsilon/2\). But the learner needs to work for every distribution, not just on average. We now show there's a specific secret \(b^*\) that's hard.

The averaging argument is simple: if the average of a collection of numbers exceeds \(\epsilon/2\), then at least one of those numbers exceeds \(\epsilon/2\). (If every number were \(\leq \epsilon/2\), the average couldn't exceed \(\epsilon/2\).) So there exists a specific secret string \(b^*\) such that:

$$ \mathbb{E}_{S}[\text{excess error under } \mathcal{D}_{b^*}] > \frac{\epsilon}{2}. $$

For this specific distribution \(\mathcal{D}_{b^*}\), the learner's expected excess error exceeds \(\epsilon/2\), no matter which algorithm is used.

Step 10: Convert expected error into a failure probability.

We know: for distribution \(\mathcal{D}_{b^*}\), the learner's excess error (call it \(X\)) is a random variable (depending on the random training set \(S\)) with:

  • \(\mathbb{E}[X] > \epsilon/2\) (from Step 9).
  • \(X \geq 0\) (excess error is non-negative).
  • \(X \leq 2\epsilon\) (the worst case is getting all \(d\) bits wrong, giving excess error \(2d\epsilon/d = 2\epsilon\)).

We want to show that \(X\) is large with noticeable probability, not just on average. The tool is a truncated expectation bound: split the expectation based on whether \(X\) is above or below a threshold \(t\).

$$ \mathbb{E}[X] = \mathbb{E}[X \cdot \mathbf{1}_{X \leq t}] + \mathbb{E}[X \cdot \mathbf{1}_{X > t}]. $$

The first term is at most \(t \cdot 1 = t\) (since \(X \leq t\) when the indicator is 1, and the indicator has probability at most 1). The second term is at most \(2\epsilon \cdot \Pr[X > t]\) (since \(X \leq 2\epsilon\) always). So:

$$ \mathbb{E}[X] \leq t + 2\epsilon \cdot \Pr[X > t]. $$

Rearranging:

$$ \Pr[X > t] \geq \frac{\mathbb{E}[X] - t}{2\epsilon}. $$

Choosing the threshold \(t\). The bound \(\Pr[X > t] \geq (\mathbb{E}[X] - t)/(2\epsilon)\) works for any \(t\), but we need \(t < \mathbb{E}[X]\) (otherwise the right side is zero or negative and the bound says nothing). Since \(\mathbb{E}[X] > \epsilon/2\), any \(t < \epsilon/2\) works. The choice of \(t\) involves a tradeoff:

  • Smaller \(t\): the numerator \(\mathbb{E}[X] - t\) is larger, giving a stronger probability bound, but the accuracy guarantee ("\(X > t\)") is weaker.
  • Larger \(t\) (closer to \(\epsilon/2\)): the accuracy guarantee is stronger, but the probability bound shrinks toward zero.

We pick \(t = \epsilon/4\) — the midpoint of \([0, \epsilon/2]\) — to balance these. Any choice in the range \((0, \epsilon/2)\) would give the same \(\Omega(d/\epsilon^2)\) scaling; only the constant \(C_1\) changes.

With \(t = \epsilon/4\) and \(\mathbb{E}[X] > \epsilon/2\):

$$ \Pr[\text{excess error} > \epsilon/4] \geq \frac{\epsilon/2 - \epsilon/4}{2\epsilon} = \frac{\epsilon/4}{2\epsilon} = \frac{1}{8}. $$

In words: the learner's excess error exceeds \(\epsilon/4\) with probability at least \(1/8\). An algorithm that aims for excess error at most \(\epsilon/4\) with 87.5% success cannot exist when \(m < d/(16\epsilon^2)\).

Running Example (\(d = 3\), \(m = 9\), \(\epsilon = 0.05\))

\(\mathbb{E}[X] > 0.041\), \(t = \epsilon/4 = 0.0125\), max \(X = 2\epsilon = 0.1\).

$$ \Pr[X > 0.0125] \geq \frac{0.041 - 0.0125}{0.1} = \frac{0.0285}{0.1} = 0.285. $$

The learner has at least a 28.5% chance of excess error above 0.0125 — with only 9 samples, reliable learning of 3 noisy bits is impossible. \(\checkmark\)

Step 11: State the lower bound (absorbing constants).

We showed: if \(m < d/(16\epsilon^2)\), then for some distribution \(\mathcal{D}_{b^*}\), any learner has excess error \(> \epsilon/4\) with probability \(\geq 1/8\).

This is a lower bound on \(m_{\mathcal{H}}(\epsilon/4, \, 7/8)\) — the sample complexity for accuracy \(\epsilon/4\) and confidence \(7/8\). To express it in terms of general \(\epsilon\), substitute \(\epsilon' = \epsilon/4\) (so \(\epsilon = 4\epsilon'\)):

$$ m_{\mathcal{H}}(\epsilon', 7/8) \geq \frac{d}{16 \cdot (4\epsilon')^2} = \frac{d}{256\,\epsilon'^2}. $$

Dropping the prime and writing \(C_1 = 1/256\):

$$ m_{\mathcal{H}}(\epsilon, \delta) \geq \frac{d}{256\,\epsilon^2} \quad \text{for } \delta \leq 7/8. $$

The constant 256 can be improved to 64 with a tighter analysis (using exact Le Cam bounds instead of Pinsker's slightly loose upper bound on TV). The key point is the scaling: the lower bound is \(\Omega(d/\epsilon^2)\), matching the upper bound from Section 10 up to constants and the \(\ln(1/\epsilon)\) factor. \(\square\) (Part A)

Proof — Part B: \(m \geq \ln(1/(2\delta))/(2\epsilon^2)\)

What Part B says: Even if the hypothesis class is trivially simple (\(d = 1\), just a single threshold), achieving high confidence (small failure probability \(\delta\)) requires \(\Omega(\ln(1/\delta)/\epsilon^2)\) samples. Part A showed that learning \(d\) bits is hard; Part B shows that learning even one bit reliably is hard.

Setup: Take a single shattered point \(x_1\) and consider just two worlds:

  • World 0: the label at \(x_1\) is 0 with probability \(1/2 + \epsilon\) and 1 with probability \(1/2 - \epsilon\). The best classifier predicts \(\hat{h}(x_1) = 0\).
  • World 1: the label at \(x_1\) is 1 with probability \(1/2 + \epsilon\) and 0 with probability \(1/2 - \epsilon\). The best classifier predicts \(\hat{h}(x_1) = 1\).

All \(m\) training samples are at \(x_1\) (it's the only point), so the learner sees a sequence of labels \(Y_1, Y_2, \ldots, Y_m\). In World 0, each \(Y_i\) is 0 with probability \(0.5 + \epsilon\). In World 1, each \(Y_i\) is 1 with probability \(0.5 + \epsilon\). The learner must determine which world it's in.

The natural test: Compute the fraction of 1s: \(\bar{Y} = \frac{1}{m}\sum_{i=1}^m Y_i\). In World 0, \(\mathbb{E}[\bar{Y}] = 1/2 - \epsilon\). In World 1, \(\mathbb{E}[\bar{Y}] = 1/2 + \epsilon\). So the learner should guess World 0 if \(\bar{Y} < 1/2\) and World 1 if \(\bar{Y} > 1/2\).

When does this test fail? The test fails when \(\bar{Y}\) lands on the wrong side of \(1/2\). By Hoeffding's inequality (Section 4: for bounded i.i.d. random variables with \(Y_i \in [0,1]\), \(\Pr[|\bar{Y} - \mathbb{E}[\bar{Y}]| \geq t] \leq 2e^{-2mt^2}\)):

$$ \Pr[\text{test fails}] = \Pr\!\left[|\bar{Y} - \mathbb{E}[\bar{Y}]| \geq \epsilon\right] \leq 2e^{-2m\epsilon^2}. $$

For this failure probability to be at most \(\delta\), we need:

$$ 2e^{-2m\epsilon^2} \leq \delta \quad \Longrightarrow \quad e^{-2m\epsilon^2} \leq \frac{\delta}{2} \quad \Longrightarrow \quad 2m\epsilon^2 \geq \ln\!\frac{2}{\delta} = \ln\frac{1}{(\delta/2)} \quad \Longrightarrow \quad m \geq \frac{\ln(2/\delta)}{2\epsilon^2} = \frac{\ln(1/(2\delta)) + \ln 2}{2\epsilon^2}. $$

Since \(\ln 2 > 0\), this gives \(m \geq \ln(1/(2\delta))/(2\epsilon^2)\). (We can also write it as \(m \geq \ln(2/\delta)/(2\epsilon^2)\) — the difference is just a small constant.)

Why this is a lower bound, not just an analysis of one test: The Hoeffding test (compare \(\bar{Y}\) to \(1/2\)) is actually the optimal test in this setting. Any other algorithm sees the same \(m\) coin flips and must determine the bias. The sufficient statistic is \(\sum Y_i\) (by the Neyman–Pearson lemma, the likelihood ratio test is optimal, and for Bernoulli data the likelihood ratio is a monotone function of \(\sum Y_i\)). So the \(2e^{-2m\epsilon^2}\) bound on the best test's failure rate is tight up to constants, and no algorithm can achieve failure probability \(\leq \delta\) with fewer than \(\Omega(\ln(1/\delta)/\epsilon^2)\) samples.

Numerical Example

To distinguish a 55%-heads coin from a 45%-heads coin (\(\epsilon = 0.05\)) with 95% confidence (\(\delta = 0.05\)):

$$ m \geq \frac{\ln(2/0.05)}{2 \times 0.0025} = \frac{\ln(40)}{0.005} = \frac{3.69}{0.005} = 738 \text{ samples}. $$

With 99% confidence (\(\delta = 0.01\)):

$$ m \geq \frac{\ln(200)}{0.005} = \frac{5.30}{0.005} = 1{,}060 \text{ samples}. $$

Going from 95% to 99% confidence costs only \(\sim 1.4\times\) more samples (since \(\ln(200)/\ln(40) \approx 1.44\)). Confidence is cheap.

\(\square\) (Part B)

Combining Parts A and B

The two requirements are independent bottlenecks: Part A says you need \(\Omega(d/\epsilon^2)\) samples to handle the complexity of the hypothesis class (\(d\) noisy bits to identify), and Part B says you need \(\Omega(\ln(1/\delta)/\epsilon^2)\) samples to guarantee high confidence (even for a single bit). Both must be satisfied simultaneously, so we add them:

$$ m_{\mathcal{H}}(\epsilon, \delta) \geq C_1 \frac{d + \ln(1/\delta)}{\epsilon^2} $$

where \(C_1\) is an absolute constant (from Part A's \(1/256\) and Part B's \(1/2\), the binding constant is the smaller one). The exact value of \(C_1\) depends on the tightness of the intermediate bounds (Pinsker, the small-bias KL lemma, the truncated Markov step), but the scaling \(\Omega((d + \ln(1/\delta))/\epsilon^2)\) is what matters — it matches the upper bound from Section 10 up to constants and the mild \(\ln(1/\epsilon)\) factor. \(\square\)

Numerical Check (Full Lower Bound)

For linear classifiers in \(\mathbb{R}^{10}\) (\(d = 11\)) with \(\epsilon = 0.05\) and \(\delta = 0.05\), using the explicit constants from Parts A and B:

  • Part A: \(m \geq 11/(256 \times 0.0025) = 11/0.64 \approx 17\) samples (for the 11 bits).
  • Part B: \(m \geq \ln(40)/(2 \times 0.0025) = 3.69/0.005 = 738\) samples (for the 95% confidence).
  • Total lower bound: \(m \geq 17 + 738 = 755\) samples.
  • Upper bound (from Section 10): \(m \leq \sim 2{,}800\) samples.

The lower bound (755) and upper bound (2,800) are within a factor of ~4. Part B (confidence) dominates for small \(d\). For higher-dimensional problems (large \(d\)), Part A (complexity) dominates instead. With tighter constants (sharper Pinsker, exact Le Cam), the gap narrows further. The key point: the \(d/\epsilon^2\) and \(\ln(1/\delta)/\epsilon^2\) scaling is correct on both sides. \(\checkmark\)

What the Lower Bound Tells Us

The proof reveals why the sample complexity is \(\Theta(d/\epsilon^2)\). There are two completely different bottlenecks, both unavoidable:

  • Complexity bottleneck (\(d/\epsilon^2\)): The shattered set provides \(d\) points where the correct label could go either way. Each point is a noisy coin whose bias (which side of 50%) the learner must determine. Each coin requires \(\sim 1/\epsilon^2\) flips, and the \(d\) coins share the \(m\) samples, so the total cost is \(d/\epsilon^2\).
  • Confidence bottleneck (\(\ln(1/\delta)/\epsilon^2\)): Even a single coin requires \(\ln(1/\delta)/\epsilon^2\) flips to achieve failure probability \(\delta\). This cost doesn't grow with \(d\) — it's the price of being confident about any one decision.

No algorithm can cheat past either bottleneck. The upper bound from Section 10 was not wasteful — it captured the right scaling.

Reading the Formula

  • \(d/\epsilon^2\): VC dimension \(d\) sets the complexity cost. Each shattered point is one "degree of freedom" that the learner must pin down, and each costs \(\sim 1/\epsilon^2\) samples because the signal is only \(\epsilon\)-strong.
  • \(\ln(1/\epsilon)\) in the upper bound: A mild logarithmic overhead from converting the VC bound's polynomial-times-exponential form into a clean sample complexity. The upper and lower bounds nearly match — the gap is just this \(\ln(1/\epsilon)\) factor.
  • \(\ln(1/\delta)/\epsilon^2\): Higher confidence costs only logarithmically more data. Going from 95% to 99.99% confidence barely changes the sample size (the ratio is \(\ln(1/0.0001)/\ln(1/0.05) = 9.21/3.00 \approx 3\times\)).

The Message VC dimension is a complete characterization of learnability for binary classification. It's not just sufficient — it's necessary. A class is learnable if and only if it can't shatter arbitrarily large sets. And all four notions — uniform convergence, agnostic PAC, realizable PAC, and finite VC dimension — are the same thing. You can't have one without the others.

Sample Complexities: Worked Computations

Let's use the upper bound from Section 10 to compute concrete sample sizes. We use the simplified form derived there:

$$ m \;\approx\; \frac{d + \ln(1/\delta)}{2\epsilon^2} $$

(this absorbs the \(\ln(1/\epsilon)\) factor and constants into an approximate rule of thumb). For all rows below, we set \(\epsilon = 0.05\) (training error within 5% of true error) and \(\delta = 0.05\) (95% confidence), so \(\ln(1/\delta) = \ln(20) \approx 3.0\).

Sample Complexities

Hypothesis classVC dimension \(d\)ComputationApproximate \(m\)
Thresholds on \(\mathbb{R}\) 1 \(\frac{1 + 3.0}{2 \times 0.0025} = \frac{4.0}{0.005}\) \(\sim 800\)
Intervals on \(\mathbb{R}\) 2 \(\frac{2 + 3.0}{0.005} = \frac{5.0}{0.005}\) \(\sim 1{,}000\)
Linear classifiers in \(\mathbb{R}^{10}\) 11 \(\frac{11 + 3.0}{0.005} = \frac{14.0}{0.005}\) \(\sim 2{,}800\)
Linear classifiers in \(\mathbb{R}^{100}\) 101 \(\frac{101 + 3.0}{0.005} = \frac{104}{0.005}\) \(\sim 20{,}800\)

These are computed from the simplified VC upper bound. The exact numbers depend on which constant \(C_2\) you use — different formulations of the VC bound give slightly different constants. The key pattern is that \(m\) scales linearly with \(d\): doubling the VC dimension roughly doubles the sample requirement.

Reading the Table

Take the row for linear classifiers in \(\mathbb{R}^{100}\): VC dimension \(d = 101\) (we proved this in Section 7: halfspaces in \(\mathbb{R}^d\) have \(\operatorname{VCdim} = d + 1\)). With ~20,800 samples:

  • Uniform convergence: For every linear classifier, training error is within 5% of true error.
  • ERM works: The classifier with the lowest training error has true error within 10% (\(2\epsilon\)) of the best possible linear classifier.
  • Agnostic guarantee: This holds even if no linear classifier fits the data well. If the best linear classifier has 20% error, ERM finds one with at most 30% error.

Note that 20,800 samples in 100 dimensions is far fewer than the number of possible linear classifiers (which is uncountably infinite). This is the power of VC theory: the sample complexity depends on VC dimension, not on the size of the class.

What We Built

We set out to answer one question: when does learning from data actually work? Twelve sections later, we have a complete answer — the Fundamental Theorem of Statistical Learning — proved from nothing but the definition of probability. Here is everything we built.

The Foundation (Sections 1–3)

We started with the language: a domain \(\mathcal{X}\), labels \(\mathcal{Y}\), an unknown distribution \(\mathcal{D}\), a hypothesis class \(\mathcal{H}\), a training set \(S\), and two notions of error — true risk \(L_{\mathcal{D}}(h)\) (how often \(h\) is wrong on fresh data) and empirical risk \(L_S(h)\) (how often \(h\) is wrong on training data). We defined PAC learning: with enough samples, can we find a hypothesis whose true error is close to the best in \(\mathcal{H}\)?

Then we forged our first tools. Markov’s inequality came from one line of algebra: a non-negative random variable can’t be large too often, because that would make its average too large. Hoeffding’s inequality required a beautiful trick — the Chernoff method: exponentiate both sides (turning a probability into an MGF), factor by independence, bound each factor using the convexity of \(e^{sx}\), and optimize the free parameter. The result: sample averages concentrate exponentially around their means.

The First Bounds (Sections 4–5)

With Hoeffding in hand, we bounded the generalization gap for a single hypothesis, then extended to finite hypothesis classes via the union bound. This was our first “learning works” theorem: with \(m = O(\log|\mathcal{H}|/\epsilon^2)\) samples, ERM finds a near-optimal hypothesis. But it only works for finite classes — and most interesting classes (linear classifiers, neural networks, decision trees) are infinite.

VC Dimension and Sauer–Shelah (Sections 6–7)

To handle infinite classes, we needed a finer notion of complexity. VC dimension measures the largest set a class can shatter — assign every possible labeling to. We computed it for thresholds (\(d = 1\)), intervals (\(d = 2\)), and linear classifiers in \(\mathbb{R}^n\) (\(d = n + 1\)). The growth function \(\Pi_{\mathcal{H}}(m)\) counts the maximum number of distinct labeling patterns on \(m\) points.

Then came the Sauer–Shelah lemma: if \(\operatorname{VCdim}(\mathcal{H}) = d\), then \(\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^d \binom{m}{i}\), which is at most \((em/d)^d\). This is the key surprise: even if a class contains uncountably many hypotheses, if it can’t shatter large sets, its effective complexity on any finite sample is polynomial, not exponential. The shattering bottleneck forces structure.

Symmetrization and the Upper Bound (Section 8)

The ghost sample trick — introducing an imaginary second sample \(S'\) drawn from the same distribution — converted the hard problem “does \(L_S(h)\) approximate \(L_{\mathcal{D}}(h)\)?” into the easier problem “does \(L_S(h)\) approximate \(L_{S'}(h)\)?” The second problem is a finite, combinatorial question: there are only \(\Pi_{\mathcal{H}}(2m)\) distinct labeling patterns to worry about, not infinitely many hypotheses. Combined with Sauer–Shelah and Hoeffding, this gave us uniform convergence and the forward direction of the Fundamental Theorem: finite VC dimension implies agnostic PAC learnability via ERM.

The Converse: Information-Theoretic Lower Bounds (Sections 9–12)

The upper bound says “finite VC dimension is sufficient for learning.” But is it necessary? Could a clever algorithm learn even when VC dimension is infinite? Sections 9–12 prove the answer is no — building an entirely different toolkit.

No-Free-Lunch (Section 9) showed that for any learner and any sample size, there exists a distribution that defeats it — when the hypothesis class is too expressive. This was the qualitative converse.

To make it quantitative, we built four information-theoretic tools from scratch (Sections 10–11):

  • Total variation distance: the maximum difference in probability between two distributions over any event. Directly controls how often an algorithm gives different answers under different distributions.
  • KL divergence: the expected log-likelihood ratio. Measures how much evidence a single observation provides for distinguishing two hypotheses — rebuilt from the detective-and-two-suspects perspective, not hand-waved with coding theory.
  • Pinsker’s inequality: converts KL divergence (easy to compute for product distributions) into total variation (directly useful for lower bounds). Proved via critical-point analysis.
  • Le Cam’s two-point method: if two “worlds” produce data with low KL divergence, no algorithm can reliably tell them apart, so any learner must err in at least one world.

Finally, Section 12 wired everything together. For a class with VC dimension \(d\), we constructed \(2^d\) carefully chosen distributions (one per bit of a shattering pattern), showed that any learner with fewer than \(\Omega(d/\epsilon^2)\) samples can’t reliably identify all \(d\) bits, and proved the matching lower bound. The argument used an averaging trick (there exists a hard instance because the average instance is hard), the truncated expectation bound (converting expected error into a failure probability), and constant absorption. The result: sample complexity is \(\Theta(d/\epsilon^2)\), matching the upper bound up to constants.

The Fundamental Theorem of Statistical Learning
A hypothesis class \(\mathcal{H}\) is agnostic PAC learnable if and only if \(\operatorname{VCdim}(\mathcal{H}) < \infty\).
The sample complexity is \(\Theta\!\left(\frac{d + \ln(1/\delta)}{\epsilon^2}\right)\), where \(d = \operatorname{VCdim}(\mathcal{H})\).

What Comes Next

The Fundamental Theorem gives a binary verdict: finite VC dimension means learnable, infinite means not. But it doesn’t tell us how fast a specific class generalizes on specific data. VC dimension is a worst-case, distribution-free quantity — it doesn’t know whether your data is “easy” or “hard.”

Part 2 introduces Rademacher complexity, a probabilistic refinement that answers a different question: how well can the hypothesis class correlate with pure random noise? Unlike VC dimension, Rademacher complexity is:

  • Data-dependent: it adapts to the actual training set, not just worst-case combinatorics.
  • Continuous: it gives a graded measure of complexity, not just a binary finite-or-infinite verdict.

Part 2 builds McDiarmid’s inequality (concentration for functions of independent variables, generalizing Hoeffding), proves the Rademacher generalization bound (using symmetrization and McDiarmid), derives Massart’s lemma (connecting counting to Rademacher bounds), and establishes the VC–Rademacher connection: that \(\mathcal{R}_m(\mathcal{H}) \leq \sqrt{2d\ln(em/d)/m}\).

And it ends with the Beautiful Inversion: the full chain — VC dimension \(\to\) growth function \(\to\) Rademacher complexity \(\to\) generalization bound \(\to\) learning — with the No-Free-Lunch theorem closing the loop. Each arrow crosses a conceptual boundary: combinatorics to counting to probability to statistics to prediction. The entire chain is an if-and-only-if, and the picture it draws is one of the most elegant in all of machine learning theory.

Appendix: Proof Details and Worked Examples

A.1 Why \(e^{sx}\) lies below the chord (Hoeffding’s Lemma, Step 1)

For \(x \in [a, b]\), the chord connecting \((a, e^{sa})\) and \((b, e^{sb})\) is:

$$ \ell(x) = e^{sa} + \frac{e^{sb} - e^{sa}}{b - a}(x - a) = \frac{b - x}{b - a}\, e^{sa} + \frac{x - a}{b - a}\, e^{sb}. $$

Since \(f(x) = e^{sx}\) is convex (\(f''(x) = s^2 e^{sx} > 0\)), we have \(f(x) \leq \ell(x)\) for all \(x \in [a,b]\). This is the definition of convexity: the function lies below any chord.

Numerical check: Let \(s = 1\), \(a = -2\), \(b = 3\), \(x = 0\).

  • \(e^{sx} = e^0 = 1\).
  • Chord: \(\frac{3-0}{5}e^{-2} + \frac{0-(-2)}{5}e^3 = 0.6 \cdot 0.135 + 0.4 \cdot 20.09 = 0.081 + 8.036 = 8.117\).

Indeed \(1 \leq 8.117\). \(\checkmark\)

A.2 Verifying the Sauer–Shelah Lemma for Intervals

\(\mathcal{H}\) = intervals on \(\mathbb{R}\), \(\operatorname{VCdim} = 2\). For \(m = 5\):

$$ \Phi(5, 2) = \binom{5}{0} + \binom{5}{1} + \binom{5}{2} = 1 + 5 + 10 = 16. $$

Direct count: on 5 ordered points \(x_1 < \cdots < x_5\), an interval \([a,b]\) labels a contiguous block as 1 and everything else as 0. The number of contiguous blocks (including the empty block) on 5 points is \(\binom{5}{2} + 5 + 1 = 16\) (choose two endpoints among the 5 points, or a single-point interval, or the empty interval).

Exact match: \(|\mathcal{H}_C| = 16 = \Phi(5, 2)\). The Sauer–Shelah bound is tight here. \(\checkmark\)