Bayes optimal classifier regions and k-NN decision boundaries in high dimensions

Abstract#

The Bayes optimal classifier for discrete and continuous distributions, k-NN loss bounds, a curse of dimensionality experiment with high-dimensional Gaussians, and VC dimension analysis for disks and axis-aligned rectangles.

This was the first homework for a graduate statistical machine learning course. The problems covered the theoretical foundations that the rest of the course would build on: what does the best possible classifier look like, how does k-NN compare to it, what goes wrong in high dimensions, and how do you measure the capacity of a hypothesis class. Most of this is pen-and-paper math with a couple of coding experiments.

Bayes optimal classifier#

The setup: a feature space with three possible events A, B, C occurring with probabilities P(A) = 0.1, P(B) = 0.4, P(C) = 0.5. Each event has a label +1 or -1 with conditional probabilities:

  • P(1 | A) = 0.9, so P(-1 | A) = 0.1
  • P(1 | B) = 0.3, so P(-1 | B) = 0.7
  • P(1 | C) = 0.8, so P(-1 | C) = 0.2

The Bayes optimal classifier C*(x) picks the label with the higher posterior probability for each event. From Bayes’ rule:

P(Y \mid X) = \frac{P(X \mid Y) P(Y)}{P(X)}

Since we’re comparing posteriors for the same x, the P(X) denominator cancels. We just need P(1 | x)P(x) vs P(-1 | x)P(x), which simplifies to comparing P(1 | x) against P(-1 | x) directly:

  • C*(A) = +1 because 0.9 > 0.1
  • C*(B) = -1 because 0.7 > 0.3
  • C*(C) = +1 because 0.8 > 0.2

The expected loss is the probability of misclassification:

L = P(A) \cdot P(-1 \mid A) + P(B) \cdot P(1 \mid B) + P(C) \cdot P(-1 \mid C)

L = 0.1 \cdot 0.1 + 0.4 \cdot 0.3 + 0.5 \cdot 0.2 = 0.01 + 0.12 + 0.10 = 0.23

So even the best possible classifier gets 23% of the samples wrong. That’s the Bayes risk for this problem, and no classifier can beat it.
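The whole calculation is small enough to sanity-check in a few lines of plain Python (the dictionary and variable names here are mine):

```python
# Priors P(event) and conditionals P(+1 | event) from the problem statement
priors = {'A': 0.1, 'B': 0.4, 'C': 0.5}
p_pos  = {'A': 0.9, 'B': 0.3, 'C': 0.8}

# Bayes optimal label per event: whichever posterior is larger
c_star = {e: +1 if p_pos[e] > 0.5 else -1 for e in priors}

# Bayes risk: the probability mass of the losing label in each region
risk = sum(priors[e] * (1 - p_pos[e] if c_star[e] == +1 else p_pos[e])
           for e in priors)
print(c_star)   # {'A': 1, 'B': -1, 'C': 1}
print(risk)     # ≈ 0.23
```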

Gaussian mixture decision boundary#

Now a continuous version. A probability distribution on the real line is a mixture of two classes with densities N(1, 2) (mean 1, variance 2) and N(4, 1) (mean 4, variance 1), with prior probabilities 0.4 and 0.6.

The Gaussian density is:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

For class +1: \frac{1}{2\sqrt{\pi}} \exp\left(-\frac{(x-1)^2}{4}\right)

For class -1: \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-4)^2}{2}\right)

The Bayes decision rule classifies based on which weighted density is larger. Setting P(+1) \cdot f_{+1}(x) = P(-1) \cdot f_{-1}(x) and solving for x gives the decision boundary at x = 2.42.

The Bayes decision rule: predict +1 if x ≤ 2.42, predict -1 otherwise. (Strictly speaking, because class +1 has the larger variance, the weighted densities cross again near x ≈ 11.6 and +1 wins beyond that point, but the probability mass out there is negligible.)

The Bayes risk is the total probability of error:

P(\text{error}) = P(+1) \cdot P(X > 2.42 \mid +1) + P(-1) \cdot P(X \leq 2.42 \mid -1)

Standardizing: P(X > 2.42 \mid +1) = 1 - \Phi\left(\frac{2.42 - 1}{\sqrt{2}}\right) = 1 - \Phi(1.00) \approx 0.1587

And: P(X \leq 2.42 \mid -1) = \Phi\left(\frac{2.42 - 4}{1}\right) = \Phi(-1.58) \approx 0.0571

Total error \approx 0.4 \cdot 0.1587 + 0.6 \cdot 0.0571 \approx 0.0635 + 0.0343 \approx 0.098.
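Both the boundary and the risk are easy to verify numerically with scipy (a quick sketch; brentq just finds where the weighted densities cross, and there is a sign change between the two means):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# Difference of the weighted class densities: 0.4*N(1, var=2) - 0.6*N(4, var=1)
g = lambda x: (0.4 * norm.pdf(x, loc=1, scale=np.sqrt(2))
               - 0.6 * norm.pdf(x, loc=4, scale=1))

x_star = brentq(g, 1, 4)   # root of g between the two means
risk = (0.4 * norm.sf(x_star, loc=1, scale=np.sqrt(2))
        + 0.6 * norm.cdf(x_star, loc=4, scale=1))
print(x_star, risk)        # boundary ≈ 2.42, Bayes risk ≈ 0.097
```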

Here’s what the two densities look like:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm  # scipy's norm.pdf (matplotlib.mlab.normpdf no longer exists)

mu1, sigma1 = 1, np.sqrt(2)
x1 = np.linspace(mu1 - 3*sigma1, mu1 + 3*sigma1, 100)
plt.plot(x1, norm.pdf(x1, mu1, sigma1))

mu2, sigma2 = 4, 1
x2 = np.linspace(mu2 - 3*sigma2, mu2 + 3*sigma2, 100)
plt.plot(x2, norm.pdf(x2, mu2, sigma2))

plt.title('Illustration of the two classes in Q2')
plt.show()

Two Gaussian density curves for classes +1 and -1

The two class densities: N(1,2) in blue and N(4,1) in orange. The overlap region around x = 2.42 is where the Bayes classifier makes errors.

k-NN expected loss#

How does k-nearest neighbors compare to the Bayes optimal? For a 2-class problem with P(Y = 1 | X) = p_x and η = min(p_x, 1 - p_x), the k-NN loss is:

L(k) = E[\eta] + E\left[(1 - 2\eta) \Pr\left\{B(k, p_x) > \frac{k}{2} \mid X\right\}\right]

where B(k, p_x) is a binomial random variable counting how many of the k neighbors belong to the minority class. The binomial PMF, for n trials with success probability p, is:

\Pr\{X = j\} = \binom{n}{j} p^j (1-p)^{n-j}

The first term E[η] is the Bayes risk R*. The second term is always non-negative, so k-NN always does at least as badly as Bayes. For k = 3, the error R^{(3)} is bounded by:

R^* \leq R^{(3)} \leq 2R^*(1 - R^*)

The upper bound 2R*(1 - R*) also applies to 1-NN. The difference is that 3-NN can be tighter in practice because majority voting among 3 neighbors smooths out some noise. But the theoretical worst case is the same.

Note that the empirical loss of 1-NN on its own training data is always zero (each point is its own nearest neighbor), which says nothing about generalization. The 3-NN empirical loss on training data is nonzero and more informative, though still optimistic.
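The loss formula is easy to play with numerically. A per-point version (treating η as fixed rather than taking the expectation over X; knn_asymptotic_loss is my own name) shows the 1-NN case collapsing to exactly 2η(1 - η) and the 3-NN case landing between the two bounds:

```python
from math import comb

def knn_asymptotic_loss(k, eta):
    """Per-point k-NN loss: eta + (1 - 2*eta) * Pr{B(k, eta) > k/2}."""
    # Probability that a majority of the k neighbors come from the minority class
    flip = sum(comb(k, j) * eta**j * (1 - eta)**(k - j)
               for j in range(k // 2 + 1, k + 1))
    return eta + (1 - 2 * eta) * flip

eta = 0.1                                # local Bayes risk at this point
l1 = knn_asymptotic_loss(1, eta)         # 2 * 0.1 * 0.9 = 0.18
l3 = knn_asymptotic_loss(3, eta)         # 0.1 + 0.8 * 0.028 = 0.1224
print(eta <= l3 <= 2 * eta * (1 - eta))  # True: R* <= R(3) <= 2R*(1 - R*)
```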

Curse of dimensionality#

This is where the theory meets reality. The experiment: generate 2000 points from two equally weighted spherical Gaussians N(0, I) and N((3, 0, 0, …), I) in R^p for p = 1, 11, 21, …, 101. Each point gets a label based on which Gaussian it came from. Train 1-NN and 3-NN classifiers, test on 1000 fresh points, and plot error rate vs. dimensionality.

Here’s the 1D case first. We generate 2000 samples from each distribution and look at their histograms:

import numpy as np
import matplotlib.pyplot as plt

NUMBER_OF_POINTS = 2000
mu1, mu2, sigma = 0, 3, 1

S1 = np.random.normal(mu1, sigma, NUMBER_OF_POINTS)
S2 = np.random.normal(mu2, sigma, NUMBER_OF_POINTS)

Histogram of samples from N(0,1)

Histogram of 2000 samples from N(0,1) with the theoretical density curve overlaid in red.

Histogram of samples from N(3,1)

Histogram of 2000 samples from N(3,1). The two distributions are separated by 3 standard deviations.

In 1D with means 3 apart and unit variance, these distributions are well-separated. A nearest neighbor classifier should do fine. But what happens when we keep the same separation of 3 along the first axis and add dimensions?

The function that runs the experiment for arbitrary dimensionality pp:

Then loop over dimensions:

y1, y3, x = [], [], []

for i in range(11, 110, 10):
    x.append(i)
    a, b = test_nn(i, 2000, 1000)
    y1.append(a)
    y3.append(b)

plt.plot(x, y1)
plt.plot(x, y3)
plt.legend(['1-NN', '3-NN'])
plt.title('Plot of error rate vs p')
plt.show()

Error rate vs dimensionality for 1-NN and 3-NN

Error rates for 1-NN and 3-NN as dimensionality increases from 11 to 101. Both classifiers converge to around 50% error, which is random guessing.

Both classifiers degrade to about 50% error (random guessing) as the dimensionality goes up, even though the class means are always 3 apart along the first coordinate. This is the curse of dimensionality. In high dimensions, the distance between any two points concentrates around the same value. The signal along dimension 1 gets drowned out by the noise from the other p - 1 dimensions. With 2000 training points in 101-dimensional space, the data is incredibly sparse, and nearest-neighbor distances become meaningless.

The 1-NN and 3-NN curves track each other closely, both hovering around 49-53% error for p > 20 or so. Having 3 neighbors instead of 1 doesn’t help when all neighbors are equally far away.
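The concentration claim can be checked directly. Between the two classes the expected squared distance is 9 + 2p, versus 2p within a class, so the relative gap shrinks like 1/p (a quick numerical sketch; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for p in (1, 11, 101):
    a = rng.standard_normal((1000, p))       # class 0: N(0, I)
    b = rng.standard_normal((1000, p))
    b[:, 0] += 3                             # class 1: shifted by 3 on axis 0
    within = np.mean(np.sum((a[:500] - a[500:]) ** 2, axis=1))  # ~ 2p
    cross = np.mean(np.sum((a - b) ** 2, axis=1))               # ~ 9 + 2p
    ratios[p] = cross / within
    print(p, ratios[p])   # the ratio falls toward 1 as p grows
```

At p = 1 the cross-class distances are about 5x the within-class ones; at p = 101 the two are nearly indistinguishable, which is exactly why nearest neighbors stop carrying class information.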

VC dimension of disks#

VC dimension measures how complex a hypothesis class is by asking: what’s the largest set of points it can shatter (classify in all 2^n possible ways)?

For indicator functions of disks in R^2 (functions that are +1 inside a circle, -1 outside, but not the other way around), the analysis goes one point count at a time.

1 point: We can always draw a disk containing the point (label +1) or not containing it (label -1). Trivially shattered.

2 points: Four labelings to handle. Both +1: draw a disk with the two points diametrically opposite. Both -1: any disk that misses both. One +1 and one -1: center the disk on the +1 point with radius small enough to exclude the -1 point. All four cases work.

3 points: If they form a triangle, the circumcircle passes through all three (all +1). For cases with one or two -1s, we can shrink or shift the disk to include only the +1 points. All -1: any disk avoiding all three. This works for all 8 labelings.

4 points: Here it breaks. Consider four points in convex position, forming a quadrilateral. Take the labeling where one pair of diagonal points is +1 and the other pair is -1. Any disk containing the two +1 points (which are diagonal) would have to stretch across the quadrilateral, and the two -1 points (also diagonal) would fall inside. If you flip which diagonal pair is +1, you hit the same problem from the other direction. You can’t separate diagonal pairs with a convex region. (And if the four points are not in convex position, one lies inside the triangle of the other three, so labeling that point -1 and the rest +1 fails instead.)

Four points A, B, C, D with disk constructions showing shattering impossibility

Attempting to shatter 4 points with circles. When trying to include opposite diagonal points while excluding the other diagonal, the circle must be too large to avoid the excluded points.

So no set of 4 points can be shattered, which caps the VC dimension at 3.

The problem specifies one-sided indicators only: +1 inside, -1 outside, never the reverse. Under this restriction, not every 3-point set is shatterable: place 3 points on a line with the middle one labeled -1 and the outer two labeled +1. No disk can contain both outer points without also containing the middle one, because disks are convex.

Three collinear points showing shattering failure with one-sided disk indicators

Three collinear points with labels +1, -1, +1. No circle can contain both outer points without containing the middle one, since circles are convex.

But shattering only asks for some set of 3 points, not every one, and the triangle case above already works with one-sided disks. So the collinear failure doesn’t lower anything: the VC dimension of disks is 3, with or without the one-sided restriction.

VC dimension of axis-aligned rectangles#

For indicator functions of axis-aligned rectangular boxes in R^2 (+1 inside the box, -1 outside):

1 point: Trivial. Box around it for +1, box elsewhere for -1.

2 points: All 4 labelings work. We can make a box containing any subset of the two points.

3 points: Any triangle formed by 3 points can be enclosed in an axis-aligned bounding box. For subsets, we can make tighter boxes that include only the desired points. All 8 labelings are achievable.

4 points: Assign the points to roles by their extremal positions: topmost, bottommost, leftmost, rightmost. (If a point is both leftmost and topmost, some cases degenerate to fewer than 4 distinct extreme points, which reduces to the 3-point case.) For 4 distinct extreme points, the bounding box with width from leftmost to rightmost and height from bottommost to topmost contains all four. Any subset of 1, 2, or 3 points can be enclosed by adjusting the box boundaries. The empty set is just a box that misses all points. All 16 labelings work.

5 points: With 5 points, at least one point must lie inside (or on the boundary of) the bounding box defined by the other four extreme points (top, bottom, left, right). You can’t label that interior point as -1 while labeling all four boundary-defining points as +1, because the bounding box that contains the four extremal points necessarily contains the interior point too.

The VC dimension is 4.
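For axis-aligned boxes the whole argument reduces to one test: a labeling is achievable iff the tightest box around the +1 points contains no -1 point, since any box containing the +1 points contains that tightest box. That makes shattering brute-forceable (the helper name and test points are mine):

```python
import itertools
import numpy as np

def boxes_shatter(points):
    """True iff axis-aligned boxes realize all 2^n labelings of `points`."""
    pts = np.asarray(points, dtype=float)
    for labels in itertools.product([False, True], repeat=len(pts)):
        mask = np.array(labels)
        pos, neg = pts[mask], pts[~mask]
        if len(pos) == 0:
            continue  # an empty box handles the all-negative labeling
        lo, hi = pos.min(axis=0), pos.max(axis=0)
        # The tightest box around the +1 points must exclude every -1 point.
        if any(np.all((q >= lo) & (q <= hi)) for q in neg):
            return False
    return True

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(boxes_shatter(diamond))              # True: 4 extreme points are shattered
print(boxes_shatter(diamond + [(0, 0)]))   # False: the interior center breaks it
```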

The jump from 3 (for disks) to 4 (for rectangles) makes sense geometrically. Rectangles have 4 independent parameters (left, right, top, bottom edges), while circles have 3 (center x, center y, radius). VC dimension tends to track the number of free parameters, though that’s a heuristic, not a theorem.

Bayes Classifiers, k-NN, and VC Dimension
https://theanig.dev/blog/sml-hw1-bayes-classifiers-and-vc-dimension
Author Anirudh Ganesh
Published June 10, 2020