Gaussian kernel SVM classification with random Fourier feature approximation

Abstract#

Linear SVM vs least squares classification, kernel methods with Gaussian and Laplacian kernels, random Fourier features for kernel approximation, a proof that $L^2[0,1]$ is not an RKHS, and a discussion of smoothing splines as kernel methods.

This was the second homework for statistical machine learning, and the problems all orbit around the same question: how do you get a nonlinear classifier out of a linear framework? The dataset is MNIST digits 7 vs 9 (2000 training images, 2000 test images, each 28x28 = 784 pixels). We start with a linear SVM baseline, build up to kernel methods, then approximate the kernel with random features, and finish with some theory on reproducing kernel Hilbert spaces and smoothing splines.

The dataset#

The 79.mat file contains 2000 training images from MNIST: 1000 sevens and 1000 nines, each flattened to a 784-dimensional vector. The test set test79.mat has the same structure. Labels are encoded as $+1$ for nines and $-1$ for sevens (or as 7/9 directly, depending on which part of the code you’re looking at).

```python
import scipy.io
import numpy as np

train = scipy.io.loadmat('79.mat')
test = scipy.io.loadmat('test79.mat')

X = np.array(train['d79'])       # (2000, 784)
test_X = np.array(test['d79'])   # (2000, 784)

# first 1000 rows are sevens, next 1000 are nines
y = np.append(np.ones(1000) * 7, np.ones(1000) * 9)
```

Linear SVM#

The simplest baseline. Scikit-learn’s LinearSVC with default parameters:

```python
from sklearn.svm import LinearSVC

clf = LinearSVC(random_state=0)
clf.fit(X, y)
# test labels reuse y: the test set shares the same 1000/1000 ordering
print(clf.score(test_X, y))  # 0.926
```

92.6% accuracy. Not bad for a linear model on raw pixels, but that means about 148 test images get misclassified. Sevens and nines do look similar (especially sloppy handwriting), so a linear boundary in 784-dimensional pixel space has limits.

Least squares classifier#

Instead of SVM, we can frame classification as a least squares problem. Encode the labels as one-hot vectors: $[1, -1]$ for sevens and $[-1, 1]$ for nines. Then solve for a weight matrix $W$ that maps augmented inputs $\tilde{x} = [1, x]$ to these label vectors.

The objective is:

$$W^* = \arg\min_W \sum_{i=1}^n \|\tilde{x}_i^T W - y_i\|^2$$

In closed form: $W = (X^T X)^{-1} X^T Y$. But $X^T X$ can be singular or badly conditioned (many border pixels are zero in every image, so columns are linearly dependent), so we add Tikhonov regularization:

$$W = (X^T X + \sigma I)^{-1} X^T Y$$

with $\sigma = 0.001$:
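A sketch of that solve (the helper name fit_least_squares and the target construction are illustrative, not from the original code):

```python
import numpy as np

def fit_least_squares(X, Y, sigma=0.001):
    """Tikhonov-regularized least squares: W = (X~^T X~ + sigma I)^{-1} X~^T Y.

    X is (n, d); Y is (n, 2) with rows [1, -1] for sevens and [-1, 1] for nines.
    A column of ones is prepended to X for the bias term.
    """
    Xt = np.hstack([np.ones((X.shape[0], 1)), X])
    d = Xt.shape[1]
    # solve the regularized normal equations instead of forming an explicit inverse
    return np.linalg.solve(Xt.T @ Xt + sigma * np.eye(d), Xt.T @ Y)
```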

Prediction takes the class with the higher output value:

```python
def predict(W, x):
    x = np.append(1, x)          # prepend the bias term
    values = W.T @ x             # one score per class
    y = np.zeros(len(values))
    y[np.argmax(values)] = 1     # pick the class with the higher score
    return y
```

Result: 93.85% accuracy (1877 out of 2000 correct). The least squares classifier beats the linear SVM by about 1.2 percentage points. Both are linear models, but they optimize different loss functions. SVM uses hinge loss and finds the maximum-margin hyperplane; least squares minimizes squared error to one-hot targets. On this particular dataset, the squared error approach works a bit better.

Gaussian kernel least squares#

Linear classifiers top out around 93%. To go further, we need nonlinearity. The kernel trick replaces the dot product $x_i^T x_j$ with a kernel function $k(x_i, x_j)$ that implicitly maps the data to a higher-dimensional space where a linear boundary might work better.

The Gaussian (RBF) kernel:

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$

The kernel least squares classifier works by building a kernel matrix $K$ where $K_{ij} = k(x_i, x_j)$, then solving for coefficients $\alpha$:

$$\alpha = (K + \lambda I)^{-1} y$$

where $\lambda$ is a regularization parameter. At prediction time, for a new point $x$:

$$f(x) = \sum_{j=1}^n \alpha_j \, k(x_j, x)$$

and the sign of $f(x)$ gives the predicted class.

```python
from sklearn.preprocessing import scale

def gaussian_kernel(xi, xj, sigma=2.5):
    # note: sigma here plays the role of sigma^2 in the formula above,
    # i.e. this computes exp(-||xi - xj||^2 / (2 * sigma))
    return np.exp(-np.linalg.norm(xi - xj)**2 * 0.5 / sigma)

def gauss_lss(X, y, penalty=1e-4, sigma=1e+4):
    X = scale(X)                 # standardize each pixel feature
    N = X.shape[0]
    K = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = gaussian_kernel(X[i, :], X[j, :], sigma)

    # solve (K + penalty*I) alpha = y; more stable than inverting
    return np.linalg.solve(K + penalty * np.identity(N), y)
```

With $\lambda = 10^{-4}$ and $\sigma = 10^4$, this gets 98.1% accuracy. That’s a big jump from the linear models.
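Prediction then follows the expansion $f(x) = \sum_j \alpha_j k(x_j, x)$ directly. A sketch (kernel_predict is an illustrative name; as with training, test points must be scaled the same way as the training set):

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1e4):
    # same parameterization as the training code: sigma plays the
    # role of sigma^2 in the written formula
    return np.exp(-np.linalg.norm(xi - xj) ** 2 * 0.5 / sigma)

def kernel_predict(alpha, X_train, x, sigma=1e4):
    """Sign of f(x) = sum_j alpha_j k(x_j, x) gives the class (+1 / -1)."""
    k = np.array([gaussian_kernel(xj, x, sigma) for xj in X_train])
    return np.sign(alpha @ k)
```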

Laplacian kernel#

The Laplacian kernel uses an unsquared distance in the exponent (classically the L1 norm, though the implementation below uses the unsquared L2 norm):

$$k(x, y) = \exp\left(-\frac{\|x - y\|}{\sigma}\right)$$

The implementation is almost identical, just swap out the norm calculation:

```python
def laplace_kernel(xi, xj, sigma=2.5):
    # unsquared norm in the exponent, no factor of 1/2
    return np.exp(-np.linalg.norm(xi - xj) / sigma)
```

Result: 97.95% accuracy. Basically the same as the Gaussian kernel on this dataset. The Laplacian kernel produces a rougher feature space (it corresponds to a Matérn kernel with $\nu = 1/2$), but for 7-vs-9 classification the difference is negligible.

Parameter sensitivity#

Kernel methods have two hyperparameters to tune: $\lambda$ (regularization) and $\sigma$ (kernel bandwidth).

Setting $\lambda = 10^{-8}$ (weaker regularization) drops accuracy to 97.75%. The model fits the training data more tightly but generalizes slightly worse. The optimal $\lambda = 10^{-4}$ was found through cross-validation.

The bandwidth $\sigma$ controls how “wide” the kernel is. If $\sigma$ is too small, each point only “sees” its immediate neighbors, and the classifier becomes overly local. If $\sigma$ is too large, every point looks similar to every other point, and the kernel matrix approaches a matrix of all ones, throwing away all structure.

| $\sigma$ | Accuracy |
| --- | --- |
| 100 | 97.85% |
| 1,000 | 98.25% |
| 10,000 | 98.1% |
| 100,000 | 95.8% |

The sweet spot is around $\sigma = 1000$. At $\sigma = 100{,}000$, the kernel flattens every pairwise distance so much that the contributions become too uniform to be useful.
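A quick numeric illustration of that effect, using the code’s parameterization (where sigma divides the squared distance directly; the distance values below are made up, but of a plausible order of magnitude for standardized 784-dimensional vectors):

```python
import numpy as np

# illustrative squared distances between a query point and three neighbors
sq_dists = np.array([800.0, 1500.0, 2500.0])

for sigma in [100, 1_000, 10_000, 100_000]:
    k = np.exp(-sq_dists * 0.5 / sigma)
    print(f"sigma={sigma:>6}: kernel values {np.round(k, 3)}")
# small sigma: values differ by orders of magnitude, so only the
#   nearest neighbors contribute (overly local)
# large sigma: all values approach 1, so every point looks alike
```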

Random Fourier features#

The kernel matrix approach has a problem: for $n$ training points, you need an $n \times n$ matrix ($2000 \times 2000 = 4$ million entries) and its inverse. That doesn’t scale. Random Fourier features (Rahimi and Recht, 2007) approximate the kernel with a finite-dimensional explicit feature map, turning the kernel method back into a linear one in a new feature space.

The idea: for a shift-invariant kernel like the Gaussian, Bochner’s theorem guarantees that the kernel can be written as:

$$k(x, y) = E_w[\cos(w^T(x - y))]$$

where $w$ is drawn from the Fourier transform of the kernel. We approximate this expectation by sampling $D$ random vectors $w_1, \ldots, w_D$ and biases $b_1, \ldots, b_D$:

$$z(x) = \cos(\gamma w^T x + b)$$

where $w \in \mathbb{R}^{D \times d}$, $b \sim \text{Uniform}(0, 2\pi)$, and $\gamma$ is a scaling parameter. Then $z(x)^T z(y) \approx k(x, y)$ for large enough $D$.

Once we have $z(x)$ for each training point, it’s just a linear least squares problem in $D$ dimensions instead of an $n \times n$ kernel problem:

$$(Z Z^T + \lambda I) \alpha = Z y$$
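A minimal sketch of the feature map and fit (assumptions: the rows of $w$ are drawn i.i.d. standard normal, and the $\sqrt{2/D}$ normalization from the original paper is omitted to match the plain cosine features described above; with $Z$ stored as $(n, D)$, the system becomes $(Z^T Z + \lambda I)\alpha = Z^T y$):

```python
import numpy as np

def random_fourier_features(X, D=600, gamma=0.01, seed=0):
    """Map X of shape (n, d) to Z of shape (n, D) via z(x) = cos(gamma * w^T x + b)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((D, X.shape[1]))   # random frequency directions
    b = rng.uniform(0, 2 * np.pi, size=D)      # random phases
    return np.cos(gamma * X @ W.T + b)

def fit_rff(X, y, D=600, gamma=0.01, penalty=0.1):
    """Ridge regression in the random feature space."""
    Z = random_fourier_features(X, D, gamma)
    return np.linalg.solve(Z.T @ Z + penalty * np.eye(D), Z.T @ y)
```

Prediction is then the sign of `random_fourier_features(x) @ alpha`, a purely linear operation in the $D$-dimensional feature space.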

The cross-validation results for each $D$:

| $D$ | Best $\gamma$ | Best penalty | Validation accuracy |
| --- | --- | --- | --- |
| 100 | 0.01 | 1.0 | 95.5% |
| 200 | 0.001 | 0.01 | 94.0% |
| 300 | 0.01 | 0.01 | 98.0% |
| 400 | 0.01 | 1.0 | 98.0% |
| 500 | 0.01 | 1.0 | 98.5% |
| 600 | 0.01 | 0.1 | 99.0% |

Two things stand out. First, $\gamma = 0.01$ is optimal for nearly every $D$. Second, accuracy goes up steadily with $D$, and at $D = 600$ the random feature approximation actually matches or exceeds the exact kernel (98.1%). With 600 features, we’re approximating a $2000 \times 2000$ kernel matrix using a 600-dimensional linear model, and it works just as well.

The $D = 200$ dip to 94% is probably a fluke of the random $w$ draw and the single validation fold. With more folds or a different random seed it would likely be higher.

The takeaway: random Fourier features give you the power of kernel methods at the cost of a linear model. For $D \ll n$, that’s a big computational win. And the approximation gets better as you increase $D$, converging to the exact kernel in the limit.

Why $L^2[0,1]$ is not an RKHS#

A reproducing kernel Hilbert space (RKHS) is a Hilbert space $\mathcal{H}$ of functions equipped with a kernel $k(\cdot, \cdot)$ such that evaluation at any point is a bounded linear functional. Formally, for all $f \in \mathcal{H}$:

$$f(x) = \langle k(x, \cdot), f(\cdot) \rangle$$

This is called the reproducing property: the kernel “reproduces” function values via the inner product.

The space $L^2[0,1]$ is the space of square-integrable functions on $[0,1]$ with the inner product:

$$\langle f, g \rangle = \int_0^1 f(x) g(x) \, dx$$

This is a Hilbert space. It satisfies symmetry ($\langle f, g \rangle = \langle g, f \rangle$), bilinearity ($\langle \alpha u + \beta v, w \rangle = \alpha \langle u, w \rangle + \beta \langle v, w \rangle$), and positive definiteness ($\langle f, f \rangle \geq 0$ with equality iff $f = 0$ a.e.).

But it’s not an RKHS, because point evaluation is not a bounded functional on $L^2$. If $L^2[0,1]$ had a reproducing kernel $k$, then by the reproducing property:

$$f(x) = \int_0^1 k(x, u) f(u) \, du$$

The only “function” that reproduces point evaluation this way is the Dirac delta $\delta(x - u)$:

$$f(x) = \int_0^1 \delta(x - u) f(u) \, du$$

But the delta function is not square-integrable:

$$\int_0^1 \delta(u)^2 \, du = \infty$$

So $\delta \notin L^2[0,1]$, which means there’s no function in the space that can serve as a reproducing kernel. Point evaluation in $L^2$ is not well-defined anyway, since $L^2$ functions are equivalence classes that can differ on sets of measure zero.
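A concrete family of functions makes the unboundedness explicit. Consider the unit-norm spikes:

```latex
f_n(x) = \sqrt{n}\,\mathbf{1}_{[0,1/n]}(x),
\qquad \|f_n\|_{L^2}^2 = \int_0^{1/n} n \, dx = 1,
\qquad f_n(0) = \sqrt{n} \to \infty
```

Every $f_n$ has unit norm, yet its value at $0$ grows without bound, so no constant $C$ can give $|f(0)| \le C \|f\|$ for all $f$, which is exactly the boundedness a reproducing kernel would provide via Cauchy–Schwarz.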

This is why we need the RKHS framework instead of just any Hilbert space of functions. In an RKHS, the kernel provides a “smooth” way to evaluate functions at points, and that smoothness is precisely what makes kernel methods work.

Smoothing splines and kernels#

Smoothing splines are a form of regularized regression that turns out to be connected to kernel methods.

A spline is a piecewise polynomial, where the pieces join at “knot points” with continuity conditions on the function and its derivatives. The degree of the polynomial determines the smoothness. Cubic splines (degree 3) are the most common: each piece is a cubic polynomial, and consecutive pieces share the same value, first derivative, and second derivative at their shared knot.

The problem with basic spline fitting is knot selection. Put too many knots and you overfit. Too few and you underfit. The choice of where to place them matters too.

Smoothing splines avoid knot selection entirely by placing a knot at every data point and then regularizing. The objective is:

$$\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int (f''(x))^2 \, dx$$

The first term is the data fit (squared error). The second term penalizes curvature: $f''(x)$ is the second derivative, and integrating its square penalizes wiggly functions. The parameter $\lambda$ controls the tradeoff. As $\lambda \to 0$, we interpolate the data. As $\lambda \to \infty$, we get a straight line (linear regression).
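In practice this tradeoff is one line in scipy. A sketch using scipy.interpolate.UnivariateSpline (a tooling choice assumed here, not from the original post; its s parameter bounds the residual sum of squares rather than weighting the curvature integral directly, but it plays the same interpolate-versus-smooth role as $\lambda$):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(100)

interp = UnivariateSpline(x, y, s=0)     # s=0: interpolate every point
smooth = UnivariateSpline(x, y, s=1.0)   # larger s: fewer knots, smoother fit
```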

The connection to kernel methods: the solution to this optimization problem can be written as:

$$f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$$

where $k$ is a specific kernel determined by the penalty term. For the second-derivative penalty above, the corresponding kernel is related to the Green’s function of the fourth-order differential operator. The $\alpha_i$ coefficients are found by solving a regularized linear system, just like in kernel ridge regression.

From a practical standpoint, smoothing splines are simpler to use than general kernel methods because the only hyperparameter is λ\lambda, which can be tuned by cross-validation. There’s no kernel choice or bandwidth parameter to worry about. And because the basis functions are localized (each spline basis function is nonzero only in a small interval), the kernel matrix has special structure that makes computation efficient.

Kernel Methods and Random Fourier Features
https://theanig.dev/blog/sml-hw2-kernel-methods-and-random-features
Author: Anirudh Ganesh
Published: June 25, 2020