Gaussian kernel SVM classification with random Fourier feature approximation

Abstract#

Linear SVM vs least squares classification, kernel methods with Gaussian and Laplacian kernels, random Fourier features for kernel approximation, a proof that $L^2[0,1]$ is not an RKHS, and a discussion of smoothing splines as kernel methods.

This was the second homework for statistical machine learning, and the problems all orbit around the same question: how do you get a nonlinear classifier out of a linear framework? The dataset is MNIST digits 7 vs 9 (2000 training images, 2000 test images, each 28x28 = 784 pixels). We start with a linear SVM baseline, build up to kernel methods, then approximate the kernel with random features, and finish with some theory on reproducing kernel Hilbert spaces and smoothing splines.

The dataset#

The 79.mat file contains 2000 training images from MNIST: 1000 sevens and 1000 nines, each flattened to a 784-dimensional vector. The test set test79.mat has the same structure. Labels are encoded as $+1$ for nines and $-1$ for sevens (or as 7/9 directly, depending on which part of the code you’re looking at).

```python
import scipy.io
import numpy as np

train = scipy.io.loadmat('79.mat')
test = scipy.io.loadmat('test79.mat')

X = np.array(train['d79'])       # (2000, 784)
test_X = np.array(test['d79'])   # (2000, 784)

# first 1000 rows are sevens, next 1000 are nines
y = np.append(np.ones(1000) * 7, np.ones(1000) * 9)
```

Linear SVM#

The simplest baseline. Scikit-learn’s LinearSVC with default parameters:

```python
from sklearn.svm import LinearSVC

clf = LinearSVC(random_state=0)
clf.fit(X, y)
# test labels reuse y: the test set shares the same 1000/1000 ordering
print(clf.score(test_X, y))  # 0.926
```

92.6% accuracy. Not bad for a linear model on raw pixels, but that means about 148 test images get misclassified. Sevens and nines do look similar (especially sloppy handwriting), so a linear boundary in 784-dimensional pixel space has limits.

Least squares classifier#

Instead of SVM, we can frame classification as a least squares problem. Encode the labels as one-hot vectors: $[1, -1]$ for sevens and $[-1, 1]$ for nines. Then solve for a weight matrix $W$ that maps augmented inputs $\tilde{x} = [1, x]$ to these label vectors.

The objective is:

$$W^* = \arg\min_W \sum_{i=1}^n \|\tilde{x}_i^T W - y_i\|^2$$

In closed form: $W = (X^T X)^{-1} X^T Y$. But $X^T X$ can be singular or badly conditioned (many border pixels are zero in every image, so columns are linearly dependent), so we add Tikhonov regularization:

$$W = (X^T X + \sigma I)^{-1} X^T Y$$

with $\sigma = 0.001$:
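A sketch of that solve (the helper name fit_least_squares and the target construction are illustrative, not from the original code):

```python
import numpy as np

def fit_least_squares(X, Y, sigma=0.001):
    """Tikhonov-regularized least squares: W = (X~^T X~ + sigma I)^{-1} X~^T Y.

    X is (n, d); Y is (n, 2) with rows [1, -1] for sevens and [-1, 1] for nines.
    A column of ones is prepended to X for the bias term.
    """
    Xt = np.hstack([np.ones((X.shape[0], 1)), X])
    d = Xt.shape[1]
    # solve the regularized normal equations instead of forming an explicit inverse
    return np.linalg.solve(Xt.T @ Xt + sigma * np.eye(d), Xt.T @ Y)
```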

Prediction takes the class with the higher output value:

```python
def predict(W, x):
    x = np.append(1, x)          # prepend the bias term
    values = W.T @ x             # one score per class
    y = np.zeros(len(values))
    y[np.argmax(values)] = 1     # pick the class with the higher score
    return y
```

Result: 93.85% accuracy (1877 out of 2000 correct). The least squares classifier beats the linear SVM by about 1.2 percentage points. Both are linear models, but they optimize different loss functions. SVM uses hinge loss and finds the maximum-margin hyperplane; least squares minimizes squared error to one-hot targets. On this particular dataset, the squared error approach works a bit better.

Gaussian kernel least squares#

Linear classifiers top out around 93%. To go further, we need nonlinearity. The kernel trick replaces the dot product $x_i^T x_j$ with a kernel function $k(x_i, x_j)$ that implicitly maps the data to a higher-dimensional space where a linear boundary might work better.

The Gaussian (RBF) kernel:

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$

The kernel least squares classifier works by building a kernel matrix $K$ where $K_{ij} = k(x_i, x_j)$, then solving for coefficients $\alpha$:

$$\alpha = (K + \lambda I)^{-1} y$$

where $\lambda$ is a regularization parameter. At prediction time, for a new point $x$:

$$f(x) = \sum_{j=1}^n \alpha_j \, k(x_j, x)$$

and the sign of $f(x)$ gives the predicted class.

```python
from sklearn.preprocessing import scale

def gaussian_kernel(xi, xj, sigma=2.5):
    # note: sigma here plays the role of sigma^2 in the formula above,
    # i.e. this computes exp(-||xi - xj||^2 / (2 * sigma))
    return np.exp(-np.linalg.norm(xi - xj)**2 * 0.5 / sigma)

def gauss_lss(X, y, penalty=1e-4, sigma=1e+4):
    X = scale(X)                 # standardize each pixel feature
    N = X.shape[0]
    K = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = gaussian_kernel(X[i, :], X[j, :], sigma)

    # solve (K + penalty*I) alpha = y; more stable than inverting
    return np.linalg.solve(K + penalty * np.identity(N), y)
```

With $\lambda = 10^{-4}$ and $\sigma = 10^4$, this gets 98.1% accuracy. That’s a big jump from the linear models.
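Prediction then follows the expansion $f(x) = \sum_j \alpha_j k(x_j, x)$ directly. A sketch (kernel_predict is an illustrative name; as with training, test points must be scaled the same way as the training set):

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1e4):
    # same parameterization as the training code: sigma plays the
    # role of sigma^2 in the written formula
    return np.exp(-np.linalg.norm(xi - xj) ** 2 * 0.5 / sigma)

def kernel_predict(alpha, X_train, x, sigma=1e4):
    """Sign of f(x) = sum_j alpha_j k(x_j, x) gives the class (+1 / -1)."""
    k = np.array([gaussian_kernel(xj, x, sigma) for xj in X_train])
    return np.sign(alpha @ k)
```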

Laplacian kernel#

The Laplacian kernel uses an unsquared distance in the exponent (classically the L1 norm, though the implementation below uses the unsquared L2 norm):

$$k(x, y) = \exp\left(-\frac{\|x - y\|}{\sigma}\right)$$

The implementation is almost identical, just swap out the norm calculation:

```python
def laplace_kernel(xi, xj, sigma=2.5):
    # unsquared norm in the exponent, no factor of 1/2
    return np.exp(-np.linalg.norm(xi - xj) / sigma)
```

Result: 97.95% accuracy. Basically the same as the Gaussian kernel on this dataset. The Laplacian kernel produces a rougher feature space (it corresponds to a Matérn kernel with $\nu = 1/2$), but for 7-vs-9 classification the difference is negligible.

Parameter sensitivity#

Kernel methods have two hyperparameters to tune: $\lambda$ (regularization) and $\sigma$ (kernel bandwidth).

Setting $\lambda = 10^{-8}$ (weaker regularization) drops accuracy to 97.75%. The model fits the training data more tightly but generalizes slightly worse. The optimal $\lambda = 10^{-4}$ was found through cross-validation.

The bandwidth $\sigma$ controls how “wide” the kernel is. If $\sigma$ is too small, each point only “sees” its immediate neighbors, and the classifier becomes overly local. If $\sigma$ is too large, every point looks similar to every other point, and the kernel matrix approaches a matrix of all ones, throwing away all structure.

| $\sigma$ | Accuracy |
| --- | --- |
| 100 | 97.85% |
| 1,000 | 98.25% |
| 10,000 | 98.1% |
| 100,000 | 95.8% |

The sweet spot is around $\sigma = 1000$. At $\sigma = 100{,}000$, the kernel flattens every pairwise distance so much that the contributions become too uniform to be useful.
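A quick numeric illustration of that effect, using the code’s parameterization (where sigma divides the squared distance directly; the distance values below are made up, but of a plausible order of magnitude for standardized 784-dimensional vectors):

```python
import numpy as np

# illustrative squared distances between a query point and three neighbors
sq_dists = np.array([800.0, 1500.0, 2500.0])

for sigma in [100, 1_000, 10_000, 100_000]:
    k = np.exp(-sq_dists * 0.5 / sigma)
    print(f"sigma={sigma:>6}: kernel values {np.round(k, 3)}")
# small sigma: values differ by orders of magnitude, so only the
#   nearest neighbors contribute (overly local)
# large sigma: all values approach 1, so every point looks alike
```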

Random Fourier features#

The kernel matrix approach has a problem: for $n$ training points, you need an $n \times n$ matrix ($2000 \times 2000 = 4$ million entries) and its inverse. That doesn’t scale. Random Fourier features (Rahimi and Recht, 2007) approximate the kernel with a finite-dimensional explicit feature map, turning the kernel method back into a linear one in a new feature space.

The idea: for a shift-invariant kernel like the Gaussian, Bochner’s theorem guarantees that the kernel can be written as:

$$k(x, y) = E_w[\cos(w^T(x - y))]$$

where $w$ is drawn from the Fourier transform of the kernel. We approximate this expectation by sampling $D$ random vectors $w_1, \ldots, w_D$ and biases $b_1, \ldots, b_D$:

$$z(x) = \cos(\gamma w^T x + b)$$

where $w \in \mathbb{R}^{D \times d}$, $b \sim \text{Uniform}(0, 2\pi)$, and $\gamma$ is a scaling parameter. Then $z(x)^T z(y) \approx k(x, y)$ for large enough $D$.

Once we have $z(x)$ for each training point, it’s just a linear least squares problem in $D$ dimensions instead of an $n \times n$ kernel problem:

$$(Z Z^T + \lambda I) \alpha = Z y$$
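A minimal sketch of the feature map and fit (assumptions: the rows of $w$ are drawn i.i.d. standard normal, and the $\sqrt{2/D}$ normalization from the original paper is omitted to match the plain cosine features described above; with $Z$ stored as $(n, D)$, the system becomes $(Z^T Z + \lambda I)\alpha = Z^T y$):

```python
import numpy as np

def random_fourier_features(X, D=600, gamma=0.01, seed=0):
    """Map X of shape (n, d) to Z of shape (n, D) via z(x) = cos(gamma * w^T x + b)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((D, X.shape[1]))   # random frequency directions
    b = rng.uniform(0, 2 * np.pi, size=D)      # random phases
    return np.cos(gamma * X @ W.T + b)

def fit_rff(X, y, D=600, gamma=0.01, penalty=0.1):
    """Ridge regression in the random feature space."""
    Z = random_fourier_features(X, D, gamma)
    return np.linalg.solve(Z.T @ Z + penalty * np.eye(D), Z.T @ y)
```

Prediction is then the sign of `random_fourier_features(x) @ alpha`, a purely linear operation in the $D$-dimensional feature space.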

The cross-validation results for each $D$:

| $D$ | Best $\gamma$ | Best penalty | Validation accuracy |
| --- | --- | --- | --- |
| 100 | 0.01 | 1.0 | 95.5% |
| 200 | 0.001 | 0.01 | 94.0% |
| 300 | 0.01 | 0.01 | 98.0% |
| 400 | 0.01 | 1.0 | 98.0% |
| 500 | 0.01 | 1.0 | 98.5% |
| 600 | 0.01 | 0.1 | 99.0% |

Two things stand out. First, $\gamma = 0.01$ is optimal for nearly every $D$. Second, accuracy goes up steadily with $D$, and at $D = 600$ the random feature approximation actually matches or exceeds the exact kernel (98.1%). With 600 features, we’re approximating a $2000 \times 2000$ kernel matrix using a 600-dimensional linear model, and it works just as well.

The $D = 200$ dip to 94% is probably a fluke of the random $w$ draw and the single validation fold. With more folds or a different random seed it would likely be higher.

The takeaway: random Fourier features give you the power of kernel methods at the cost of a linear model. For $D \ll n$, that’s a big computational win. And the approximation gets better as you increase $D$, converging to the exact kernel in the limit.

Why $L^2[0,1]$ is not an RKHS#

A reproducing kernel Hilbert space (RKHS) is a Hilbert space $\mathcal{H}$ of functions equipped with a kernel $k(\cdot, \cdot)$ such that evaluation at any point is a bounded linear functional. Formally, for all $f \in \mathcal{H}$:

$$f(x) = \langle k(x, \cdot), f(\cdot) \rangle$$

This is called the reproducing property: the kernel “reproduces” function values via the inner product.

The space $L^2[0,1]$ is the space of square-integrable functions on $[0,1]$ with the inner product:

$$\langle f, g \rangle = \int_0^1 f(x) g(x) \, dx$$

This is a Hilbert space. It satisfies symmetry ($\langle f, g \rangle = \langle g, f \rangle$), bilinearity ($\langle \alpha u + \beta v, w \rangle = \alpha \langle u, w \rangle + \beta \langle v, w \rangle$), and positive definiteness ($\langle f, f \rangle \geq 0$ with equality iff $f = 0$ a.e.).

But it’s not an RKHS, because point evaluation is not a bounded functional on $L^2$. If $L^2[0,1]$ had a reproducing kernel $k$, then by the reproducing property:

$$f(x) = \int_0^1 k(x, u) f(u) \, du$$

The only “function” that reproduces point evaluation this way is the Dirac delta $\delta(x - u)$:

$$f(x) = \int_0^1 \delta(x - u) f(u) \, du$$

But the delta function is not square-integrable:

$$\int_0^1 \delta(u)^2 \, du = \infty$$

So $\delta \notin L^2[0,1]$, which means there’s no function in the space that can serve as a reproducing kernel. Point evaluation in $L^2$ is not well-defined anyway, since $L^2$ functions are equivalence classes that can differ on sets of measure zero.
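A concrete family of functions makes the unboundedness explicit. Consider the unit-norm spikes:

```latex
f_n(x) = \sqrt{n}\,\mathbf{1}_{[0,1/n]}(x),
\qquad \|f_n\|_{L^2}^2 = \int_0^{1/n} n \, dx = 1,
\qquad f_n(0) = \sqrt{n} \to \infty
```

Every $f_n$ has unit norm, yet its value at $0$ grows without bound, so no constant $C$ can give $|f(0)| \le C \|f\|$ for all $f$, which is exactly the boundedness a reproducing kernel would provide via Cauchy–Schwarz.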

This is why we need the RKHS framework instead of just any Hilbert space of functions. In an RKHS, the kernel provides a “smooth” way to evaluate functions at points, and that smoothness is precisely what makes kernel methods work.

Smoothing splines and kernels#

Smoothing splines are a form of regularized regression that turns out to be connected to kernel methods.

A spline is a piecewise polynomial, where the pieces join at “knot points” with continuity conditions on the function and its derivatives. The degree of the polynomial determines the smoothness. Cubic splines (degree 3) are the most common: each piece is a cubic polynomial, and consecutive pieces share the same value, first derivative, and second derivative at their shared knot.

The problem with basic spline fitting is knot selection. Put too many knots and you overfit. Too few and you underfit. The choice of where to place them matters too.

Smoothing splines avoid knot selection entirely by placing a knot at every data point and then regularizing. The objective is:

$$\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int (f''(x))^2 \, dx$$

The first term is the data fit (squared error). The second term penalizes curvature: $f''(x)$ is the second derivative, and integrating its square penalizes wiggly functions. The parameter $\lambda$ controls the tradeoff. As $\lambda \to 0$, we interpolate the data. As $\lambda \to \infty$, we get a straight line (linear regression).
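In practice this tradeoff is one line in scipy. A sketch using scipy.interpolate.UnivariateSpline (a tooling choice assumed here, not from the original post; its s parameter bounds the residual sum of squares rather than weighting the curvature integral directly, but it plays the same interpolate-versus-smooth role as $\lambda$):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(100)

interp = UnivariateSpline(x, y, s=0)     # s=0: interpolate every point
smooth = UnivariateSpline(x, y, s=1.0)   # larger s: fewer knots, smoother fit
```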

The connection to kernel methods: the solution to this optimization problem can be written as:

$$f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$$

where $k$ is a specific kernel determined by the penalty term. For the second-derivative penalty above, the corresponding kernel is related to the Green’s function of the fourth-order differential operator. The $\alpha_i$ coefficients are found by solving a regularized linear system, just like in kernel ridge regression.

From a practical standpoint, smoothing splines are simpler to use than general kernel methods because the only hyperparameter is λ\lambda, which can be tuned by cross-validation. There’s no kernel choice or bandwidth parameter to worry about. And because the basis functions are localized (each spline basis function is nonzero only in a small interval), the kernel matrix has special structure that makes computation efficient.

Kernel Methods and Random Fourier Features
https://theanig.dev/blog/sml-hw2-kernel-methods-and-random-features
Author: Anirudh Ganesh
Published: June 25, 2020