Logistic Regression from Scratch on MNIST • TheAnig

Abstract#

Binary logistic regression implemented from scratch with NumPy on an MNIST-like dataset (28x28 images, 784 features, ~6000 training samples). The gradient computation needed a numerical stability fix to avoid overflow. Tuned the learning rate over 10 values and picked the one with the lowest validation risk.

This was the first assignment where we had to implement an actual optimization loop instead of plugging into a closed-form solution. The dataset was a binary subset of MNIST — 28x28 pixel images flattened to 784-dimensional vectors, with labels $+1$ and $-1$ . About 6100 training samples, with separate validation and test splits provided.

The logistic loss#

The logistic regression risk function is:

$R(w) = \frac{1}{N} \sum_{i=1}^{N} \ln\left(1 + \exp(-y_i \cdot w^T x_i)\right)$

This is the average log-loss. When $y_i \cdot w^T x_i$ is large and positive (correct prediction with high confidence), the loss for that sample is near zero. When it’s negative (wrong prediction), the loss grows linearly.

Gradient computation#

The gradient of the logistic loss with respect to $w$ is:

$\nabla R(w) = \frac{1}{N} \sum_{i=1}^{N} \frac{-y_i \cdot \exp(-y_i \cdot w^T x_i)}{1 + \exp(-y_i \cdot w^T x_i)} \cdot x_i$

The problem is that $\exp(-y_i \cdot w^T x_i)$ can overflow when the product is a large negative number. With 784 features and float128 arithmetic, the dot products get big fast in early iterations when the weights are moving a lot.

The fix is to split the computation based on the sign of the product:

def calculate_gradient(w):
    gradient = np.zeros((1, 784), dtype=np.float128)
    for i in range(N):
        xi = trainingData[i]
        yi = trainingLabels[i]
        product = yi * np.dot(w, xi.T)

        if product >= 0:
            scalar = (-yi / N) * (np.exp(-product) / (1 + np.exp(-product)))
        else:
            scalar = (-yi / N) * (1.0 / (1 + np.exp(product)))

        gradient += scalar * xi
    return gradient

python

When product >= 0, $\exp(-\text{product})$ is at most 1, so no overflow. When product < 0, we rewrite the fraction by multiplying numerator and denominator by $\exp(\text{product})$ , which gives $1 / (1 + \exp(\text{product}))$ . Now $\exp(\text{product})$ is also at most 1. Same value, no overflow either way.

Gradient descent#

Standard gradient descent. Initialize $w$ as zeros, iterate:

def gradient_descent(T, learning_rate):
    w = np.zeros((1, 784), dtype=np.float128)
    for t in range(T):
        grad = calculate_gradient(w)
        w = w - learning_rate * grad
    return w

python

No momentum, no adaptive learning rate, no mini-batches. Full-batch gradient descent with a fixed step size. It’s slow (iterating over all 6107 samples per step, per sample), but the assignment wasn’t about speed.

Learning rate tuning#

I tested 10 learning rates with $T = 1000$ iterations each:

Learning Rate	Training Risk	Validation Risk
1e-6	0.1381	0.1454
5e-6	0.1125	0.1287
1e-5	0.1033	0.1260
2e-5	0.0946	0.1263
2.5e-5	0.0919	0.1270
3e-5	0.1207	0.1711
3.5e-5	0.0870	0.1328
5e-5	0.0866	0.1516
1e-4	0.1314	0.2754
1e-3	3.3060	5.2962

Learning rate 1e-5 had the lowest validation risk at 0.126. The training risk keeps going down with larger learning rates (down to 0.087 at 3.5e-5), but the validation risk starts climbing back up after 1e-5. That gap between training and validation risk is overfitting. The model fits the training data better but generalizes worse.

At 1e-3 everything blows up. The step size is too large and gradient descent overshoots, oscillating instead of converging. The risk of 5.3 on validation is worse than random guessing.

Empirical risk decreasing over iterations — Empirical risk vs. iteration count for the best learning rate (1e-5). Monotonically decreasing, as expected for a convex loss with a small enough step size.

The risk curve is smooth and monotonically decreasing. Logistic loss is convex, so with a small enough learning rate you’re guaranteed to converge. No local minima to worry about.

Final model#

Using learning rate 1e-5 and $T = 1000$ :

Validation risk: 0.126
Test risk: 0.119

The test risk is slightly lower than the validation risk. Since I picked the learning rate based on validation performance, you’d expect validation to be slightly optimistic (we selected the model that happened to do best on validation), so the test risk being even lower is just noise from finite sample sizes. The two numbers are close enough that the model generalizes well.

For a linear model with no regularization on 784 features, this is a reasonable result. The logistic regression can only learn a linear decision boundary in the 784-dimensional pixel space, so there’s a ceiling on how well it can separate the two digit classes. The SVM models in the next assignment push that ceiling higher.