NCC Template Matching, Stereo Depth, and KNN • TheAnig

Abstract#

Three problems. The first implements normalized cross-correlation template matching to find Stampy the elephant in a crowded Simpsons scene, using FFT convolution and integral images for speed. The second computes a stereo depth map from a pair of images by brute-force NCC at every pixel (four nested Python for-loops; it was slow). The third is k-nearest neighbors classification on a 2D dataset using sklearn.

Part 1: Template matching (NCC)#

The search image is a crowded scene from The Simpsons, and the template is a crop of “Stampy” the elephant. NCC works by sliding the template across the image and computing normalized cross-correlation at each position. High NCC = good match.

Load both and convert the search image to float64:

search = imread('./data/search.png')
template = imread('./data/template.png')
image = np.array(search, dtype=np.float64, copy=False)

python

Pad the image with zeros by the template shape on all sides so we can slide the template without going out of bounds.

pad_width = tuple((width, width) for width in template.shape)
image = np.pad(image, pad_width=pad_width, mode='constant', constant_values=0)

python

Efficient window sums#

Computing NCC at every position requires the sum and sum-of-squares of pixels in each template-sized window. I used cumulative sums along each axis (the integral image idea, extended to 3D since the image has color channels).

def calc_window_sum(image, window_shape):
    window_sum = np.cumsum(image, axis=0)
    window_sum = (window_sum[window_shape[0]:-1]
                  - window_sum[:-window_shape[0] - 1])
    window_sum = np.cumsum(window_sum, axis=1)
    window_sum = (window_sum[:, window_shape[1]:-1]
                  - window_sum[:, :-window_shape[1] - 1])
    window_sum = np.cumsum(window_sum, axis=2)
    window_sum = (window_sum[:, :, window_shape[2]:-1]
                  - window_sum[:, :, :-window_shape[2] - 1])
    return window_sum

python

Three passes of cumsum-and-subtract, one per axis.

image_window_sum = calc_window_sum(image, template.shape)
image_window_sum2 = calc_window_sum(image ** 2, template.shape)

python

Cross-correlation via FFT#

Cross-correlation in spatial domain is convolution with a flipped kernel. fftconvolve does it in the frequency domain, faster for large templates.

template_mean = template.mean()
template_volume = np.prod(template.shape)
template_ssd = np.sum((template - template_mean) ** 2)

xcorr = fftconvolve(image, template[::-1, ::-1, ::-1],
                    mode="valid")[1:-1, 1:-1, 1:-1]

python

[::-1, ::-1, ::-1] flips the template along all three axes, and the [1:-1, ...] trims an extra border pixel from the valid output.

NCC formula#

$\text{NCC} = \frac{\sum_{x,y} (I(x,y) - \bar{I}_w)(T(x,y) - \bar{T})}{\sqrt{\left(\sum I_w^2 - \frac{(\sum I_w)^2}{N}\right) \cdot \sum (T - \bar{T})^2}}$

The numerator and denominator are computed over the whole image at once:

numerator = xcorr - image_window_sum * template_mean

denominator = image_window_sum2
image_window_sum = np.multiply(image_window_sum, image_window_sum)
image_window_sum = np.divide(image_window_sum, template_volume)
denominator -= image_window_sum
denominator *= template_ssd
denominator = np.maximum(denominator, 0)  # clamp before sqrt
denominator = np.sqrt(denominator)

python

Division only happens where the denominator is above machine epsilon. Everything else stays zero.

response = np.zeros_like(xcorr, dtype=np.float64)
mask = denominator > np.finfo(np.float64).eps
response[mask] = numerator[mask] / denominator[mask]

python

Then crop the response back to the original image dimensions:

slices = []
for i in range(template.ndim):
    d0 = template.shape[i] - 1
    d1 = d0 + image_shape[i] - template.shape[i] + 1
    slices.append(slice(d0, d1))
result = response[tuple(slices)]

python

Finding the top matches#

np.argpartition is faster than a full sort when you only need the top N:

def largest_indices(ary, n):
    """Returns the n largest indices from a numpy array."""
    flat = ary.flatten()
    indices = np.argpartition(flat, -n)[-n:]
    indices = indices[np.argsort(-flat[indices])]
    return np.unravel_index(indices, ary.shape)

python

I tested the 1st, 2nd, 5th, 10th, 100th, and 500th best matches. One thing that bit me here:

ij = np.unravel_index(np.argmax(result), result.shape)
_, x, y = ij[::-1]
# IMPORTANT TO SWITCH ROWS AND COLUMNS
# I've LOST 5 points this semester because of this

python

NumPy’s unravel_index returns (row, col) order but matplotlib’s Rectangle and image indexing expect (x, y). I kept forgetting to swap them and losing points on earlier assignments. By HW9 I’d finally burned this into my brain enough to leave myself an all-caps comment.

Template matching results for Stampy in the Simpsons scene — Left column: the detected region cropped from the search image. Right column: full image with a red rectangle at the match location. Rows correspond to the 1st, 2nd, 5th, 10th, 100th, and 500th closest matches.

The top 1 through 10 matches all land on Stampy almost pixel-for-pixel. The 2nd and 5th closest are only off by 1-2 pixels, basically indistinguishable. By the 100th match, it drifts too far left: you can see more of John Swartzwelder (the guy in the suit) and Brandine (the lady on the right) has disappeared from the crop. By the 500th match, we’re closer to Homer than to Stampy. Way off.

Part 2: Stereo depth via template matching#

Given a left and right image of the same scene (grayscale stereo pair), compute a depth map. Closer objects have larger horizontal disparity between the two views, so if you can measure that disparity per-pixel, you get depth.

left = imread('./data/left.png')
right = imread('./data/right.png')

python

Brute force: for every pixel in the left image, extract a window around it, slide that window horizontally across the right image (up to max_offset pixels), and compute NCC at each offset. Whichever offset gives the highest NCC is the estimated disparity.

def stereo_match(left_img, right_img, kernel=11, max_offset=50):
    w, h = left_img.shape
    depth = np.zeros((w, h), np.uint8)
    depth.shape = h, w
    kernel_half = kernel // 2
    offset_adjust = 255 / max_offset

    for y in range(kernel_half, h - kernel_half):
        for x in range(kernel_half, w - kernel_half):
            best_offset = 0
            prev_ncc = float("-inf")

            for offset in range(max_offset):
                ncc = 0
                lwindow = left[y-kernel_half:y+kernel_half,
                               x-kernel_half:x+kernel_half]
                rwindow = right[y-kernel_half:y+kernel_half,
                                x-kernel_half:x+kernel_half]
                lwindow_mean = lwindow.mean()
                lwindow_std = np.std(lwindow)
                rwindow_mean = rwindow.mean()
                rwindow_std = np.std(rwindow)

                for v in range(-kernel_half, kernel_half):
                    for u in range(-kernel_half, kernel_half):
                        ncc_temp = ((int(left[y+v, x+u]) - lwindow_mean)
                                    * (int(right[y+v, (x+u) - offset])
                                       - rwindow_mean))
                        if lwindow_std == 0 or rwindow_std == 0:
                            ncc_temp = 0
                        else:
                            ncc_temp /= lwindow_std * rwindow_std
                        ncc += ncc_temp

                if ncc > prev_ncc:
                    prev_ncc = ncc
                    best_offset = offset

            depth[y, x] = best_offset * offset_adjust

    return depth

python

Four nested Python for-loops. It ran, but it was slow. I probably should have vectorized the inner loops with NumPy, or at least the kernel summation. But it did produce a depth map.

Stereo depth map — Depth map computed from the stereo pair. Brighter pixels indicate larger disparity (closer to the camera).

You can make out the shape of a person. Background is darker (smaller disparity, farther away), the figure is brighter.

Part 3: k-Nearest Neighbors#

2D points with class labels, two classes. Train on one set, predict on another.

train = np.loadtxt('./data/train.txt')
test = np.loadtxt('./data/test.txt')
X_train, y_train = train[:, 0:2], train[:, 2]
X_test, y_test = test[:, 0:2], test[:, 2]

python

I used sklearn.neighbors.KNeighborsClassifier and tested $k = 1, 5, 11, 15$ :

neighbours = [1, 5, 11, 15]

f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', dpi=150)
for n in neighbours:
    idx = neighbours.index(n)
    model = KNeighborsClassifier(n_neighbors=n)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

    axarr[idx//2, idx%2].scatter(X_test[:, 0], X_test[:, 1],
                                  c=y_pred, cmap=cmap_bold, s=1)
    axarr[idx//2, idx%2].scatter(
        X_test[y_pred != y_test, 0],
        X_test[y_pred != y_test, 1],
        c='#00FF00', edgecolor='k', cmap=cmap_bold, s=10)
    axarr[idx//2, idx%2].set_title(
        f"$k$={n}, Acc={round(model.score(X_test, y_test)*100, 2)}%")
plt.show()

python

KNN classification results — Scatter plots for k=1, 5, 11, 15. Red and blue are predicted classes. Green dots with black edges are misclassifications.

Accuracies: $k=1$ is 96.77%, $k=5$ is 96.33%, $k=11$ is 96.27%, $k=15$ is 96.9%.

The errors are all at the boundary between the two classes, where the nearest neighbors are a mix of both. Accuracy dips going from $k=1$ to $k=11$ , then jumps back up at $k=15$ . I think at small $k$ values, adding neighbors pulls in wrong-class points right at the boundary. At $k=15$ you’re sampling enough of the local area that the majority vote is more representative.

Not much code to write for this part since sklearn does all the work.