Mel-frequency cepstral coefficients spectrogram with dynamic time warping alignment

Abstract#

Extract MFCC features from audio recordings of spoken digits and classify them using dynamic time warping. Compare MFCC against raw log-spectrum features.

This was the final lab for the speech class, and it brought together a lot of the signal processing ideas from earlier in the course. The task was template-based speech recognition: given a set of template recordings (one per digit, spoken by speaker A), classify new recordings (same digits, spoken by speaker B) by finding the closest template. The feature extraction pipeline goes from raw WAV audio to mel-frequency cepstral coefficients, and the matching uses dynamic time warping to handle the fact that two people say the same word at different speeds.

The mel scale#

Human hearing doesn’t perceive frequency linearly. The difference between 100 Hz and 200 Hz sounds like a big jump, but the difference between 5000 Hz and 5100 Hz is barely noticeable. The mel scale captures this by warping the frequency axis so that equal intervals on the mel scale correspond to equal perceived pitch differences:

$$m = 1127 \ln\left(1 + \frac{f}{700}\right)$$

And the inverse:

$$f = 700 \left(e^{m/1127} - 1\right)$$

In the implementation, these are just two short functions:

```python
import math

def mel(f):
    return 1127 * math.log(1 + f / 700)

def inv_mel(m):
    return 700 * (math.exp(m / 1127) - 1)
```

The mel scale matters because we want our filterbanks (explained below) to be spaced according to how humans actually hear, not according to raw physics.
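As a quick sanity check on those two functions: the constant 1127 is chosen so that 1000 Hz lands close to 1000 mel, and the two functions should round-trip exactly (up to floating-point error). A small illustrative check:

```python
import math

def mel(f):
    return 1127 * math.log(1 + f / 700)

def inv_mel(m):
    return 700 * (math.exp(m / 1127) - 1)

print(round(mel(1000)))               # → 1000: the scale's reference point
print(abs(inv_mel(mel(440)) - 440))   # round-trip error, effectively zero
```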

Feature extraction pipeline#

The feature extraction starts with a raw WAV file and produces a sequence of 36-dimensional feature vectors, one per frame. Here’s how each frame gets processed.

Windowing and FFT#

Audio is a continuous signal, but we need to analyze it in small chunks. We slide a 512-sample window across the waveform with a shift of 20 samples between frames. Each window gets multiplied by a Hamming function to smooth the edges (this reduces spectral leakage from the FFT), and then we take the FFT to get the frequency content:

```python
import numpy as np

# taper the frame, then measure energy at each FFT bin
hamming = np.hamming(len(window))
spectrum = np.fft.fft(window * hamming)
power_spectrum = spectrum.real ** 2 + spectrum.imag ** 2
```

The power spectrum tells us how much energy is at each frequency.
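The framing step itself isn't shown in the lab snippets. Here is a minimal sketch of it, using the 512-sample window and 20-sample shift described above; `frames` and `frame_power_spectrum` are illustrative helper names, not the lab's own:

```python
import numpy as np

def frames(x, winsize=512, shift=20):
    """Slide a winsize-sample window across x with a hop of `shift` samples."""
    n = 1 + (len(x) - winsize) // shift
    return np.stack([x[i * shift : i * shift + winsize] for i in range(n)])

def frame_power_spectrum(window):
    # Hamming taper reduces spectral leakage before the FFT
    spectrum = np.fft.fft(window * np.hamming(len(window)))
    return spectrum.real ** 2 + spectrum.imag ** 2

x = np.random.randn(2000)                # stand-in for a WAV signal
f = frames(x)
print(f.shape)                           # → (75, 512): 1 + (2000 - 512) // 20 frames
print(frame_power_spectrum(f[0]).shape)  # → (512,)
```

With a 20-sample hop the frames overlap heavily, which is what makes the later delta features meaningful: neighboring frames describe nearly the same instant.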

Filterbanks#

Next, we apply a bank of triangular filters to the power spectrum. The filters are arranged in two regions:

  • Below 1000 Hz: 10 linearly spaced filters, each 100 Hz wide
  • Above 1000 Hz: 17 mel-spaced filters, spread from 1000 Hz to the Nyquist frequency

The split is intentional. Below 1000 Hz, human hearing is roughly linear, so linear spacing works fine. Above 1000 Hz, perception becomes more logarithmic, and mel spacing matches that.

Each triangular filter gets multiplied element-wise with the power spectrum, and then we take the sum of log powers in that band. The result is 27 values (10 linear + 17 mel) that represent a compressed version of the spectrum shaped to match human perception.

```python
# `unit` scales a frequency in Hz to an FFT-bin index
linear_bounds = np.arange(0, linear_cutoff + 1, 100) * unit
log_mel_bounds = np.linspace(mel(linear_cutoff), mel(sample_freq / 2), num=18)
log_bounds = np.array([inv_mel(m) * unit for m in log_mel_bounds])

all_bounds = np.concatenate([linear_bounds, log_bounds[1:]])
mel_filterbanks = []
for i in range(len(all_bounds) - 2):
    upper = int(all_bounds[i + 2])
    lower = int(all_bounds[i])
    # overlapping triangular filter spanning two adjacent band boundaries
    filter_bank = signal.triang(upper - lower)
    power_spec = filter_bank * power_spectrum[lower:upper]
    mel_filterbanks.append(np.sum(np.log(power_spec)))
```

DCT and cepstral coefficients#

The log-filterbank energies are correlated with each other because neighboring filters overlap. The Discrete Cosine Transform decorrelates them, and we keep only the first 12 coefficients:

```python
import scipy.fftpack

def mfcc(window, sample_freq):
    mel_filterbanks = _log_spectrum(window, sample_freq)
    cepstrum = scipy.fftpack.dct(mel_filterbanks)
    return cepstrum[:12]
```

These 12 numbers are the MFCCs for a single frame. They encode the shape of the spectral envelope, which is closely related to the vocal tract configuration that produced the sound. The lower-order coefficients capture broad spectral shape (like whether the mouth is open or closed), while higher-order ones capture finer detail.

Why only 12? The DCT produces as many coefficients as input values, but the higher coefficients represent fast variations in the spectrum that correspond more to pitch harmonics than to the overall spectral shape. For speech recognition, the envelope is what matters, so we throw away the fine structure.
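A toy experiment makes the point concrete. Below, a made-up "log spectrum" is a smooth envelope plus a fast harmonic ripple; truncating the DCT to 12 coefficients and inverting reconstructs something much closer to the envelope than to the full signal. (The signal and all names here are invented for illustration.)

```python
import numpy as np
import scipy.fftpack

# A toy log-spectrum over 27 bands: slow envelope + fast ripple
envelope = np.cos(np.linspace(0, np.pi, 27))
ripple = 0.3 * np.cos(np.linspace(0, 20 * np.pi, 27))
log_spec = envelope + ripple

cep = scipy.fftpack.dct(log_spec, norm='ortho')
cep[12:] = 0                                    # keep only the first 12 coefficients
smoothed = scipy.fftpack.idct(cep, norm='ortho')

# The truncated reconstruction tracks the envelope, not the ripple
print(np.abs(smoothed - envelope).max())
print(np.abs(log_spec - envelope).max())        # the ripple's amplitude, 0.3
```

The first residual comes out far smaller than the second: the ripple's energy lives in high DCT coefficients, so zeroing them strips it out while leaving the broad shape intact.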

Delta and double-delta features#

Static MFCCs capture what the spectrum looks like at a single instant, but speech is all about how sounds change over time. The transition from one phoneme to the next carries a lot of the information we use to distinguish words.

Delta features capture this by computing the slope of each MFCC across neighboring frames:

$$\Delta_t = \frac{\sum_{n=1}^{N} n \,(c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^2}$$

where $c_t$ is the cepstral vector at frame $t$ and $N$ is the window size (we used $N = 2$). Double-delta features are just the deltas of the deltas, capturing acceleration.

```python
def delta(feat, N):
    NUMFRAMES = len(feat)
    denominator = 2 * sum([i**2 for i in range(1, N+1)])
    delta_feat = np.empty_like(feat)
    # edge-pad so the first and last frames have neighbors to difference
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    for t in range(NUMFRAMES):
        delta_feat[t] = np.dot(
            np.arange(-N, N+1), padded[t : t+2*N+1]) / denominator
    return delta_feat
```

Stacking the 12 MFCCs, 12 deltas, and 12 double-deltas gives a 36-dimensional feature vector per frame. The get_mfcc_vectors function does this concatenation:

```python
def get_mfcc_vectors(x, sf, winsize, shift):
    features = np.array(
        _get_transformed_vector(x, sf, winsize, shift, mfcc))
    delta_features = delta(features, 2)
    double_delta_features = delta(delta_features, 2)
    return np.concatenate(
        (features, delta_features, double_delta_features), axis=1)
```
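A quick shape check of that stacking, with a compact restatement of `delta` so the snippet stands alone (the random input is just a stand-in for real MFCC frames):

```python
import numpy as np

def delta(feat, N):
    denominator = 2 * sum(i ** 2 for i in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    weights = np.arange(-N, N + 1)
    return np.array([np.dot(weights, padded[t : t + 2 * N + 1]) / denominator
                     for t in range(len(feat))])

feats = np.random.randn(50, 12)          # 50 frames of 12 MFCCs
d = delta(feats, 2)
dd = delta(d, 2)
stacked = np.concatenate((feats, d, dd), axis=1)
print(stacked.shape)                     # → (50, 36)
```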

Dynamic time warping#

Now we have two sequences of feature vectors: one from the template recording and one from the test recording. Even though both might say “seven”, they’ll have different lengths because people speak at different rates. We can’t just compare them frame-by-frame.

Dynamic time warping finds the optimal alignment between two sequences by warping the time axis. It builds a cost matrix $D$ where $D[i][j]$ is the minimum cumulative distance to align the first $i$ frames of sequence $X$ with the first $j$ frames of sequence $Y$.

The recurrence is:

$$D[i][j] = d(x_i, y_j) + \min\bigl(D[i-1][j],\ D[i-1][j-1],\ D[i][j-1]\bigr)$$

where $d(x_i, y_j)$ is the Euclidean distance between frame $i$ of $X$ and frame $j$ of $Y$. The three options in the min correspond to three moves: advance only in $X$ (vertical), advance in both (diagonal), or advance only in $Y$ (horizontal).

The final DTW cost $D[N-1][M-1]$ tells us how similar the two utterances are. To classify a test recording, we compute the DTW cost against every template and pick the one with the lowest cost.
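The recurrence translates almost directly into code. This is a sketch rather than the lab's implementation; `dtw_cost` and `classify` are illustrative names, and the toy templates at the end just show the nearest-template rule in action:

```python
import numpy as np

def dtw_cost(X, Y):
    """Minimum cumulative Euclidean alignment cost between two
    (frames x dims) feature sequences."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])
            # vertical, diagonal, horizontal moves from the recurrence
            D[i, j] = d + min(D[i - 1, j], D[i - 1, j - 1], D[i, j - 1])
    return D[n, m]

def classify(test, templates):
    """Return the label of the template with the lowest DTW cost."""
    return min(templates, key=lambda label: dtw_cost(test, templates[label]))

templates = {'short': np.zeros((5, 2)), 'long': np.ones((9, 2))}
test = np.ones((7, 2)) + 0.01     # near the 'long' template, different length
print(classify(test, templates))  # → long
```

The extra row and column of infinities act as a boundary condition, so the first frame of each sequence is forced to align with the first frame of the other.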

Classification results#

The digit set had 11 recordings from speaker A (used as templates) and 11 from speaker B (used as test). The digits were the standard telephone digits: 0 through 9 plus “oh” (the letter O, commonly used as a spoken zero).

MFCC features (36-dimensional)#

MFCC got 10 out of 11 correct. The one mistake was classifying “4b” as “oh-a”. According to the writeup, both “4b” and “oh-a” had similar spectral patterns: one high-energy region followed by a tail of decreasing energy. The template for “4a” had two distinct high-energy regions, which is a better match for how “four” actually sounds (the “f” onset and the “or” vowel are quite different in energy). Speaker B apparently pronounced “four” in a way that collapsed those two regions.

Log-spectrum features (27-dimensional)#

Log-spectrum, which skips the DCT step and uses the raw log-filterbank energies, got 11 out of 11 correct. This was a bit surprising since MFCCs are generally considered the better feature for speech recognition. The likely explanation is that with only 11 test cases, one misclassification swings the accuracy by 9 percentage points. On a larger dataset, MFCCs would probably do better.

The DTW scores also showed a clear separation between matching and non-matching pairs. For MFCC, matching pairs had costs around 400,000-500,000 while non-matching pairs were around 600,000-700,000. For log-spectrum, matching pairs scored 50,000-60,000 and non-matching pairs scored 80,000-90,000. The absolute numbers differ because the feature dimensionality is different, but both representations produce a usable gap between correct and incorrect matches.

Cross-speaker performance#

As an extension, we tried using our own voice recordings against the speaker A templates. This went poorly. Using one team member’s recordings, only digits 5 and 6 were correctly classified using MFCC. The speaker A templates were all from a female speaker, and we suspected the gender mismatch was a big part of the problem. Different vocal tract lengths produce systematically different spectral shapes, and a template-based system with raw Euclidean distance has no way to normalize for that.

Using all available recordings from multiple speakers as templates didn’t help much either. With that setup, only 1 out of 11 digits was correctly classified, and all predictions pointed to templates from the same speaker (Matias). The system was essentially finding the speaker whose voice was closest in raw feature space, rather than finding the correct digit.

This is a known limitation of template-based DTW recognition. Real speech recognition systems use statistical models (like HMMs or neural networks) that can learn speaker-independent representations. Template matching works fine when the training and test speakers are the same person (or very similar), but it falls apart when there’s significant acoustic variability.

The full pipeline#

Putting it all together, the system works as a command-line tool:

```bash
# MFCC features
./main.py --templates digits/*a.wav --predict digits/*b.wav --transformation 0

# Log-spectrum features
./main.py --templates digits/*a.wav --predict digits/*b.wav --transformation 1
```

For each test file, it computes the feature representation, runs DTW against every template, and prints the template with the lowest cost as the prediction. The whole thing runs in a few seconds for 11 templates and 11 test files, though the $O(NM)$ DTW cost would become a bottleneck with more data.

Speech Recognition with MFCC and DTW
https://theanig.dev/blog/slp-lab3-mfcc-and-dtw
Author Anirudh Ganesh
Published May 25, 2020