TheAnig

Back

Weighted finite-state transducer diagram for speech recognition
Weighted finite-state transducer diagram for speech recognition
Weighted finite-state transducer diagram for speech recognition

Abstract#

Build finite-state transducers for two very different problems: recognizing poker hands from a sorted sequence of cards, and decoding spoken digits from a phone sequence through WFST composition.

This lab had two parts that both used finite-state transducers, but for completely different domains. The first problem was pattern matching: given a sorted hand of five playing cards, classify it as four-of-a-kind, three-of-a-kind, or full house. The second was speech recognition: build a cascade of weighted FSTs that maps phone sequences to digit words.

Part A: Poker hand recognition#

The setup is simple. You have a hand of five cards, sorted by rank from lowest to highest. The card vocabulary is EIGHT through ACE (seven ranks total, no suits). The FST reads the five cards as input symbols and outputs the hand classification on the final transition.

Four-of-a-kind#

For four-of-a-kind, there are only two patterns to handle for each rank. Either the four matching cards come first and a higher kicker follows (like EIGHT EIGHT EIGHT EIGHT NINE), or a lower card comes first and four matching cards follow (EIGHT NINE NINE NINE NINE). The cards are sorted, so you don’t need to worry about arbitrary orderings.

Each path through the FST consumes exactly five cards. The output symbol is - (epsilon) for the first four transitions and FOUR-OF-A-KIND on the fifth. The generated FST has 91 lines.

Three-of-a-kind and full house#

This one is more involved because there are more patterns to enumerate. For three-of-a-kind, you need to handle three cases depending on where the triple sits in the sorted hand:

  • X Y Y Y Z (one card before the triple, one after)
  • Y Y Y X Z (triple first, then two different cards)
  • X Z Y Y Y (two different cards, then the triple)

A full house is when the remaining two cards form a pair. So patterns like X X Y Y Y or Y Y Y X X. The script uses itertools.combinations to enumerate the valid card arrangements for each base rank.

The three-of-a-kind FST uses two accepting states: state 1 for THREE-OF-A-KIND and state 2 for FULL-HOUSE. The generated file is 587 lines, much larger than the four-of-a-kind one because there are more combinations to cover.

Part B: Isolated digit speech recognition#

The second problem is a weighted finite-state transducer pipeline for recognizing spoken digits. The idea is to decompose the recognition problem into three separate transducers and compose them together: a duration model, a pronunciation dictionary, and a language model.

The final pipeline is: dur ∘ dict ∘ lm

Duration transducer#

The duration model maps repeated phone observations to phone labels. The phone inventory has 21 phones (including SIL for silence):

ah, ao, ax, ay, eh, ey, f, ih, iy, k, n, ow, r, s, SIL, t, th, uw, v, w, z
plaintext

For each phone, the transducer enforces a minimum duration of 3 frames. A phone has to appear at least three times in a row before it gets emitted as output. This is a simple way to model the fact that speech sounds have some minimum duration.

The first two occurrences of a phone produce epsilon output. The third occurrence emits the phone symbol. After that, a self-loop at the third state accepts additional repetitions without output. Transitions to any other phone’s first state allow phone changes. The generated file is 568 lines with 64 states.

Dictionary transducer#

The dictionary transducer maps phone sequences to words. The pronunciation dictionary has 11 entries:

ONE     w ah n
TWO     t uw
THREE   th r iy
FOUR    f ao r
FIVE    f ay v
SIX     s ih k s
SEVEN   s eh v ax n
EIGHT   ey t
NINE    n ay n
ZERO    z iy r ow
OH      ow
plaintext

The transducer reads phones as input. For each word, it creates a path through the phone sequence and emits the word label on the final phone. SIL (silence) transitions loop back to the start state between words.

Composing the pipeline#

The language model is a simple FSA that accepts any single digit word. The full recognition pipeline composes all three:

  1. Duration consumes raw phone frames and outputs phone labels (with minimum duration enforced)
  2. Dictionary consumes phone labels and outputs word labels
  3. Language model constrains which word sequences are valid

Given a test utterance represented as a sequence of phone observations, the composed transducer finds the best path and outputs the recognized digit. The test set had 22 utterances.

The composition was done using external FST tools (OpenFst). The Python scripts just generate the text-format FST files that get compiled and composed outside of Python.

Finite-State Transducers for Cards and Speech
https://theanig.dev/blog/slp-lab1-finite-state-transducers
Author Anirudh Ganesh
Published at April 25, 2020