

# HMM Part-of-Speech Tagger with Baum-Welch

Implementing a Hidden Markov Model for POS tagging with supervised MLE, Viterbi decoding, and unsupervised Baum-Welch training.

## Abstract
Build a Hidden Markov Model for part-of-speech tagging. Train it two ways: supervised from a tagged corpus using maximum likelihood estimation, and unsupervised using the Baum-Welch (forward-backward) algorithm. Decode with Viterbi.
This lab was about implementing an HMM from scratch for POS tagging. The training data was 22,486 tagged sentences from pos_train.txt where each token looks like word/TAG. We tested on 1,100 sentences from pos_test.txt. The implementation supports both supervised training (just count tag bigrams and word-tag pairs) and unsupervised training (Baum-Welch EM).
## The HMM setup
An HMM for POS tagging has two probability matrices:
- Transition matrix $A$: $a_{ij} = P(t_j \mid t_i)$ for each pair of tags $t_i, t_j$
- Observation matrix $B$: $b_j(w) = P(w \mid t_j)$ for each word-tag pair $(w, t_j)$

We also need boundary symbols. `<s>` marks the start of a sentence and `</s>` marks the end. So $P(t \mid \texttt{<s>})$ gives the probability of starting a sentence with tag $t$, and $P(\texttt{</s>} \mid t)$ gives the probability of ending after tag $t$.
## Supervised training
In supervised mode, we have the gold tags, so training is just counting:

$$\hat{P}(t_j \mid t_i) = \frac{C(t_i, t_j)}{C(t_i)} \qquad \hat{P}(w \mid t) = \frac{C(w, t)}{C(t)}$$
Unknown words at test time get mapped to `<UNK>`, which is assigned a small epsilon probability (0.000001) under the most common tag. This is a rough heuristic, but defaulting to the most frequent tag is a reasonable fallback.
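As a sketch of the counting step (hypothetical names, assuming each sentence is a list of `(word, tag)` pairs, not the lab's exact code):

```python
from collections import defaultdict

def train_supervised(sentences):
    """MLE estimates from tagged sentences (lists of (word, tag) pairs).

    Returns nested dicts trans_prob[prev_tag][tag] and obs_prob[tag][word].
    A sketch with hypothetical names, not the lab's exact code.
    """
    trans_counts = defaultdict(lambda: defaultdict(int))
    obs_counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        prev_tag = '<s>'                        # sentence-start boundary
        for word, tag in sentence:
            trans_counts[prev_tag][tag] += 1
            obs_counts[tag][word] += 1
            prev_tag = tag
        trans_counts[prev_tag]['</s>'] += 1     # sentence-end boundary

    def normalize(counts):
        # Divide each row of counts by its total to get probabilities.
        return {ctx: {x: c / sum(row.values()) for x, c in row.items()}
                for ctx, row in counts.items()}

    return normalize(trans_counts), normalize(obs_counts)
```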
## Viterbi decoding
Viterbi finds the most likely tag sequence given the observations. The implementation uses negative log probabilities so we can add instead of multiply, and minimize instead of maximize. This avoids underflow on long sentences.
```python
def viterbi(self, words):
    # Initialization: start-of-sentence transition plus first emission.
    trellis = {}
    for tag in self.states:
        trellis[tag] = [
            self.get_log_prob(self.trans_prob, '<s>', tag),
            [tag]
        ]
        if words[0] in self.vocabulary:
            trellis[tag][0] += self.get_log_prob(
                self.obs_prob, tag, words[0])
        else:
            trellis[tag][0] += self.get_log_prob(
                self.obs_prob, tag, '<UNK>')
    new_trellis = {}
    for word in words[1:]:
        for cur_tag in self.states:
            cur_min_prob = float('inf')
            cur_min_path = None
            for prev_tag in self.states:
                prob = (trellis[prev_tag][0]
                        + self.get_log_prob(
                            self.trans_prob, prev_tag, cur_tag))
                if word in self.vocabulary:
                    prob += self.get_log_prob(
                        self.obs_prob, cur_tag, word)
                else:
                    prob += self.get_log_prob(
                        self.obs_prob, cur_tag, '<UNK>')
                if prob <= cur_min_prob:
                    cur_min_prob = prob
                    cur_min_path = trellis[prev_tag][1] + [cur_tag]
            new_trellis[cur_tag] = [cur_min_prob, cur_min_path]
        trellis = new_trellis
        new_trellis = {}
    # End-of-sentence transition
    cur_min_prob = float('inf')
    cur_min_path = None
    for tag in self.states:
        prob = (self.get_log_prob(self.trans_prob, tag, '</s>')
                + trellis[tag][0])
        if prob <= cur_min_prob:
            cur_min_prob = prob
            cur_min_path = trellis[tag][1]
    return cur_min_path
```

Each entry in the trellis stores the best (minimum) negative log probability and the corresponding tag path. At each word position, we iterate over all current tags and all previous tags to find the best predecessor. The `get_log_prob` helper returns -log(p) if the probability is positive, and infinity if it's zero.
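A minimal sketch of that helper, matching how it is called above (a nested probability dict, a context key, and an outcome key; written as a standalone function rather than the class method):

```python
import math

def get_log_prob(prob_table, context, outcome):
    """Return -log(p) for prob_table[context][outcome]; infinity if p is 0.

    Missing entries are treated as zero probability, so Viterbi never
    selects a path through an unseen transition or emission.
    """
    p = prob_table.get(context, {}).get(outcome, 0.0)
    return -math.log(p) if p > 0 else float('inf')
```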
## Unsupervised training: Baum-Welch
When you don’t have labeled data, you can use the Baum-Welch algorithm (a special case of EM) to estimate the HMM parameters. The idea is to iterate between computing expected counts from the current model (E-step) and re-estimating the parameters from those counts (M-step).
### Forward algorithm
The forward variable $\alpha_t(i)$ is the probability of producing the observations $o_1, \dots, o_t$ and being in state $i$ at time $t$: $\alpha_t(i) = P(o_1, \dots, o_t, q_t = i \mid \lambda)$.
```python
def _forward(self, observations):
    # Initialization: start-of-sentence transition plus first emission.
    trellis = [{}]
    for state in self.states:
        trellis[0][state] = (self.trans_prob["<s>"][state]
                             * self.obs_prob[state][observations[0]])
    # Recursion: sum over all predecessor states.
    for t in range(1, len(observations)):
        trellis.append({})
        for state in self.states:
            trellis[t][state] = sum(
                trellis[t-1][prev_state]
                * self.trans_prob[prev_state][state]
                * self.obs_prob[state][observations[t]]
                for prev_state in self.states)
    # End-of-sentence
    trellis.append({})
    trellis[-1]['</s>'] = sum(
        trellis[-2][s] * self.trans_prob[s]['</s>']
        for s in self.states)
    return trellis
```

The final entry, `trellis[-1]['</s>']`, gives the total probability of the observation sequence under the model: $P(O \mid \lambda)$.
### Backward algorithm
The backward variable $\beta_t(i)$ is the probability of producing the remaining observations $o_{t+1}, \dots, o_T$ given state $i$ at time $t$: $\beta_t(i) = P(o_{t+1}, \dots, o_T \mid q_t = i, \lambda)$.
```python
def _backward(self, observations):
    # Initialization: probability of ending the sentence from each state.
    trellis = [{}]
    for state in self.states:
        trellis[0][state] = self.trans_prob[state]['</s>']
    # Recursion, working right to left.
    for t in range(len(observations)-1, 0, -1):
        trellis.insert(0, {})
        for state in self.states:
            trellis[0][state] = sum(
                trellis[1][next_state]
                * self.trans_prob[state][next_state]
                * self.obs_prob[next_state][observations[t]]
                for next_state in self.states)
    # Start-of-sentence transition and first emission.
    trellis.insert(0, {})
    trellis[0]['<s>'] = sum(
        trellis[1][s]
        * self.trans_prob['<s>'][s]
        * self.obs_prob[s][observations[0]]
        for s in self.states)
    return trellis
```

### E-step
With the forward and backward values, we compute two sets of expected counts:
- $\xi_t(i, j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$: the expected number of transitions from state $i$ to state $j$ at time $t$
- $\gamma_t(i) = \dfrac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)}$: the expected number of times we're in state $i$ at time $t$
These get accumulated across all sentences in the training data.
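The per-sentence computation can be sketched as a standalone function (hypothetical name, not the lab's exact code). Note the off-by-one in the backward trellis: as returned by `_backward`, index 0 holds the `<s>` entry, so $\beta_t$ lives at `beta[t + 1]`:

```python
def expected_counts(states, trans_prob, obs_prob, observations, alpha, beta):
    """Compute gamma (occupancy) and xi (transition) expectations for one
    sentence, from the trellises returned by _forward and _backward.
    A sketch, not the lab's exact code."""
    total = alpha[-1]['</s>']          # P(O | model) from the forward pass
    T = len(observations)
    # gamma[t][s] = alpha_t(s) * beta_t(s) / P(O); beta is offset by one.
    gamma = [{s: alpha[t][s] * beta[t + 1][s] / total for s in states}
             for t in range(T)]
    # xi[t][(i, j)] = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O)
    xi = [{(i, j): (alpha[t][i]
                    * trans_prob[i][j]
                    * obs_prob[j][observations[t + 1]]
                    * beta[t + 2][j]) / total
           for i in states for j in states}
          for t in range(T - 1)]
    return gamma, xi
```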
### M-step
The re-estimation updates are:

$$\hat{a}_{ij} = \frac{\sum_{t} \xi_t(i, j)}{\sum_{t} \gamma_t(i)} \qquad \hat{b}_j(w) = \frac{\sum_{t\,:\,o_t = w} \gamma_t(j)}{\sum_{t} \gamma_t(j)}$$
After re-estimation, we reset the accumulators and run another epoch. The implementation runs 2 epochs of Baum-Welch over the full 22,486 training sentences with a tqdm progress bar.
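A sketch of the update, assuming the expected counts have already been summed over all sentences into nested dicts (hypothetical accumulator names):

```python
def m_step(xi_total, gamma_total, gamma_by_word):
    """Re-estimate HMM parameters from accumulated expected counts.

    xi_total[i][j]      : sum over sentences and time of xi_t(i, j)
    gamma_total[i]      : sum over sentences and time of gamma_t(i)
    gamma_by_word[j][w] : sum of gamma_t(j) at positions where word w occurs
    Hypothetical names; a sketch of the update, not the lab's exact code.
    """
    trans_prob = {i: {j: xi_total[i][j] / gamma_total[i]
                      for j in xi_total[i]}
                  for i in xi_total}
    obs_prob = {j: {w: gamma_by_word[j][w] / gamma_total[j]
                    for w in gamma_by_word[j]}
                for j in gamma_by_word}
    return trans_prob, obs_prob
```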
## Debugging with the Eisner ice cream problem
Before running Baum-Welch on the full POS data, I tested the implementation on the classic Eisner ice cream problem. This is a toy HMM with two states (Hot and Cold) and three observations (ice cream counts 1, 2, 3). The transition and emission probabilities are known, and you can verify the forward-backward computations against a spreadsheet.
The Eisner version hardcodes the parameters:
- $P(\text{H} \mid \texttt{<s>}) = 0.5$, $P(\text{H} \mid \text{H}) = 0.8$, $P(\text{H} \mid \text{C}) = 0.1$
- $P(1 \mid \text{H}) = 0.1$, $P(2 \mid \text{H}) = 0.2$, $P(3 \mid \text{H}) = 0.7$
- $P(1 \mid \text{C}) = 0.7$, $P(2 \mid \text{C}) = 0.2$, $P(3 \mid \text{C}) = 0.1$
Running 13 iterations of Baum-Welch on a fixed sequence of 33 ice cream observations let me check that the forward-backward implementation was producing the right alpha and beta values before scaling up to real data.
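One cheap sanity check that works for any parameter setting: the total sequence probability computed from the forward pass must equal the one computed from the backward pass. A self-contained sketch on a tiny two-state HMM (made-up parameters for illustration, not the Eisner values):

```python
def forward_total(states, trans, obs, seq):
    """P(seq) via the forward recursion, with <s>/</s> boundaries."""
    alpha = {s: trans['<s>'][s] * obs[s][seq[0]] for s in states}
    for o in seq[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * obs[s][o]
                 for s in states}
    return sum(alpha[s] * trans[s]['</s>'] for s in states)

def backward_total(states, trans, obs, seq):
    """P(seq) via the backward recursion; must match forward_total."""
    beta = {s: trans[s]['</s>'] for s in states}
    for o in reversed(seq[1:]):
        beta = {s: sum(trans[s][n] * obs[n][o] * beta[n] for n in states)
                for s in states}
    return sum(trans['<s>'][s] * obs[s][seq[0]] * beta[s] for s in states)
```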
## Putting it together
The supervised pipeline is simple: read the tagged corpus, count transitions and emissions, run Viterbi on test data, compute word-level accuracy. The unsupervised pipeline initializes all counts to 1 (uniform), runs Baum-Welch for 2 epochs, then decodes with Viterbi.
The code also supports a case-insensitive mode that lowercases all words before training and evaluation. This can help with unknown word handling since capitalized words at the start of sentences get matched to their lowercase forms seen elsewhere.
The evaluation computes word-level accuracy by comparing the predicted tag sequence against the gold tags for each test sentence.
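That evaluation is just a ratio of matching positions; a sketch (hypothetical function name, assuming predicted and gold tag sequences have equal lengths):

```python
def word_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag.

    predicted and gold are parallel lists of tag sequences, one per
    sentence. A sketch of the evaluation, not the lab's exact code.
    """
    correct = total = 0
    for pred_tags, gold_tags in zip(predicted, gold):
        correct += sum(p == g for p, g in zip(pred_tags, gold_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0
```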