BiLSTM-CNN-CRF for Sequence Labeling: A Tutorial

Background#

Named entity recognition and POS tagging used to require a lot of hand-crafted features: orthographic patterns, gazetteers, suffix lists, capitalization rules. The Ma and Hovy paper from ACL 2016, End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF ¹, removed almost all of that. Their architecture combines three components:

a character-level CNN that captures morphological patterns like prefixes and suffixes,
a word-level BiLSTM that consumes the concatenation of GloVe embeddings and the CNN output, and
a CRF layer on top that enforces tagging constraints (for example, an I-ORG tag cannot directly follow an I-PER tag).

The reported numbers are 97.55% accuracy on Penn Treebank WSJ POS tagging and 91.21% F1 on CoNLL 2003 NER. Both were SOTA at the time of publication.

Why I picked it for reproducibility#

I presented this work at the Enabling Reproducibility in Machine Learning workshop at ICML 2018. The reason I picked this particular paper: it touches every component of a modern sequence-labeling pipeline (character encoder, word encoder, structured output layer), the original results matter to a lot of downstream work, and the official implementation was hard to run on contemporary PyTorch. Reproducing it forces you to make every step concrete, which is the point of a reproducibility exercise.

My implementation is at TheAnig/NER-LSTM-CNN-Pytorch ↗ ². The rest of this post walks through what’s in it.

What the workshop discussion converged on, repeatedly: a public GitHub repository is necessary but not sufficient for reproducibility. The harder requirements are pinned hyperparameters that actually match the paper, a setup that runs on a CPU as well as a GPU, and documentation that explains the small decisions (preprocessing, initialization, batching) that the paper itself usually doesn’t spell out.

Tutorial#

Data Preparation#

The dataset is CoNLL 2003, tagged for four entity types: PERSON, LOCATION, ORGANIZATION, MISC.

The dataset uses the BIO tagging scheme:

B-TYPE indicates the beginning of an entity of type TYPE.
I-TYPE indicates the continuation of the entity.
O indicates that the word is not part of any entity.

For example:

U.N.         NNP  I-NP  I-ORG
official     NN   I-NP  O
Ekeus        NNP  I-NP  I-PER
heads        VBZ  I-VP  O
for          IN   I-PP  O
Baghdad      NNP  I-NP  I-LOC

plaintext

The paper preprocesses by replacing all digits with 0. This collapses numeric tokens that would otherwise blow up the vocabulary without carrying semantic content for NER.

def zero_digits(s):
    return re.sub('\\d', '0', s)

def load_sentences(path, zeros):
    sentences = []
    sentence = []
    for line in open(path, 'r', encoding='utf-8'):
        line = zero_digits(line.rstrip()) if zeros else line.rstrip()
        if not line:
            if len(sentence) > 0:
                sentences.append(sentence)
                sentence = []
        else:
            word = line.split()
            assert len(word) >= 2
            sentence.append(word)
    return sentences

python

To match the paper we convert BIO to BIOES, which adds two more tags:

E-TYPE for the end of an entity.
S-TYPE for single-token entities.

def iob_iobes(tags):
    """
    Convert BIO tagging to BIOES.
    """
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            new_tags.append(tag)
        elif tag.split('-')[0] == 'B':
            if i + 1 != len(tags) and \
               tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('B-', 'S-'))
        elif tag.split('-')[0] == 'I':
            if i + 1 < len(tags) and \
                    tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('I-', 'E-'))
        else:
            raise Exception('Invalid IOB format!')
    return new_tags

python

Mappings and Word Embeddings#

Words, characters, and tags get mapped to integer IDs so the whole pipeline runs as tensor lookups. For word embeddings, we use the 100-dimensional GloVe vectors trained on Wikipedia and Gigaword. The paper compares several embedding sources; GloVe-100 is the one that gave them the headline numbers, so we match it.

word_to_id, id_to_word = create_mapping(train_sentences)
char_to_id, id_to_char = create_mapping(train_sentences, level='char')
tag_to_id, id_to_tag = create_mapping(train_sentences, level='tag')

python

We load GloVe into a matrix and initialize any out-of-vocabulary words randomly within the same range as the GloVe values:

word_embeds = np.random.uniform(-np.sqrt(0.06), np.sqrt(0.06), (len(word_to_id), 100))
for word in word_to_id:
    if word in glove_embeds:
        word_embeds[word_to_id[word]] = glove_embeds[word]

python

Model Architecture#

Three components:

CNN for character embeddings: convolutional + max-pool layers over the characters of each word, producing a fixed-size vector that captures morphology.
BiLSTM for word encoding: takes the concatenation of GloVe embeddings and the character-CNN output, runs it forward and backward.
CRF for structured prediction: a linear-chain CRF on top of the BiLSTM outputs, so the tagger can express constraints between adjacent tags.

The model is non-trivial. To keep the code readable I split it into a handful of functions that each correspond to one block above, so you can follow the data flow without holding the whole thing in your head at once.

Initialization#

The init_embedding function samples from $\mathcal{U}\left(-\sqrt{\frac{3}{V}}, +\sqrt{\frac{3}{V}}\right)$ where $V$ is the embedding dimension.

    bias = np.sqrt(3.0 / input_embedding.size(1))
    nn.init.uniform(input_embedding, -bias, bias)

python

The LSTM weights use Glorot-uniform: samples from $\mathcal{U}\left(-\sqrt{\frac{6}{r + c}}, +\sqrt{\frac{6}{r + c}}\right)$ , where $r$ and $c$ are the number of rows and columns of the weight matrix.

    bias = np.sqrt(6.0 / (input_linear.weight.size(0) + input_linear.weight.size(1)))
    nn.init.uniform(input_linear.weight, -bias, bias)

python

CRF layer#

Two options for the output layer:

softmax: normalize the per-token scores into a probability distribution over tags. The probability of a tag sequence is the product of per-token probabilities. This makes the tagging decision local — each token’s tag depends only on the token’s own score.
linear-chain CRF: scores entire sequences jointly. Given a sequence of words $w_1, ..., w_m$ , score vectors $s_1, ..., s_m$ , and tags $y_1, ..., y_m$ :

C(y_1, ..., y_m) = b[y_1] + \sum_{t=1}^m s_t[y_t] + \sum_{t=1}^{m-1}T[y_t, y_{t+1}] + e[y_m]

where $T$ is a transition matrix in $\R^{9 \times 9}$ and $e, b \in \R^9$ are start/end score vectors. The transition matrix is what lets the model learn that I-ORG should not follow I-PER.

The CRF predicts at the sentence level. Per-token scores feed in from the BiLSTM, and Viterbi decoding finds the highest-scoring tag sequence.

The practical reason CRF beats softmax here: NER tags have strong sequential constraints (I- tags only follow B- or I- tags of the same type, E- ends entities, and so on). Softmax doesn’t know about those constraints, so it occasionally emits illegal transitions. The CRF enforces them via the learned transition matrix.

A simple CRF network. In our model the inputs come from the BiLSTM, but the topology is otherwise the same.

Score Calculation#

The CRF computes a conditional probability. For tag sequence $y$ and input sequence $x$ :

P(y|x) = \frac{exp(Score(x, y))}{\sum_{y'}exp(Score(x, y'))}

The score is a sum of log potentials:

Score(x, y) = \sum_i log\psi_i(x, y)

def log_sum_exp(vec):
    '''
    Numerically stable log-sum-exp for the forward algorithm.
    vec 2D: 1 * tagset_size
    '''
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))

python

score_sentences computes the score of the gold tag sequence given the BiLSTM features. This is the numerator in the CRF probability above; the denominator is the partition function computed by the forward algorithm.

def score_sentences(self, feats, tags):
    # tags is ground_truth, a list of ints, length is len(sentence)
    # feats is a 2D tensor, len(sentence) * tagset_size
    r = torch.LongTensor(range(feats.size()[0]))
    if self.use_gpu:
        r = r.cuda()
        pad_start_tags = torch.cat([torch.cuda.LongTensor([self.tag_to_ix[START_TAG]]), tags])
        pad_stop_tags = torch.cat([tags, torch.cuda.LongTensor([self.tag_to_ix[STOP_TAG]])])
    else:
        pad_start_tags = torch.cat([torch.LongTensor([self.tag_to_ix[START_TAG]]), tags])
        pad_stop_tags = torch.cat([tags, torch.LongTensor([self.tag_to_ix[STOP_TAG]])])

    score = torch.sum(self.transitions[pad_stop_tags, pad_start_tags]) + torch.sum(feats[r, tags])

    return score

python

Viterbi decode#

Viterbi is dynamic programming over the tag lattice. The recurrence:

$\tilde s_t(y_t) = \arg\max_{y_{t+1}, ..., y_m} C(y_t, ..., y_m) = \arg\max_{y_{t+1}} s_t[y_t] + T[y_t, y_{t+1}] + \tilde s_{t+1}(y_{t+1})$

The probability of a given sequence is

\mathbb{P}(y_1, ..., y_m) = \frac{e^{C(y_1, ..., y_m)}}{Z}

where $Z$ is the partition function.

def forward_calc(self, sentence, chars, chars2_length, d):

    '''
    Calls viterbi decode and returns the most probable tag sequence.
    '''

    # Get the emission scores from the BiLSTM
    feats = self._get_lstm_features(sentence, chars, chars2_length, d)

    # Find the best path, given the features.
    if self.use_crf:
        score, tag_seq = self.viterbi_decode(feats)
    else:
        score, tag_seq = torch.max(feats, 1)
        tag_seq = list(tag_seq.cpu().data)

    return score, tag_seq

python

Model Details#

CNN for character embeddings#

Take the word cat. Pad it on both ends to a fixed maximum word length — this is an implementation quirk; PyTorch convolutions need fixed-size input, and the model ignores the pads.

Convolution Model for generating character embeddings

Apply a 1D convolution over the character embeddings, then max-pool. The output is a dense vector for the word, which gets concatenated with the GloVe lookup before going to the BiLSTM.

self.char_cnn3 = nn.Conv2d(in_channels=1, out_channels=self.out_channels, kernel_size=(3, char_embedding_dim), padding=(2,0))

python

BiLSTM for tag generation#

The concatenated word-level vector (GloVe + char-CNN) feeds into a 2-layer bidirectional LSTM.

The forward LSTM produces a hidden state at each position that summarizes everything seen up to and including the current word.
The backward LSTM does the same starting from the end.
The two hidden states are concatenated to give a context-aware representation for each word.

self.lstm = nn.LSTM(embedding_dim+self.out_channels, hidden_dim, bidirectional=True)

python

Main model#

get_lstm_features ties the per-word pipeline together:

Look up character embeddings, run them through the CNN.
Concatenate the char-CNN output with the GloVe vector for the word.
Run the result through the BiLSTM.
Project to tag space with a linear layer.

if self.char_mode == 'LSTM':
    chars_embeds = self.char_embeds(chars2).transpose(0, 1)
    packed = torch.nn.utils.rnn.pack_padded_sequence(chars_embeds, chars2_length)
    lstm_out, _ = self.char_lstm(packed)
    outputs, output_lengths = torch.nn.utils.rnn.pad_packed_sequence(lstm_out)
    outputs = outputs.transpose(0, 1)
    chars_embeds_temp = Variable(torch.FloatTensor(torch.zeros((outputs.size(0), outputs.size(2)))))
    if self.use_gpu:
        chars_embeds_temp = chars_embeds_temp.cuda()
    for i, index in enumerate(output_lengths):
        chars_embeds_temp[i] = torch.cat((outputs[i, index-1, :self.char_lstm_dim], outputs[i, 0, self.char_lstm_dim:]))
    chars_embeds = chars_embeds_temp.clone()
    for i in range(chars_embeds.size(0)):
        chars_embeds[d[i]] = chars_embeds_temp[i]

if self.char_mode == 'CNN':
    chars_embeds = self.char_embeds(chars2).unsqueeze(1)
    ## Character-level representation via Conv2d + maxpool
    chars_cnn_out3 = self.char_cnn3(chars_embeds)
    chars_embeds = nn.functional.max_pool2d(chars_cnn_out3,
    kernel_size=(chars_cnn_out3.size(2), 1)).view(chars_cnn_out3.size(0), self.out_channels)

python

Training#

SGD with learning rate 0.015 and momentum 0.9, matching the paper. Dropout and gradient clipping for regularization.

optimizer = torch.optim.SGD(model.parameters(), lr=0.015, momentum=0.9)

python

The loss is the negative log-likelihood of the gold tag sequence under the CRF:

def neg_log_likelihood(sentence, tags):
    feats = get_lstm_features(sentence)
    forward_score = forward_algorithm(feats)
    gold_score = score_sentence(feats, tags)
    return forward_score - gold_score

python

Evaluation#

Standard precision, recall, and F1 against the CoNLL 2003 test split. A sample prediction:

sentence = "Jay is from India."
prediction = model.predict(sentence)
print(prediction)

python

Jay: PER
is: NA
from: NA
India: LOC

yaml

What I took away#

A few things the workshop conversations clarified for me:

The single most leverage-bearing decision in reproducing this paper was matching the paper’s initialization choices exactly. Glorot for the LSTM, uniform-from-a-specific-range for the embeddings. Skip those and the convergence trajectory changes enough that you stop hitting the reported numbers.
The CRF layer matters more than its presentation in the paper suggests. With softmax instead, the model emits illegal tag sequences at a rate that costs about 1.5 F1 points on CoNLL test.
“Run on a fresh machine with one command” is a higher bar than it sounds. The original implementation depended on a specific version of an older framework; getting the same numbers in current PyTorch required rewriting more than just import statements.

The architecture itself has been mostly superseded by transformer-based NER, but the reproducibility lessons port directly to any modern paper.

Link to the original paper: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF ↗ ↩
Link to the repository: NER-LSTM-CNN-Pytorch ↗ ↩

Background#

Why I picked it for reproducibility#

Tutorial#

Data Preparation#

Mappings and Word Embeddings#

Model Architecture#

Initialization#

CRF layer#

Score Calculation#

Viterbi decode#

Model Details#

CNN for character embeddings#

BiLSTM for tag generation#

Main model#

Training#

Evaluation#

What I took away#

Footnotes#