

BiLSTM-CNN-CRF for Sequence Labeling: A Tutorial
A PyTorch reproduction of the End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF paper for the ICML 2018 reproducibility workshop.
Background#
Named entity recognition and POS tagging used to require a lot of hand-crafted features: orthographic patterns, gazetteers, suffix lists, capitalization rules. The Ma and Hovy paper from ACL 2016, End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF 1, removed almost all of that. Their architecture combines three components:
- a character-level CNN that captures morphological patterns like prefixes and suffixes,
- a word-level BiLSTM that consumes the concatenation of GloVe embeddings and the CNN output, and
- a CRF layer on top that enforces tagging constraints (for example, an I-ORG tag cannot directly follow an I-PER tag).
The reported numbers are 97.55% accuracy on Penn Treebank WSJ POS tagging and 91.21% F1 on CoNLL 2003 NER. Both were SOTA at the time of publication.
Why I picked it for reproducibility#
I presented this work at the Enabling Reproducibility in Machine Learning workshop at ICML 2018. The reason I picked this particular paper: it touches every component of a modern sequence-labeling pipeline (character encoder, word encoder, structured output layer), the original results matter to a lot of downstream work, and the official implementation was hard to run on contemporary PyTorch. Reproducing it forces you to make every step concrete, which is the point of a reproducibility exercise.
My implementation is at TheAnig/NER-LSTM-CNN-Pytorch ↗ 2. The rest of this post walks through what’s in it.
What the workshop discussion converged on, repeatedly: a public GitHub repository is necessary but not sufficient for reproducibility. The harder requirements are pinned hyperparameters that actually match the paper, a setup that runs on a CPU as well as a GPU, and documentation that explains the small decisions (preprocessing, initialization, batching) that the paper itself usually doesn’t spell out.
Tutorial#
Data Preparation#
The dataset is CoNLL 2003, tagged for four entity types: PERSON, LOCATION, ORGANIZATION, MISC.
The dataset uses the BIO tagging scheme:
B-TYPEindicates the beginning of an entity of type TYPE.I-TYPEindicates the continuation of the entity.Oindicates that the word is not part of any entity.
For example:
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOCplaintextThe paper preprocesses by replacing all digits with 0. This collapses numeric tokens that would otherwise blow up the vocabulary without carrying semantic content for NER.
def zero_digits(s):
return re.sub('\\d', '0', s)
def load_sentences(path, zeros):
sentences = []
sentence = []
for line in open(path, 'r', encoding='utf-8'):
line = zero_digits(line.rstrip()) if zeros else line.rstrip()
if not line:
if len(sentence) > 0:
sentences.append(sentence)
sentence = []
else:
word = line.split()
assert len(word) >= 2
sentence.append(word)
return sentencespythonTo match the paper we convert BIO to BIOES, which adds two more tags:
E-TYPEfor the end of an entity.S-TYPEfor single-token entities.
def iob_iobes(tags):
"""
Convert BIO tagging to BIOES.
"""
new_tags = []
for i, tag in enumerate(tags):
if tag == 'O':
new_tags.append(tag)
elif tag.split('-')[0] == 'B':
if i + 1 != len(tags) and \
tags[i + 1].split('-')[0] == 'I':
new_tags.append(tag)
else:
new_tags.append(tag.replace('B-', 'S-'))
elif tag.split('-')[0] == 'I':
if i + 1 < len(tags) and \
tags[i + 1].split('-')[0] == 'I':
new_tags.append(tag)
else:
new_tags.append(tag.replace('I-', 'E-'))
else:
raise Exception('Invalid IOB format!')
return new_tagspythonMappings and Word Embeddings#
Words, characters, and tags get mapped to integer IDs so the whole pipeline runs as tensor lookups. For word embeddings, we use the 100-dimensional GloVe vectors trained on Wikipedia and Gigaword. The paper compares several embedding sources; GloVe-100 is the one that gave them the headline numbers, so we match it.
word_to_id, id_to_word = create_mapping(train_sentences)
char_to_id, id_to_char = create_mapping(train_sentences, level='char')
tag_to_id, id_to_tag = create_mapping(train_sentences, level='tag')pythonWe load GloVe into a matrix and initialize any out-of-vocabulary words randomly within the same range as the GloVe values:
word_embeds = np.random.uniform(-np.sqrt(0.06), np.sqrt(0.06), (len(word_to_id), 100))
for word in word_to_id:
if word in glove_embeds:
word_embeds[word_to_id[word]] = glove_embeds[word]pythonModel Architecture#
Three components:
- CNN for character embeddings: convolutional + max-pool layers over the characters of each word, producing a fixed-size vector that captures morphology.
- BiLSTM for word encoding: takes the concatenation of GloVe embeddings and the character-CNN output, runs it forward and backward.
- CRF for structured prediction: a linear-chain CRF on top of the BiLSTM outputs, so the tagger can express constraints between adjacent tags.
The model is non-trivial. To keep the code readable I split it into a handful of functions that each correspond to one block above, so you can follow the data flow without holding the whole thing in your head at once.
Initialization#
The init_embedding function samples from where is the embedding dimension.
bias = np.sqrt(3.0 / input_embedding.size(1))
nn.init.uniform(input_embedding, -bias, bias)pythonThe LSTM weights use Glorot-uniform: samples from , where and are the number of rows and columns of the weight matrix.
bias = np.sqrt(6.0 / (input_linear.weight.size(0) + input_linear.weight.size(1)))
nn.init.uniform(input_linear.weight, -bias, bias)pythonCRF layer#
Two options for the output layer:
-
softmax: normalize the per-token scores into a probability distribution over tags. The probability of a tag sequence is the product of per-token probabilities. This makes the tagging decision local — each token’s tag depends only on the token’s own score.
-
linear-chain CRF: scores entire sequences jointly. Given a sequence of words , score vectors , and tags :
where is a transition matrix in and are start/end score vectors. The transition matrix is what lets the model learn that I-ORG should not follow I-PER.
The CRF predicts at the sentence level. Per-token scores feed in from the BiLSTM, and Viterbi decoding finds the highest-scoring tag sequence.
The practical reason CRF beats softmax here: NER tags have strong sequential constraints (I- tags only follow B- or I- tags of the same type, E- ends entities, and so on). Softmax doesn’t know about those constraints, so it occasionally emits illegal transitions. The CRF enforces them via the learned transition matrix.
Score Calculation#
The CRF computes a conditional probability. For tag sequence and input sequence :
The score is a sum of log potentials:
def log_sum_exp(vec):
'''
Numerically stable log-sum-exp for the forward algorithm.
vec 2D: 1 * tagset_size
'''
max_score = vec[0, argmax(vec)]
max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
return max_score + torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))pythonscore_sentences computes the score of the gold tag sequence given the BiLSTM features. This is the numerator in the CRF probability above; the denominator is the partition function computed by the forward algorithm.
def score_sentences(self, feats, tags):
# tags is ground_truth, a list of ints, length is len(sentence)
# feats is a 2D tensor, len(sentence) * tagset_size
r = torch.LongTensor(range(feats.size()[0]))
if self.use_gpu:
r = r.cuda()
pad_start_tags = torch.cat([torch.cuda.LongTensor([self.tag_to_ix[START_TAG]]), tags])
pad_stop_tags = torch.cat([tags, torch.cuda.LongTensor([self.tag_to_ix[STOP_TAG]])])
else:
pad_start_tags = torch.cat([torch.LongTensor([self.tag_to_ix[START_TAG]]), tags])
pad_stop_tags = torch.cat([tags, torch.LongTensor([self.tag_to_ix[STOP_TAG]])])
score = torch.sum(self.transitions[pad_stop_tags, pad_start_tags]) + torch.sum(feats[r, tags])
return scorepythonViterbi decode#
Viterbi is dynamic programming over the tag lattice. The recurrence:
The probability of a given sequence is
where is the partition function.
def forward_calc(self, sentence, chars, chars2_length, d):
'''
Calls viterbi decode and returns the most probable tag sequence.
'''
# Get the emission scores from the BiLSTM
feats = self._get_lstm_features(sentence, chars, chars2_length, d)
# Find the best path, given the features.
if self.use_crf:
score, tag_seq = self.viterbi_decode(feats)
else:
score, tag_seq = torch.max(feats, 1)
tag_seq = list(tag_seq.cpu().data)
return score, tag_seqpythonModel Details#
CNN for character embeddings#
Take the word cat. Pad it on both ends to a fixed maximum word length — this is an implementation quirk; PyTorch convolutions need fixed-size input, and the model ignores the pads.
Apply a 1D convolution over the character embeddings, then max-pool. The output is a dense vector for the word, which gets concatenated with the GloVe lookup before going to the BiLSTM.
self.char_cnn3 = nn.Conv2d(in_channels=1, out_channels=self.out_channels, kernel_size=(3, char_embedding_dim), padding=(2,0))pythonBiLSTM for tag generation#
The concatenated word-level vector (GloVe + char-CNN) feeds into a 2-layer bidirectional LSTM.
- The forward LSTM produces a hidden state at each position that summarizes everything seen up to and including the current word.
- The backward LSTM does the same starting from the end.
- The two hidden states are concatenated to give a context-aware representation for each word.
self.lstm = nn.LSTM(embedding_dim+self.out_channels, hidden_dim, bidirectional=True)pythonMain model#
get_lstm_features ties the per-word pipeline together:
- Look up character embeddings, run them through the CNN.
- Concatenate the char-CNN output with the GloVe vector for the word.
- Run the result through the BiLSTM.
- Project to tag space with a linear layer.
if self.char_mode == 'LSTM':
chars_embeds = self.char_embeds(chars2).transpose(0, 1)
packed = torch.nn.utils.rnn.pack_padded_sequence(chars_embeds, chars2_length)
lstm_out, _ = self.char_lstm(packed)
outputs, output_lengths = torch.nn.utils.rnn.pad_packed_sequence(lstm_out)
outputs = outputs.transpose(0, 1)
chars_embeds_temp = Variable(torch.FloatTensor(torch.zeros((outputs.size(0), outputs.size(2)))))
if self.use_gpu:
chars_embeds_temp = chars_embeds_temp.cuda()
for i, index in enumerate(output_lengths):
chars_embeds_temp[i] = torch.cat((outputs[i, index-1, :self.char_lstm_dim], outputs[i, 0, self.char_lstm_dim:]))
chars_embeds = chars_embeds_temp.clone()
for i in range(chars_embeds.size(0)):
chars_embeds[d[i]] = chars_embeds_temp[i]
if self.char_mode == 'CNN':
chars_embeds = self.char_embeds(chars2).unsqueeze(1)
## Character-level representation via Conv2d + maxpool
chars_cnn_out3 = self.char_cnn3(chars_embeds)
chars_embeds = nn.functional.max_pool2d(chars_cnn_out3,
kernel_size=(chars_cnn_out3.size(2), 1)).view(chars_cnn_out3.size(0), self.out_channels)pythonTraining#
SGD with learning rate 0.015 and momentum 0.9, matching the paper. Dropout and gradient clipping for regularization.
optimizer = torch.optim.SGD(model.parameters(), lr=0.015, momentum=0.9)pythonThe loss is the negative log-likelihood of the gold tag sequence under the CRF:
def neg_log_likelihood(sentence, tags):
feats = get_lstm_features(sentence)
forward_score = forward_algorithm(feats)
gold_score = score_sentence(feats, tags)
return forward_score - gold_scorepythonEvaluation#
Standard precision, recall, and F1 against the CoNLL 2003 test split. A sample prediction:
sentence = "Jay is from India."
prediction = model.predict(sentence)
print(prediction)pythonJay: PER
is: NA
from: NA
India: LOCyamlWhat I took away#
A few things the workshop conversations clarified for me:
- The single most leverage-bearing decision in reproducing this paper was matching the paper’s initialization choices exactly. Glorot for the LSTM, uniform-from-a-specific-range for the embeddings. Skip those and the convergence trajectory changes enough that you stop hitting the reported numbers.
- The CRF layer matters more than its presentation in the paper suggests. With softmax instead, the model emits illegal tag sequences at a rate that costs about 1.5 F1 points on CoNLL test.
- “Run on a fresh machine with one command” is a higher bar than it sounds. The original implementation depended on a specific version of an older framework; getting the same numbers in current PyTorch required rewriting more than just import statements.
The architecture itself has been mostly superseded by transformer-based NER, but the reproducibility lessons port directly to any modern paper.
Footnotes#
-
Link to the original paper: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF ↗ ↩
-
Link to the repository: NER-LSTM-CNN-Pytorch ↗ ↩