TheAnig

Back

BiLSTM-CNN-CRF architecture for named entity recognition
My BOOX N96. Perfect for reading or taking notes, and the battery lasts forever
BiLSTM-CNN-CRF architecture for named entity recognition

Background#

Named entity recognition and POS tagging used to require a lot of hand-crafted features: orthographic patterns, gazetteers, suffix lists, capitalization rules. The Ma and Hovy paper from ACL 2016, End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF 1, removed almost all of that. Their architecture combines three components:

  • a character-level CNN that captures morphological patterns like prefixes and suffixes,
  • a word-level BiLSTM that consumes the concatenation of GloVe embeddings and the CNN output, and
  • a CRF layer on top that enforces tagging constraints (for example, an I-ORG tag cannot directly follow an I-PER tag).

The reported numbers are 97.55% accuracy on Penn Treebank WSJ POS tagging and 91.21% F1 on CoNLL 2003 NER. Both were SOTA at the time of publication.

Why I picked it for reproducibility#

I presented this work at the Enabling Reproducibility in Machine Learning workshop at ICML 2018. The reason I picked this particular paper: it touches every component of a modern sequence-labeling pipeline (character encoder, word encoder, structured output layer), the original results matter to a lot of downstream work, and the official implementation was hard to run on contemporary PyTorch. Reproducing it forces you to make every step concrete, which is the point of a reproducibility exercise.

My implementation is at TheAnig/NER-LSTM-CNN-Pytorch 2. The rest of this post walks through what’s in it.

What the workshop discussion converged on, repeatedly: a public GitHub repository is necessary but not sufficient for reproducibility. The harder requirements are pinned hyperparameters that actually match the paper, a setup that runs on a CPU as well as a GPU, and documentation that explains the small decisions (preprocessing, initialization, batching) that the paper itself usually doesn’t spell out.

Tutorial#

Data Preparation#

The dataset is CoNLL 2003, tagged for four entity types: PERSON, LOCATION, ORGANIZATION, MISC.

The dataset uses the BIO tagging scheme:

  • B-TYPE indicates the beginning of an entity of type TYPE.
  • I-TYPE indicates the continuation of the entity.
  • O indicates that the word is not part of any entity.

For example:

U.N.         NNP  I-NP  I-ORG
official     NN   I-NP  O
Ekeus        NNP  I-NP  I-PER
heads        VBZ  I-VP  O
for          IN   I-PP  O
Baghdad      NNP  I-NP  I-LOC
plaintext

The paper preprocesses by replacing all digits with 0. This collapses numeric tokens that would otherwise blow up the vocabulary without carrying semantic content for NER.

To match the paper we convert BIO to BIOES, which adds two more tags:

  • E-TYPE for the end of an entity.
  • S-TYPE for single-token entities.

Mappings and Word Embeddings#

Words, characters, and tags get mapped to integer IDs so the whole pipeline runs as tensor lookups. For word embeddings, we use the 100-dimensional GloVe vectors trained on Wikipedia and Gigaword. The paper compares several embedding sources; GloVe-100 is the one that gave them the headline numbers, so we match it.

word_to_id, id_to_word = create_mapping(train_sentences)
char_to_id, id_to_char = create_mapping(train_sentences, level='char')
tag_to_id, id_to_tag = create_mapping(train_sentences, level='tag')
python

We load GloVe into a matrix and initialize any out-of-vocabulary words randomly within the same range as the GloVe values:

word_embeds = np.random.uniform(-np.sqrt(0.06), np.sqrt(0.06), (len(word_to_id), 100))
for word in word_to_id:
    if word in glove_embeds:
        word_embeds[word_to_id[word]] = glove_embeds[word]
python

Model Architecture#

Three components:

  • CNN for character embeddings: convolutional + max-pool layers over the characters of each word, producing a fixed-size vector that captures morphology.
  • BiLSTM for word encoding: takes the concatenation of GloVe embeddings and the character-CNN output, runs it forward and backward.
  • CRF for structured prediction: a linear-chain CRF on top of the BiLSTM outputs, so the tagger can express constraints between adjacent tags.

The model is non-trivial. To keep the code readable I split it into a handful of functions that each correspond to one block above, so you can follow the data flow without holding the whole thing in your head at once.

Initialization#

The init_embedding function samples from U(3V,+3V)\mathcal{U}\left(-\sqrt{\frac{3}{V}}, +\sqrt{\frac{3}{V}}\right) where VV is the embedding dimension.

    bias = np.sqrt(3.0 / input_embedding.size(1))
    nn.init.uniform(input_embedding, -bias, bias)
python

The LSTM weights use Glorot-uniform: samples from U(6r+c,+6r+c)\mathcal{U}\left(-\sqrt{\frac{6}{r + c}}, +\sqrt{\frac{6}{r + c}}\right), where rr and cc are the number of rows and columns of the weight matrix.

    bias = np.sqrt(6.0 / (input_linear.weight.size(0) + input_linear.weight.size(1)))
    nn.init.uniform(input_linear.weight, -bias, bias)
python

CRF layer#

Two options for the output layer:

  • softmax: normalize the per-token scores into a probability distribution over tags. The probability of a tag sequence is the product of per-token probabilities. This makes the tagging decision local — each token’s tag depends only on the token’s own score.

  • linear-chain CRF: scores entire sequences jointly. Given a sequence of words w1,...,wmw_1, ..., w_m, score vectors s1,...,sms_1, ..., s_m, and tags y1,...,ymy_1, ..., y_m:

C(y1,...,ym)=b[y1]+t=1mst[yt]+t=1m1T[yt,yt+1]+e[ym]C(y_1, ..., y_m) = b[y_1] + \sum_{t=1}^m s_t[y_t] + \sum_{t=1}^{m-1}T[y_t, y_{t+1}] + e[y_m]

where TT is a transition matrix in R9×9\R^{9 \times 9} and e,bR9e, b \in \R^9 are start/end score vectors. The transition matrix is what lets the model learn that I-ORG should not follow I-PER.

The CRF predicts at the sentence level. Per-token scores feed in from the BiLSTM, and Viterbi decoding finds the highest-scoring tag sequence.

The practical reason CRF beats softmax here: NER tags have strong sequential constraints (I- tags only follow B- or I- tags of the same type, E- ends entities, and so on). Softmax doesn’t know about those constraints, so it occasionally emits illegal transitions. The CRF enforces them via the learned transition matrix.

A simple CRF network

A simple CRF network. In our model the inputs come from the BiLSTM, but the topology is otherwise the same.

Score Calculation#

The CRF computes a conditional probability. For tag sequence yy and input sequence xx:

P(yx)=exp(Score(x,y))yexp(Score(x,y))P(y|x) = \frac{exp(Score(x, y))}{\sum_{y'}exp(Score(x, y'))}

The score is a sum of log potentials:

Score(x,y)=ilogψi(x,y)Score(x, y) = \sum_i log\psi_i(x, y)
def log_sum_exp(vec):
    '''
    Numerically stable log-sum-exp for the forward algorithm.
    vec 2D: 1 * tagset_size
    '''
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
python

score_sentences computes the score of the gold tag sequence given the BiLSTM features. This is the numerator in the CRF probability above; the denominator is the partition function computed by the forward algorithm.

def score_sentences(self, feats, tags):
    # tags is ground_truth, a list of ints, length is len(sentence)
    # feats is a 2D tensor, len(sentence) * tagset_size
    r = torch.LongTensor(range(feats.size()[0]))
    if self.use_gpu:
        r = r.cuda()
        pad_start_tags = torch.cat([torch.cuda.LongTensor([self.tag_to_ix[START_TAG]]), tags])
        pad_stop_tags = torch.cat([tags, torch.cuda.LongTensor([self.tag_to_ix[STOP_TAG]])])
    else:
        pad_start_tags = torch.cat([torch.LongTensor([self.tag_to_ix[START_TAG]]), tags])
        pad_stop_tags = torch.cat([tags, torch.LongTensor([self.tag_to_ix[STOP_TAG]])])

    score = torch.sum(self.transitions[pad_stop_tags, pad_start_tags]) + torch.sum(feats[r, tags])

    return score
python

Viterbi decode#

Viterbi is dynamic programming over the tag lattice. The recurrence:

s~t(yt)=argmaxyt+1,...,ymC(yt,...,ym)=argmaxyt+1st[yt]+T[yt,yt+1]+s~t+1(yt+1)\tilde s_t(y_t) = \arg\max_{y_{t+1}, ..., y_m} C(y_t, ..., y_m) = \arg\max_{y_{t+1}} s_t[y_t] + T[y_t, y_{t+1}] + \tilde s_{t+1}(y_{t+1})

The probability of a given sequence is

P(y1,...,ym)=eC(y1,...,ym)Z\mathbb{P}(y_1, ..., y_m) = \frac{e^{C(y_1, ..., y_m)}}{Z}

where ZZ is the partition function.

Model Details#

CNN for character embeddings#

Take the word cat. Pad it on both ends to a fixed maximum word length — this is an implementation quirk; PyTorch convolutions need fixed-size input, and the model ignores the pads.

Convolution Model for generating character embeddings

Convolution Model for generating character embeddings

Apply a 1D convolution over the character embeddings, then max-pool. The output is a dense vector for the word, which gets concatenated with the GloVe lookup before going to the BiLSTM.

self.char_cnn3 = nn.Conv2d(in_channels=1, out_channels=self.out_channels, kernel_size=(3, char_embedding_dim), padding=(2,0))
python

BiLSTM for tag generation#

The concatenated word-level vector (GloVe + char-CNN) feeds into a 2-layer bidirectional LSTM.

LSTMs for Tag Generation

LSTMs for Tag Generation

  • The forward LSTM produces a hidden state at each position that summarizes everything seen up to and including the current word.
  • The backward LSTM does the same starting from the end.
  • The two hidden states are concatenated to give a context-aware representation for each word.
self.lstm = nn.LSTM(embedding_dim+self.out_channels, hidden_dim, bidirectional=True)
python

Main model#

get_lstm_features ties the per-word pipeline together:

  • Look up character embeddings, run them through the CNN.
  • Concatenate the char-CNN output with the GloVe vector for the word.
  • Run the result through the BiLSTM.
  • Project to tag space with a linear layer.

Training#

SGD with learning rate 0.015 and momentum 0.9, matching the paper. Dropout and gradient clipping for regularization.

optimizer = torch.optim.SGD(model.parameters(), lr=0.015, momentum=0.9)
python

The loss is the negative log-likelihood of the gold tag sequence under the CRF:

def neg_log_likelihood(sentence, tags):
    feats = get_lstm_features(sentence)
    forward_score = forward_algorithm(feats)
    gold_score = score_sentence(feats, tags)
    return forward_score - gold_score
python

Evaluation#

Standard precision, recall, and F1 against the CoNLL 2003 test split. A sample prediction:

sentence = "Jay is from India."
prediction = model.predict(sentence)
print(prediction)
python
Jay: PER
is: NA
from: NA
India: LOC
yaml

What I took away#

A few things the workshop conversations clarified for me:

  • The single most leverage-bearing decision in reproducing this paper was matching the paper’s initialization choices exactly. Glorot for the LSTM, uniform-from-a-specific-range for the embeddings. Skip those and the convergence trajectory changes enough that you stop hitting the reported numbers.
  • The CRF layer matters more than its presentation in the paper suggests. With softmax instead, the model emits illegal tag sequences at a rate that costs about 1.5 F1 points on CoNLL test.
  • “Run on a fresh machine with one command” is a higher bar than it sounds. The original implementation depended on a specific version of an older framework; getting the same numbers in current PyTorch required rewriting more than just import statements.

The architecture itself has been mostly superseded by transformer-based NER, but the reproducibility lessons port directly to any modern paper.

Footnotes#

  1. Link to the original paper: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF

  2. Link to the repository: NER-LSTM-CNN-Pytorch

BiLSTM-CNN-CRF for Sequence Labeling: A Tutorial
https://theanig.dev/blog/ner-lstm-cnn-icml
Author Anirudh Ganesh
Published at August 25, 2018