Introduction#

If you’ve ever dipped your toes into natural language processing (NLP), you’ve probably come across tasks like part-of-speech (POS) tagging and named entity recognition (NER). These are the bread and butter of understanding text—things like figuring out whether “Apple” in a sentence refers to a fruit or a tech giant. But let’s be honest, the traditional ways of handling these tasks? Not exactly user-friendly. They usually involve manually crafting features and preprocessing data for hours. It’s tedious, time-consuming, and doesn’t always translate well when you try to apply it to new types of text.

That’s where a groundbreaking paper from ACL 2016 comes in: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF by Xuezhe Ma and Eduard Hovy¹. Don’t let the fancy title scare you—this paper is a game-changer. It proposes a neural network model that completely flips the script on the traditional approach. Instead of hand-picking features, it combines character-level representations (thanks to Convolutional Neural Networks, or CNNs) with word-level context (handled by Bi-directional LSTMs). The cherry on top? A Conditional Random Field (CRF) layer that ties everything together for structured predictions.

What’s so cool about this? Well, for starters, it’s an end-to-end model, which means it handles the entire process without requiring you to do all that annoying feature engineering. You just give it data, and it learns what it needs—automatically. It’s versatile too, working across a range of sequence labeling tasks.

And the results? Pretty impressive. This model hit 97.55% accuracy on POS tagging with the Penn Treebank WSJ dataset and scored a solid 91.21% F1 on NER using the CoNLL 2003 dataset. Those are state-of-the-art numbers, by the way.

Since its debut, this paper has left a big mark on the NLP community. It’s inspired a wave of follow-up research and implementations, all riffing on this idea of combining different neural network components into a single powerful system. This emphasis on end-to-end learning has made sequence labeling faster, easier, and way more adaptable.

Reproducibility#

This is why I chose this paper as a good candidate to present for reproducibility².

It’s a standout example of a complex yet impactful model, combining multiple neural network components into an end-to-end framework. The paper doesn’t just offer strong theoretical contributions; it also provides practical results that have influenced a significant portion of subsequent NLP research. By breaking down the BiLSTM-CNN-CRF model step-by-step, I wanted to demonstrate how a state-of-the-art system could be reproduced with modern tools like PyTorch, ensuring others could verify the results and, ideally, build upon them.

Reproducibility in this case meant more than just rerunning the experiments—it meant making the implementation accessible, understandable, and adaptable to different use cases. During my preparation, I paid close attention to the potential pain points: ensuring the code was platform-agnostic, tuning hyperparameters to match those used in the paper, and structuring the codebase so others could easily tweak it for related tasks. The aim was to not only replicate the model’s impressive benchmark performance but to encourage further experimentation and adoption by the community.

At the ICML workshop, this example opened up broader conversations about what it takes to make machine learning research truly reproducible. The discussion often circled back to the tools we use to share code and results. For instance, while frameworks like PyTorch offer flexibility and power, simply uploading a GitHub repository isn’t enough if it lacks proper documentation or doesn’t account for different computational setups. The group also explored the importance of including detailed training procedures and hyperparameter configurations—often overlooked but critical for achieving comparable results.

Tutorial#

Data Preparation#

Before we dive into the model, let’s prepare the data. We’ll use the CoNLL 2003 dataset, which contains text tagged for four types of named entities: PERSON, LOCATION, ORGANIZATION, and MISC.

The dataset uses the BIO tagging scheme, where:

  • B-TYPE indicates the beginning of an entity of type TYPE.
  • I-TYPE indicates the continuation of the entity.
  • O indicates that the word is not part of any entity.

For example,

```plaintext
U.N.         NNP  I-NP  I-ORG
official     NN   I-NP  O
Ekeus        NNP  I-NP  I-PER
heads        VBZ  I-VP  O
for          IN   I-PP  O
Baghdad      NNP  I-NP  I-LOC
```

We also preprocess the text by replacing all digits with 0. This step ensures that the model focuses on meaningful textual patterns instead of numeric details, which are often irrelevant in NER tasks.
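This step is a one-liner with Python’s re module (the function name zero_digits below is just illustrative):

```python
import re

def zero_digits(text):
    """Replace every digit in the input with 0."""
    return re.sub(r"\d", "0", text)

# e.g. a date token like "1996-08-22" becomes "0000-00-00"
```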

To align with the paper, we convert the tags to the BIOES scheme, which provides finer granularity by adding:

  • E-TYPE for the end of an entity.
  • S-TYPE for single-token entities.
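A minimal sketch of the BIO-to-BIOES conversion (this assumes the input is already in BIO2 form, where every span starts with a B- tag; the repository additionally normalizes CoNLL’s IOB1 tags first):

```python
def bio_to_bioes(tags):
    """Convert a BIO2 tag sequence to BIOES.

    A B- tag with no following I- of the same type becomes S-;
    an I- tag that ends its span becomes E-.
    """
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == "O":
            new_tags.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        span_continues = (nxt == "I-" + etype)
        if prefix == "B":
            new_tags.append(("B-" if span_continues else "S-") + etype)
        else:  # prefix == "I"
            new_tags.append(("I-" if span_continues else "E-") + etype)
    return new_tags
```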

Mappings and Word Embeddings#

To train our model, we need to convert words, characters, and tags into numeric representations. This mapping allows us to leverage efficient matrix operations in PyTorch. For word embeddings, we use pre-trained GloVe vectors (100-dimensional) trained on Wikipedia and Gigaword data. These embeddings provide a semantic representation of words, which significantly boosts the model’s performance.

```python
word_to_id, id_to_word = create_mapping(train_sentences)
char_to_id, id_to_char = create_mapping(train_sentences, level='char')
tag_to_id, id_to_tag = create_mapping(train_sentences, level='tag')
```
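The create_mapping helper itself isn’t shown in the post; for a flat list of tokens, a simplified frequency-ordered version might look like this (the real helper also handles the level argument and nested sentence structure):

```python
from collections import Counter

def create_mapping(items):
    """Build item->id and id->item dictionaries from a list of items,
    ordered by descending frequency (ties broken alphabetically)."""
    counts = Counter(items)
    sorted_items = sorted(counts, key=lambda w: (-counts[w], w))
    item_to_id = {item: i for i, item in enumerate(sorted_items)}
    id_to_item = {i: item for item, i in item_to_id.items()}
    return item_to_id, id_to_item
```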

We also load GloVe embeddings into a matrix and initialize missing embeddings randomly:

```python
word_embeds = np.random.uniform(-np.sqrt(0.06), np.sqrt(0.06), (len(word_to_id), 100))
for word in word_to_id:
    if word in glove_embeds:
        word_embeds[word_to_id[word]] = glove_embeds[word]
```

Model Architecture#

The BiLSTM-CNN-CRF model has three main components:

  • CNN for Character Embeddings: Generates a character-level representation for each word using convolutional and max-pooling layers. This step captures morphological patterns like prefixes and suffixes.

  • BiLSTM for Word Encoding: Combines word embeddings (GloVe + character embeddings) and processes them bidirectionally. This helps capture context from both the left and right of each word.

  • CRF for Structured Prediction: Ensures that predictions respect tagging constraints (e.g., I-ORG cannot follow I-PER).

The model we are presenting is a complicated one, since it is a hybrid network combining LSTMs and CNNs. To break down this complexity, we have split the operations into individual functions that we can go over one by one. This hopefully makes the whole thing more digestible and gives a more intuitive understanding of the process.

Initialization of weights#

We start with the init_embedding function, which initializes the embedding layer by sampling uniformly at random.

The values are sampled from $-\sqrt{\frac{3}{V}}$ to $+\sqrt{\frac{3}{V}}$, where $V$ is the embedding dimension size.

```python
bias = np.sqrt(3.0 / input_embedding.size(1))
nn.init.uniform_(input_embedding, -bias, bias)  # uniform_ (in-place) in current PyTorch
```

The linear and LSTM layers are initialized by uniform sampling from $-\sqrt{\frac{6}{r + c}}$ to $+\sqrt{\frac{6}{r + c}}$ (Glorot-style), where $r$ is the number of rows and $c$ is the number of columns of the weight matrix.

```python
bias = np.sqrt(6.0 / (input_linear.weight.size(0) + input_linear.weight.size(1)))
nn.init.uniform_(input_linear.weight, -bias, bias)  # uniform_ (in-place) in current PyTorch
```

CRF Layer#

We have two options:

  • softmax: normalize the scores into a vector that can be interpreted as the probability that the word belongs to each class. The probability of a sequence of tags $y$ is then the product of the probabilities of the individual tags.

  • linear-chain CRF: the first method makes local choices. In other words, even if we capture some information from the context thanks to the bi-LSTM, the tagging decision is still local: we don’t make use of the neighboring tagging decisions. Given a sequence of words $w_1, \dots, w_m$, a sequence of score vectors $s_1, \dots, s_m$ and a sequence of tags $y_1, \dots, y_m$, a linear-chain CRF defines a global score $C \in \mathbb{R}$ such that

$$C(y_1, \dots, y_m) = b[y_1] + \sum_{t=1}^{m} s_t[y_t] + \sum_{t=1}^{m-1} T[y_t, y_{t+1}] + e[y_m]$$

where $T$ is a transition matrix in $\mathbb{R}^{9 \times 9}$ and $e, b \in \mathbb{R}^9$ are vectors of scores that capture the cost of beginning or ending with a given tag. The matrix $T$ captures linear (one-step) dependencies between tagging decisions.

The motivation behind CRFs is to produce sentence-level likelihoods for the optimal tag sequence: we score every word’s candidate tags jointly and then use the Viterbi algorithm to decode the optimal tag sequence.
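To make the global score concrete, here is a tiny NumPy check of the formula above with toy values (none of these numbers come from the paper):

```python
import numpy as np

def crf_score(scores, tags, T, b, e):
    """C(y) = b[y_1] + sum_t s_t[y_t] + sum_t T[y_t, y_{t+1}] + e[y_m]."""
    s = b[tags[0]] + e[tags[-1]]                          # begin/end costs
    s += sum(scores[t, y] for t, y in enumerate(tags))    # emission scores
    s += sum(T[tags[t], tags[t + 1]] for t in range(len(tags) - 1))  # transitions
    return s

scores = np.array([[1., 2.], [3., 4.]])   # per-word tag scores s_t
T = np.array([[0., 1.], [0.5, 0.]])       # transition matrix
b = np.array([0.1, 0.2])                  # begin scores
e = np.array([0.3, 0.4])                  # end scores
print(crf_score(scores, [0, 1], T, b, e))  # 0.1 + (1 + 4) + 1 + 0.4 = 6.5
```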

Advantages of CRF over Softmax:

  • Softmax does not model any dependencies between tags. This is a problem since in NER the context heavily influences the tag that is assigned. A CRF solves this by taking the full sequence into account when assigning tags.
  • Example: I-ORG cannot directly follow I-PER.

A simple CRF network

The figure shows a simple CRF network; in our case the inputs feed in from our BiLSTMs, but otherwise the structure remains largely the same.

Score Calculation#

The CRF computes a conditional probability. Let $y$ be a tag sequence and $x$ an input sequence of words. Then we compute

$$P(y|x) = \frac{\exp(Score(x, y))}{\sum_{y'} \exp(Score(x, y'))}$$

where the score is determined by defining some log potentials $\log \psi_i(x, y)$ such that

$$Score(x, y) = \sum_i \log \psi_i(x, y)$$

```python
def log_sum_exp(vec):
    '''
    Numerically stable log-sum-exp, used when computing the score in
    the forward algorithm. vec is 2D: 1 * tagset_size. argmax is a
    small helper (from the repository) that returns the index of the
    largest element.
    '''
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
```
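The denominator of $P(y|x)$ sums over all possible tag sequences, which the forward algorithm computes efficiently in log space. For intuition, here is a hedged NumPy sketch of that computation (a toy stand-in, not the repository’s PyTorch implementation):

```python
import numpy as np

def forward_log_partition(scores, T, b, e):
    """log Z = log of the sum of exp(C(y)) over all tag sequences y,
    computed with the forward algorithm in log space."""
    alpha = b + scores[0]                                 # (n_tags,)
    for t in range(1, len(scores)):
        # log-sum-exp over the previous tag, for each current tag
        cand = alpha[:, None] + T + scores[t][None, :]    # (prev, cur)
        m = cand.max(axis=0)
        alpha = m + np.log(np.exp(cand - m).sum(axis=0))
    m = (alpha + e).max()
    return m + np.log(np.exp(alpha + e - m).sum())
```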

Next is the score function for our sentences. It takes a list of ground-truth tags and the feature matrix of per-word tag scores, and uses them to compute the sentence score defined above.

```python
def score_sentences(self, feats, tags):
    # tags is ground_truth, a list of ints, length is len(sentence)
    # feats is a 2D tensor, len(sentence) * tagset_size
    r = torch.LongTensor(range(feats.size()[0]))
    if self.use_gpu:
        r = r.cuda()
        pad_start_tags = torch.cat([torch.cuda.LongTensor([self.tag_to_ix[START_TAG]]), tags])
        pad_stop_tags = torch.cat([tags, torch.cuda.LongTensor([self.tag_to_ix[STOP_TAG]])])
    else:
        pad_start_tags = torch.cat([torch.LongTensor([self.tag_to_ix[START_TAG]]), tags])
        pad_stop_tags = torch.cat([tags, torch.LongTensor([self.tag_to_ix[STOP_TAG]])])

    score = torch.sum(self.transitions[pad_stop_tags, pad_start_tags]) + torch.sum(feats[r, tags])

    return score
```

Viterbi decode#

Viterbi decoding is essentially applying dynamic programming to choosing our tag sequence. Suppose that we have the solution

$$\tilde s_t(y_t) = \operatorname{argmax}_{y_t, \dots, y_m} C(y_t, \dots, y_m) = \operatorname{argmax}_{y_{t+1}} \, s_t[y_t] + T[y_t, y_{t+1}] + \tilde s_{t+1}(y_{t+1})$$

Then, we can easily define the probability of a given sequence of tags as

$$\mathbb{P}(y_1, \dots, y_m) = \frac{e^{C(y_1, \dots, y_m)}}{Z}$$

where $Z$ is the partition function, i.e. the sum of $e^{C}$ over all possible tag sequences.
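For intuition, Viterbi decoding under this score can be sketched in NumPy as follows (a toy stand-in for the repository’s PyTorch implementation):

```python
import numpy as np

def viterbi_decode(scores, T, b, e):
    """Return the highest-scoring tag sequence under the linear-chain
    CRF score C(y). scores: (m, n_tags) emission scores; T: (n, n)
    transition matrix; b, e: begin/end score vectors."""
    m, n = scores.shape
    dp = b + scores[0]                      # best score ending at each tag y_1
    back = np.zeros((m, n), dtype=int)      # backpointers
    for t in range(1, m):
        cand = dp[:, None] + T + scores[t][None, :]   # (prev_tag, cur_tag)
        back[t] = cand.argmax(axis=0)
        dp = cand.max(axis=0)
    dp = dp + e
    best = [int(dp.argmax())]
    for t in range(m - 1, 0, -1):           # follow backpointers in reverse
        best.append(int(back[t, best[-1]]))
    return best[::-1]
```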

Model Details#

CNN model for generating character embeddings#

Consider the word ‘cat’. We pad it on both ends up to the maximum word length (this is mainly an implementation quirk, since we can’t have variable-length layers at run time; our algorithm will ignore the pads).

Convolution Model for generating character embeddings

We then apply a convolution layer on top that captures spatial coherence across characters, and a max-pool to extract the most salient features from the convolution output. This gives us a dense vector representation of each word, which will be concatenated with the pre-trained GloVe embedding via a simple lookup.

This snippet shows how the CNN is implemented in PyTorch:

```python
self.char_cnn3 = nn.Conv2d(in_channels=1, out_channels=self.out_channels, kernel_size=(3, char_embedding_dim), padding=(2,0))
```
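To see the shapes involved, here is a hedged, self-contained sketch of running this convolution plus max-pooling over a batch of padded words (all dimension values below are illustrative, not the paper’s hyperparameters):

```python
import torch
import torch.nn as nn

char_embedding_dim, out_channels, max_word_len = 25, 30, 20

char_cnn3 = nn.Conv2d(in_channels=1, out_channels=out_channels,
                      kernel_size=(3, char_embedding_dim), padding=(2, 0))

# A batch of 8 words, each padded to max_word_len characters,
# each character embedded into char_embedding_dim dimensions.
chars = torch.randn(8, 1, max_word_len, char_embedding_dim)   # (batch, 1, L, D)
conv_out = char_cnn3(chars)                # (batch, out_channels, L + 2, 1)
# Max-pool over the character dimension -> one dense vector per word
char_repr = nn.functional.max_pool2d(
    conv_out, kernel_size=(conv_out.size(2), 1)).view(8, out_channels)
print(char_repr.shape)                     # torch.Size([8, 30])
```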

LSTM model that generates tags for the given sequence#

We feed the word representations (GloVe + character embeddings) generated above into a bi-directional LSTM model. The LSTM model has two layers:

LSTMs for Tag Generation

  • The forward layer takes in a sequence of word vectors and generates a new vector based on what it has seen so far in the forward direction (from the start word up to the current word). This vector can be thought of as a summary of all the words it has seen.

  • The backward layer does the same but in the opposite direction, i.e., from the end of the sentence to the current word.

  • The forward and backward vectors at the current word are concatenated to produce a unified representation.

This snippet shows how the BiLSTM is implemented in PyTorch:

```python
self.lstm = nn.LSTM(embedding_dim+self.out_channels, hidden_dim, bidirectional=True)
```

Main Model Implementation#

The get_lstm_features function returns the LSTM’s tag vectors. The function performs all the steps mentioned above for the model.

Steps:

  • It takes in characters and converts them to embeddings using our character CNN.
  • We concatenate the character embeddings with the GloVe vectors and use these as the features we feed to the bidirectional LSTM.
  • The bidirectional LSTM generates outputs based on this set of features.
  • The outputs are passed through a linear layer to project them into tag space.
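The steps above can be sketched as a small PyTorch module (the dimensions, names, and the 1-D char-CNN here are simplified stand-ins for the repository’s actual model):

```python
import torch
import torch.nn as nn

class SketchTagger(nn.Module):
    """Illustrative skeleton of get_lstm_features; not the real model."""
    def __init__(self, vocab_size, n_chars, n_tags,
                 word_dim=100, char_out=30, hidden_dim=200):
        super().__init__()
        self.word_embeds = nn.Embedding(vocab_size, word_dim)
        self.char_embeds = nn.Embedding(n_chars, 25)
        self.char_cnn = nn.Conv1d(25, char_out, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + char_out, hidden_dim, bidirectional=True)
        self.hidden2tag = nn.Linear(2 * hidden_dim, n_tags)

    def get_lstm_features(self, words, chars):
        # chars: (sent_len, max_word_len) char ids -> char-CNN representation
        c = self.char_embeds(chars).transpose(1, 2)      # (L, 25, W)
        c = self.char_cnn(c).max(dim=2).values           # (L, char_out)
        w = self.word_embeds(words)                      # (L, word_dim)
        feats = torch.cat([w, c], dim=1).unsqueeze(1)    # (L, 1, dim)
        out, _ = self.lstm(feats)                        # (L, 1, 2*hidden)
        return self.hidden2tag(out.squeeze(1))           # (L, n_tags)
```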

Training the model#

We train the model using stochastic gradient descent (SGD) with a learning rate of 0.015 and momentum of 0.9. To avoid overfitting, we apply dropout and use gradient clipping.

```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.015, momentum=0.9)
```
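The gradient clipping mentioned above can be done with torch.nn.utils.clip_grad_norm_. A minimal self-contained illustration (the threshold 5.0 matches the paper’s reported setup; the toy parameter is ours):

```python
import torch
import torch.nn as nn

# Toy parameter whose gradient we will clip.
w = nn.Parameter(torch.tensor([3.0, 4.0]))
loss = (10.0 * w).sum()
loss.backward()                                    # grad = [10, 10], norm ~ 14.14
torch.nn.utils.clip_grad_norm_([w], max_norm=5.0)  # rescales grad so its norm <= 5.0
print(w.grad.norm())                               # ~ 5.0
```

In a real training loop this call sits between loss.backward() and optimizer.step().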

We also calculate the negative log-likelihood as our loss function:

```python
def neg_log_likelihood(sentence, tags):
    feats = get_lstm_features(sentence)
    forward_score = forward_algorithm(feats)
    gold_score = score_sentence(feats, tags)
    return forward_score - gold_score
```

Evaluation and Testing#

After training, we evaluate the model using precision, recall, and F1-score. Here’s an example of the model tagging new sentences:

```python
sentence = "Jay is from India."
prediction = model.predict(sentence)
print(prediction)
```

This gives us the output:

```yaml
Jay : PER
is : NA
from : NA
India : LOC
```

… and that’s it! That gives us a state-of-the-art NER pipeline!

Closing Thoughts#

Presenting this work at the workshop was an incredibly rewarding experience. It wasn’t just about showcasing the BiLSTM-CNN-CRF model or diving into the nitty-gritty of the implementation details in PyTorch; seeing researchers and practitioners engage with this tutorial, ask thoughtful questions, and discuss their own challenges made all the preparation worth it.

If you have any thoughts, questions, or feedback, feel free to reach out—I’d love to hear from you!

Footnotes#

  1. Link to the original paper: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF

  2. Link to the repository: NER-LSTM-CNN-Pytorch

BiLSTM-CNN-CRF for Sequence Labeling: A Tutorial
https://theanig.dev/blog/ner-lstm-cnn-icml
Author: Anirudh Ganesh
Published: August 25, 2018