Fine-Tuning SciBERT to Recognize Drug Names and Adverse Effects

November 2021

In this post, I go through a project I did for the Hugging Face (HF) Community Event on November 15-19, 2021. I used the 🤗 transformers library to fine-tune the allenai/scibert_scivocab_uncased model on the ade_corpus_v2 dataset. The fine-tuned model is able to perform Named Entity Recognition (NER) to label drug names and adverse drug effects.

To showcase the final result, I deployed an interactive Gradio app powered by the model on HF Spaces.

Introduction

According to Wikipedia, an adverse effect (or "AE") is "an undesired harmful effect resulting from a medication or other intervention." The objective of this project was to train a machine learning model that tags adverse effects in an input text sample. The text could come from scientific publications, social media posts, drug labels, and other relevant sources. The HF ecosystem, particularly the transformers and datasets libraries, made this task achievable in a matter of days and with low overhead. Model training was done on a GPU-enabled AWS p3.2xlarge instance, which was provided by AWS for the duration of the Community Event.

Preparing the Dataset

The first task was to prepare a dataset that could be used to train an NER model. The ade_corpus_v2 dataset on the HF Hub was an excellent starting point, featuring (in the Ade_corpus_v2_drug_ade_relation subset) thousands of example texts with labeled spans for not only AEs, but also drug names. Because of this, I was able to go beyond the original objective by training the model simultaneously to tag drugs.

However, the data could not be used as-is, because drugs and AEs are identified one pair at a time, as opposed to all at once for a given sentence. As a result, if a sentence contains multiple drugs or AEs, that sentence appears multiple times in the dataset. For example, below is a single sentence appearing repeatedly in this manner:

{'text': 'After therapy for diabetic coma with insulin (containing the preservative cresol) and electrolyte solutions was started, the patient complained of increasing myalgia, developed a high fever and respiratory and metabolic acidosis and lost consciousness.', 'drug': 'insulin', 'effect': 'increasing myalgia', 'indexes': {'drug': {'start_char': [37], 'end_char': [44]}, 'effect': {'start_char': [147], 'end_char': [165]}}}

{'text': 'After therapy for diabetic coma with insulin (containing the preservative cresol) and electrolyte solutions was started, the patient complained of increasing myalgia, developed a high fever and respiratory and metabolic acidosis and lost consciousness.', 'drug': 'cresol', 'effect': 'lost consciousness', 'indexes': {'drug': {'start_char': [74], 'end_char': [80]}, 'effect': {'start_char': [233], 'end_char': [251]}}}

{'text': 'After therapy for diabetic coma with insulin (containing the preservative cresol) and electrolyte solutions was started, the patient complained of increasing myalgia, developed a high fever and respiratory and metabolic acidosis and lost consciousness.', 'drug': 'cresol', 'effect': 'high fever', 'indexes': {'drug': {'start_char': [74], 'end_char': [80]}, 'effect': {'start_char': [179], 'end_char': [189]}}}
Therefore, despite containing 6821 rows, the dataset only contains around 4200 unique text samples. Having text samples repeat in this manner is problematic in an NER setting: if we treat each row as a unique datapoint, we confuse the model with contradictory labels for the same entities. For instance, the first two rows of the above example would be annotated as follows:

After therapy for diabetic coma with <DRUG>insulin</DRUG> (containing the preservative cresol) and electrolyte solutions was started, the patient complained of <EFFECT>increasing myalgia</EFFECT>, developed a high fever and respiratory and metabolic acidosis and lost consciousness.

After therapy for diabetic coma with insulin (containing the preservative <DRUG>cresol</DRUG>) and electrolyte solutions was started, the patient complained of increasing myalgia, developed a high fever and respiratory and metabolic acidosis and <EFFECT>lost consciousness</EFFECT>.
In the first instance of the sentence, "insulin" is labeled as a drug, and "increasing myalgia" as an AE. In the second instance, neither of the entities from the first instance is labeled, and instead, "cresol" is labeled as a drug and "lost consciousness" as an AE. Therefore, if the model learns to label "insulin" as a drug after seeing the first instance, it would actually be penalized for doing so in the second instance, where "insulin" is not assigned a label at all.

In order to rectify this, I created a single set of labels for each sentence. I first grouped the rows of the dataset by sentence, then gathered all of the unique starting and ending indices for all drugs and AEs appearing in each sentence. Finally, I performed a single pass to tag all tokens in each sentence with the correct labeled entities. I followed the IOB tagging format to label each token. Since we have two entities ("drug" and "AE"), each token was assigned one of five possible labels:

B-DRUG - the beginning of a drug entity

I-DRUG - inside a drug entity

B-EFFECT - the beginning of an AE entity

I-EFFECT - inside an AE entity

O - outside any entity being tagged

This resulted in a new dataset of 4271 entries, with a total of 110,497 individual tokens after processing via the SciBERT tokenizer. The dataset was split 75-25 into training and test sets.
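
The grouping and tagging steps described above can be sketched roughly as follows. This is a minimal illustration rather than the code from my notebook: the helper names (make_row, merge_rows, iob_tags) and the toy sentence are made up for this example, and tokens are split with a simple regular expression instead of the SciBERT tokenizer.

```python
import re
from collections import defaultdict


def make_row(text, drug, effect):
    """Build a toy row shaped like the dataset rows shown earlier (hypothetical helper)."""
    d, e = text.index(drug), text.index(effect)
    return {
        "text": text, "drug": drug, "effect": effect,
        "indexes": {
            "drug": {"start_char": [d], "end_char": [d + len(drug)]},
            "effect": {"start_char": [e], "end_char": [e + len(effect)]},
        },
    }


def merge_rows(rows):
    """Group rows by sentence and collect the unique (start, end, label) spans."""
    spans_by_text = defaultdict(set)
    for row in rows:
        for kind, label in (("drug", "DRUG"), ("effect", "EFFECT")):
            idx = row["indexes"][kind]
            for start, end in zip(idx["start_char"], idx["end_char"]):
                spans_by_text[row["text"]].add((start, end, label))
    return spans_by_text


def iob_tags(text, spans):
    """Tag regex-split tokens with IOB labels in a single pass over the sentence."""
    tokens, tags = [], []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in sorted(spans):
            if start <= match.start() and match.end() <= end:
                # The token that opens the span gets B-, later tokens get I-.
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Toy sentence (not from the corpus) that appears in three rows.
text = "The patient developed a high fever and myalgia after insulin containing cresol."
rows = [
    make_row(text, "insulin", "high fever"),
    make_row(text, "cresol", "myalgia"),
    make_row(text, "insulin", "myalgia"),
]
merged = merge_rows(rows)
tokens, tags = iob_tags(text, merged[text])
print(list(zip(tokens, tags)))
```

The three rows collapse into one entry with four unique spans, so every drug and AE in the sentence is labeled in a single, consistent tag sequence.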

Fine-Tuning SciBERT

SciBERT is a pre-trained BERT model released by the Allen Institute for AI. It was specifically pre-trained on a large corpus of scientific publications. Pre-training a model entails training it on an objective designed to make the model learn the relationships between tokens in the training data. If the pre-training corpus comes from a specific genre (e.g. scientific publications, social media posts, source code, specific languages), then the model would learn the particulars of that genre.

Once a model is pre-trained, developers and researchers are able to train it further on other tasks, also known as "fine-tuning." In this setting, I added a token classification layer to the pre-trained SciBERT model using the AutoModelForTokenClassification class. Unlike pre-training, which is expensive, time-consuming, and requires a lot of data, fine-tuning on a well-aligned task is often a quick process. Here, the model only needed 3 epochs to reach a reasonable test set performance.
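
One detail that token classification usually requires is matching labels to the subword pieces produced by the tokenizer. The sketch below shows one common way to do this, assuming the labels start out at the word level and that word_ids has the shape returned by the word_ids() method of a fast 🤗 tokenizer (None for special tokens, otherwise the index of the source word). The function name and the example split are hypothetical, and this is not necessarily how my notebook handles it.

```python
LABELS = ["O", "B-DRUG", "I-DRUG", "B-EFFECT", "I-EFFECT"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_tags):
    """Expand word-level IOB tags to one label id per subword token."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] are excluded from the loss.
            aligned.append(IGNORE_INDEX)
        else:
            tag = word_tags[word_id]
            # Later pieces of the same word continue the entity instead of restarting it.
            if word_id == previous and tag.startswith("B-"):
                tag = "I-" + tag[2:]
            aligned.append(LABEL2ID[tag])
        previous = word_id
    return aligned


# Hypothetical split: the first word (a drug name) is broken into three pieces.
word_tags = ["B-DRUG", "O", "B-EFFECT", "I-EFFECT"]
word_ids = [None, 0, 0, 0, 1, 2, 3, None]
aligned = align_labels(word_ids, word_tags)
print(aligned)
```

The label ids here follow the same order as the id2label mapping used when loading the fine-tuned model later in this post.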

After the model was trained, validation metrics were computed on the held-out test set:

Category        Precision  Recall  F1     Count
DRUG            0.923      0.966   0.944  1299*
EFFECT          0.805      0.873   0.838  1412*
Overall Tokens  0.861      0.918   0.888  27759
* These counts and their corresponding metrics are over entities, as opposed to tokens. The counts exactly correspond to the number of tokens of the given type with the prefix B-. By contrast, the third row is describing performance over tokens, including the non-entity class O.

The metrics were computed by the Trainer object using the seqeval Python package. seqeval is a re-implementation of the popular conlleval Perl script, which evaluates NER performance according to the specifications in Tjong Kim Sang & Buchholz (2000). It is important to note that the metrics are computed at the entity level and not at the token level, meaning (emphasis mine):

[p]recision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities present in the corpus that are found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file (Tjong Kim Sang & De Meulder 2003).
Thus, it does not suffice for the model to correctly label part of an entity - it must correctly label all and only those tokens that comprise an entity in order to receive credit.
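
To make the exact-match rule concrete, here is a small pure-Python sketch of entity-level scoring for a single tag sequence. It is an illustration of the idea only, not seqeval's actual implementation, and the function names are my own.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) entities encoded by an IOB tag sequence."""
    entities, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == kind
        if kind is not None and not continues:
            entities.add((kind, start, i))
            start, kind = None, None
        # A B- tag always opens an entity; a stray I- tag opens one too.
        if tag.startswith("B-") or (tag.startswith("I-") and kind is None):
            start, kind = i, tag[2:]
    return entities


def entity_scores(true_tags, pred_tags):
    """Precision, recall, and F1 where only exact entity matches count."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["O", "B-EFFECT", "I-EFFECT", "O", "B-DRUG"]
pred = ["O", "B-EFFECT", "O", "O", "B-DRUG"]  # the effect is only partially tagged
scores = entity_scores(gold, pred)
print(scores)  # (0.5, 0.5, 0.5)
```

Even though four of the five predicted tags are right, the partially tagged effect earns no credit, so only the drug counts as a correct entity.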

One interesting observation about the test set metrics is that the model performs better on DRUG entities than on EFFECT entities, despite there being more examples of effects than drugs in our training set. My guess is that this is because commercial drug names are synthetically generated and tend to follow a formula, whereas adverse effects are described with ordinary words that indicate an adverse effect only in particular contexts. In our test set, the average DRUG entity is slightly shorter than the average EFFECT entity, although the distribution of EFFECT lengths skews right a bit more.

Once fine-tuning was complete, I uploaded the fine-tuned model to the HF model Hub. From there, anyone with the model name can import it into a pipeline object to perform inference on custom input:

from transformers import (AutoModelForTokenClassification,
                          AutoTokenizer,
                          pipeline)

# Load the fine-tuned checkpoint from the HF Hub, with readable label names
model_checkpoint = "jsylee/scibert_scivocab_uncased-finetuned-ner"
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint, num_labels=5,
    id2label={0: 'O', 1: 'B-DRUG', 2: 'I-DRUG', 3: 'B-EFFECT', 4: 'I-EFFECT'})
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Wrap the model and tokenizer in an NER pipeline for inference
model_pipeline = pipeline(task="ner", model=model, tokenizer=tokenizer)

print(model_pipeline("Abortion, miscarriage or uterine hemorrhage associated with misoprostol (Cytotec), a labor-inducing drug."))
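
By default, the "ner" pipeline returns one dictionary per tagged token, with keys including entity, index, start, and end. The sketch below shows one way to stitch those token-level predictions back into whole entity spans. The helper name is my own, and the predictions list is hand-written for illustration (including the subword split), not actual output from the model.

```python
def merge_predictions(text, predictions):
    """Merge token-level B-/I- predictions into (label, surface text) pairs."""
    spans = []
    for pred in predictions:
        prefix, _, label = pred["entity"].partition("-")
        last = spans[-1] if spans else None
        if (prefix == "I" and last is not None and last["label"] == label
                and pred["index"] == last["last_index"] + 1):
            # Adjacent I- token of the same type: extend the current span.
            last["end"] = pred["end"]
            last["last_index"] = pred["index"]
        else:
            spans.append({"label": label, "start": pred["start"],
                          "end": pred["end"], "last_index": pred["index"]})
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]


text = "Uterine hemorrhage associated with misoprostol."
# Hand-written example in the shape of the pipeline's token-level output.
predictions = [
    {"entity": "B-EFFECT", "index": 1, "start": 0, "end": 7},
    {"entity": "I-EFFECT", "index": 2, "start": 8, "end": 18},
    {"entity": "B-DRUG", "index": 5, "start": 35, "end": 39},
    {"entity": "I-DRUG", "index": 6, "start": 39, "end": 46},
]
entities = merge_predictions(text, predictions)
print(entities)  # [('EFFECT', 'Uterine hemorrhage'), ('DRUG', 'misoprostol')]
```

The pipeline also accepts an aggregation_strategy argument (for example "simple") that performs a similar grouping internally, which is likely the more convenient option in practice.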

Conclusion

As a participant in the recent HF Community Event for the second part of their Transformer course, I modified a dataset of medical text and used it to fine-tune SciBERT on an NER task. To me, this experience demonstrated how easy HF has made it to train and share custom Transformer-based models for natural language processing. It also showed how pre-training vastly scales up the application of Transformer-based models in practice, by making it feasible to train high-performing models across a variety of NLP objectives at relatively low cost.

My final notebook from the event, which contains the data preprocessing pipeline outlined in this post, as well as the code to fine-tune the model, can be found on GitHub here.

References

