Named Entity Recognition from scratch on Spanish medical documents (Part A)
Introduction
In this post I am sharing my approach from participating in the MedDoProf data science challenge, a named entity recognition challenge organized during the summer of 2021. I found it interesting because the articles are in Spanish. I don’t speak Spanish, and I thought it would be cool to approach the task purely from a machine learning perspective. The documents of the challenge come from the medical domain; in the past I have mostly worked with eCommerce, CRM and academic datasets, so this is different. And perhaps it will help other open initiatives.
The data are available at Zenodo.
TL;DR: The code for this blog post is in my GitHub SpanishNER repo.
Plan
My plan for this challenge was simple:
- Get the data
- Build an end-to-end approach with a baseline model
- Iterate
When I started I was planning to
- use spacy for pre-processing
- use a linear CRF as a model
- iterate a bit on the features
- build a deep NN model and compare
- spend only a few hours or a couple of nights on this task
At the end:
- I managed to submit a working solution
- I did not build a deep neural network
I found myself spending most of the time working between the different formats:
- Getting the input data into a format suitable for NER
- Writing the predictions in the format the organizers requested
This proved to be more time-consuming than I originally thought due to the formats involved. At the same time, it was a nice Python exercise.
In this first post I will discuss the data preparation from the format of the organizers to a format that can be used for learning. Recall, my plan is to use a linear CRF, so I need features in the “traditional” sense: information that describes the words. This can be the word itself, linguistic features (part-of-speech, lemmas, …) and some morphological features (is it a number, its position in the text, capital letters, …).
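To make this concrete, here is roughly what a per-token feature dictionary for a CRF could look like (a hypothetical sketch; the names and values are illustrative, and the exact feature set is the topic of the next post):

# Hypothetical per-token features for a CRF; names and values are illustrative
token_features = {
    "word": "directora",
    "lemma": "director",       # illustrative lemma
    "pos": "NOUN",
    "is_number": False,
    "is_capitalized": False,
    "position_in_text": 12,    # token position in the document
}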
I found spacy to be excellent here, as it can do all these things with its nlp pipeline. Let’s get started!!
Loading raw data
The first goal is to load the data. There were 1,500 annotated documents released for training purposes.
Each document has a .txt file with the document content and an .ann file with the NER annotations that appear in it. Here is an example (cc_onco972.txt):
Excerpt from the .txt file:
Historia oncológica
Presento el caso de una mujer de 67 años, caucásica, directora de banca como profesión, con dos hermanas, casada, que vive con su marido, y con una hija en común, independiente para las actividades de la vida diaria.
No presenta alergias ni hábitos tóxicos conocidos. Es hipertensa en tratamiento con enalapril/hidroclorotiazida 20/12,5 mg. No toma otra medicación de forma crónica salvo paracetamol ocasional si precisa por dolor. Ha sido intervenida quirúrgicamente hace más de 20 años de un bocio multinodular. No presenta otros antecedentes patológicos de interés.
Excerpt from the .ann file:
T2 PROFESION 73 91 directora de banca
From the .ann file we see that there is an entity of type PROFESION spanning characters 73 to 91, which contain the text “directora de banca” (“bank manager”).
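As a quick sanity check, slicing the raw text with these character offsets should give back exactly this span (a minimal sketch; it assumes cc_onco972.txt is in the working directory):

from pathlib import Path

text = Path("cc_onco972.txt").read_text()
print(text[73:91])  # expected output: directora de banca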
The goal is to get the text and the annotations in a workable format.
from collections import defaultdict
from pathlib import Path


def get_raw_data(sourcepath):
    """Loads data as released and creates a dictionary containing
    both the text and the annotation. The dictionary structure is
        data[some_document_id]["text"]
        data[some_document_id]["ann"]
    sourcepath: directory containing the .txt and .ann files
    """
    data = defaultdict(dict)
    for filename in Path(sourcepath).iterdir():
        root, suffix = filename.stem, filename.suffix  # Given cc_onco972.txt, keeps cc_onco972
        file_content = Path(filename).read_text()
        if suffix == ".txt":
            data[root]["text"] = file_content
        if suffix == ".ann":
            data[root]["ann"] = transform_entity_file(file_content)
    return data


def transform_entity_file(labels):
    """Gets an entity file's contents (.ann)
    and returns a list of dictionaries.
    Each dictionary is an entity with its
    label, begin index, end index and actual text.
    """
    document_entities = []
    for label in labels.split("\n"):  # For each annotation line
        if label:
            _, tag_and_indexes, passage = label.split("\t")
            tag, start_idx, end_idx = tag_and_indexes.split()
            this_entity = {'label': tag,
                           'begin_idx': int(start_idx),
                           'end_idx': int(end_idx),
                           'passage': passage
                           }
            document_entities.append(this_entity)
    return document_entities
The code snippet is hopefully self-explanatory. Nice parts include the usage of pathlib.Path, which makes handling paths easier and works across operating systems. Path().stem and Path().suffix are also very useful for dealing with files and much less verbose than their os counterparts.
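For instance, here is what stem and suffix return for one of the challenge files (a minimal sketch):

from pathlib import Path

filename = Path("cc_onco972.txt")
print(filename.stem)    # cc_onco972
print(filename.suffix)  # .txt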
df = get_raw_data("/Users/gbalikas/Downloads/meddoprof-training-set/task1/")
print(len(df))
df["cc_onco972"]["ann"]
returns:
1500
[{'label': 'PROFESION',
'begin_idx': 73,
'end_idx': 91,
'passage': 'directora de banca'}]
which is what we were targeting.
Feature extraction
At this point we have the data, so we can proceed with feature extraction using spacy. Analyzing all 1,500 documents takes some time (a few minutes). To avoid waiting for this every time, the idea is to extract the features once and then persist them to disk. Also, this approach is generic in the sense that it does not leak any information: the steps (POS tagging, lemmatization, …) are very generic and can be applied to both training and test data without the risk of data leakage.
import spacy


def get_training_format(data):
    """Training format generator. The format looks like:
    data = {
        "doc_id": {
            "tokens": [...],  # actual token
            "ner": [...],     # IOB tag of token
            "pos": [...],     # part of speech of token
            "dep": [...],     # dep. parsing tag of token
            "lemma": [...],   # lemma of token
            "sid": [...]      # token start index in document
        }
    }
    """
    new_data = defaultdict(dict)
    nlp = spacy.load("es_core_news_sm")
    for root in data.keys():
        if 'ann' in data[root]:  # True for training; test documents do not have .ann files
            tokens, pos, ner, dep, lemma, sid = get_tokens_features(data[root]['text'],
                                                                    nlp,
                                                                    data[root]['ann'])
        else:
            tokens, pos, ner, dep, lemma, sid = get_tokens_features(data[root]['text'], nlp)
        new_data[root]['tokens'] = tokens
        new_data[root]['ner'] = ner
        new_data[root]['pos'] = pos
        new_data[root]['dep'] = dep
        new_data[root]['lemma'] = lemma
        new_data[root]['sid'] = sid
    return new_data


def get_tokens_features(text, nlp, ner_tags=None):
    """Uses spacy to extract features."""
    tokens, pos, ner, sid, dep, lemma = [], [], [], [], [], []
    doc = nlp(text)
    for token in doc:
        tokens.append(token.text)   # token as produced by the spacy tokenizer
        pos.append(token.pos_)      # part-of-speech
        sid.append(token.idx)       # start index of the token in the document
        dep.append(token.dep_)      # dependency parsing tag
        lemma.append(token.lemma_)  # lemma of the token
    ner = ["O"] * len(tokens)  # Default tag; also covers documents without entities
    if ner_tags:  # Evaluates to true for training data
        for id_, token in enumerate(tokens):
            # For each NER tag we have, check whether the current token is part of an entity
            # using the start and end indexes of the annotation
            for label in ner_tags:
                if (sid[id_] >= label['begin_idx']) and (sid[id_] < label['end_idx']):
                    # The current token falls within the entity limits, so IOB-tag it!
                    ner[id_] = 'B-' + label['label'] if sid[id_] == label['begin_idx'] \
                        else 'I-' + label['label']
    return tokens, pos, ner, dep, lemma, sid
I also hope the code here is clear in what it does. An interesting part is the last lines of get_tokens_features, where we use the character indexes of the entities to derive IOB tags. The idea is that if a token falls between the start and end indexes of an entity, it should be IOB-tagged. If the first index of the token coincides with the first index of the entity, the token should get the begin tag, e.g., B-PROFESION. This can become a bit harder than it looks if the documents are not clean (several encodings) or if the tokenizer does not produce the same boundaries as the gold data. Luckily, I did not have to face these problems here.
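To make the logic concrete, here is the entity from the example above worked through in isolation (a minimal sketch; the token start indexes assume the tokenizer splits the span into three tokens):

label = {'label': 'PROFESION', 'begin_idx': 73, 'end_idx': 91}
tokens = ['directora', 'de', 'banca']
sid = [73, 83, 86]  # start index of each token in the document

ner = ['O'] * len(tokens)
for i, start in enumerate(sid):
    if (start >= label['begin_idx']) and (start < label['end_idx']):
        ner[i] = 'B-' + label['label'] if start == label['begin_idx'] \
            else 'I-' + label['label']

print(ner)  # ['B-PROFESION', 'I-PROFESION', 'I-PROFESION']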
Having these functions, we can run:
data = get_training_format(df) # df is the data we loaded with `get_raw_data` above
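Since this call takes a few minutes, I persist the result to disk and reload it in later sessions. Something as simple as pickling works (a minimal sketch; the file name is just an example):

import pickle

# Extract the features once and save them to disk ...
with open("features.pkl", "wb") as f:
    pickle.dump(data, f)

# ... then reload them later without re-running the spacy pipeline
with open("features.pkl", "rb") as f:
    data = pickle.load(f)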
So at this point we have the data in a format we can use to train a Linear CRF. Here is the format:
data = {
"doc_id": {
"tokens": [...], # actual token
"ner": [...], # IOB tag of token
"pos": [...], # part of speech of token
"dep": [...], # dep. parsing tag of token
"lemma": [...], # lemma of token
"sid": [...] # token start index in document
}
}
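To sanity-check the alignment between tokens and tags, we can zip the two lists for a document (a sketch; the exact tokens depend on the spacy tokenizer):

doc = data["cc_onco972"]
for token, tag in zip(doc["tokens"], doc["ner"]):
    if tag != "O":
        print(token, tag)
# Expected to print something like:
#   directora B-PROFESION
#   de I-PROFESION
#   banca I-PROFESION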
Conclusion
At this point, we have the data pre-processed and we have extracted several useful linguistic features. I will cover training and evaluating a Linear CRF model in the next post.
The code is in my SpanishNER git repo. When publishing the code, I tried to follow some good engineering and reproducibility practices:
- Using virtual environments
- Writing unit tests and executing them with pytest; they can also serve as documentation of the functions above (see the sketch after this list)
- Having an argument parser so that the user can control aspects of the code
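For example, a unit test for transform_entity_file could look roughly like this (a hypothetical sketch; the actual tests live in the repo):

# Assumes transform_entity_file (defined above) is importable in the test module
def test_transform_entity_file():
    ann_line = "T2\tPROFESION 73 91\tdirectora de banca"
    entities = transform_entity_file(ann_line)
    assert entities == [{'label': 'PROFESION',
                         'begin_idx': 73,
                         'end_idx': 91,
                         'passage': 'directora de banca'}]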
I plan to provide a bit more details on why I believe these practices are important and why the extra effort and time spent on them is not wasted.