Named Entity Recognition from scratch on Spanish medical documents (Part A)

7 minute read


In this post I am sharing my approach to the MedDoProf data science challenge, a named entity recognition challenge organized during the summer of 2021. I found it interesting because the documents are in Spanish. I don’t speak Spanish, and I thought it would be cool to approach the task purely from a machine learning perspective. The documents come from the medical domain; in the past I have mostly worked with eCommerce, CRM and academic datasets, so this is different. And perhaps it will help other open initiatives.

The data are available at Zenodo.

TL;DR: Code for this blog post is in my SpanishNER repo on GitHub.


My plan for this challenge was simple:

  • Get the data
  • Build an end-to-end approach with a baseline model
  • Iterate

When I started I was planning to

  • use spacy for pre-processing
  • use a linear CRF as a model
  • iterate a bit on the features
  • build a deep NN model and compare
  • spend only a few hours or a couple of nights on this task

At the end:

  • I managed to submit a working solution
  • I did not build a deep neural network

I found myself spending most of the time working between the different formats:

  • Getting the input data into a format suitable for NER
  • Writing the predictions in the way the organizers requested

This proved to be more time-consuming than I originally thought due to the format. At the same time, it was a nice Python exercise.

In this first post I will discuss the data preparation from the format of the organizers to a format that can be used for learning. Recall, my plan is to use a linear CRF, so I need features in the “traditional” sense: information that describes the words. This can be the word itself, linguistic features (part-of-speech, lemmas, ..) and some morphological features (is number, position in text, capital letters, …).

I found spacy to be excellent as it can do all these things with its nlp pipeline. Let’s get started!!

Loading raw data

The first goal is to load the data. There were 1,500 annotated documents released for training purposes. Each document has a .txt file with the document content and an .ann file with the NER annotations. Here is an example (cc_onco972.txt):

excerpt from the .txt file

Historia oncológica
Presento el caso de una mujer de 67 años, caucásica, directora de banca como profesión, con dos hermanas, casada, que vive con su marido, y con una hija en común, independiente para las actividades de la vida diaria.
No presenta alergias ni hábitos tóxicos conocidos. Es hipertensa en tratamiento con enalapril/hidroclorotiazida 20/12,5 mg. No toma otra medicación de forma crónica salvo paracetamol ocasional si precisa por dolor. Ha sido intervenida quirúrgicamente hace más de 20 años de un bocio multinodular. No presenta otros antecedentes patológicos de interés.

excerpt from the .ann file

T2	PROFESION 73 91	directora de banca

From the .ann we see that there is an entity of type “Profession” spanning characters 73 to 91, which contains the text “directora de banca”. The goal is to get the text and the annotations in a workable format.

from collections import defaultdict
from pathlib import Path

def get_raw_data(sourcepath):
    """Loads data as released and creates a dictionary containing
    both the text and the annotations.

    sourcepath: directory containing the .txt and .ann files
    """
    data = defaultdict(dict)
    for filename in Path(sourcepath).iterdir():
        root, suffix = filename.stem, filename.suffix  # Given cc_onco972.txt, keeps cc_onco972
        file_content = Path(filename).read_text()
        if suffix == ".txt":
            data[root]["text"] = file_content
        if suffix == ".ann":
            data[root]["ann"] = transform_entity_file(file_content)
    return data

def transform_entity_file(labels):
    """Gets an entity file's contents (.ann)
    and returns a list of dictionaries.
    Each dictionary is an entity with its
    label, begin index, end index and actual text.
    """
    document_entities = []
    for label in labels.split("\n"):  # For each annotation line
        if label:
            _, tag_and_indexes, passage = label.split("\t")
            tag, start_idx, end_idx = tag_and_indexes.split()
            this_entity = {'label': tag,
                           'begin_idx': int(start_idx),
                           'end_idx': int(end_idx),
                           'passage': passage}
            document_entities.append(this_entity)
    return document_entities

The code snippet is hopefully self-explanatory. Nice parts include the usage of pathlib.Path, which makes handling paths easier and works across operating systems. Path().stem and Path().suffix are also very useful for dealing with files and much less verbose than their os counterparts.
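For a quick illustration of what these pathlib attributes return (the path here is just a made-up example):

```python
from pathlib import Path

p = Path("data/cc_onco972.txt")
print(p.stem)    # cc_onco972
print(p.suffix)  # .txt
print(p.parent)  # data
```
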

df = get_raw_data("/Users/gbalikas/Downloads/meddoprof-training-set/task1/")

Inspecting the annotations of the example document:

df['cc_onco972']['ann']

[{'label': 'PROFESION',
  'begin_idx': 73,
  'end_idx': 91,
  'passage': 'directora de banca'}]

which is what we were targeting.

Feature extraction

Now that we have the data, we can proceed with feature extraction using spacy. Analyzing all 1,500 documents takes a few minutes. To avoid waiting for this every time, the idea is to extract the features once and persist them to disk. Also, this approach is generic in the sense that it does not leak any information: the steps (POS tagging, lemmatization, ..) do not depend on the labels, so we can perform them on both training and test data without the risk of data leakage.
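The persistence step can be as simple as pickling the feature dictionary. Here is a minimal sketch; the `cache_features`/`load_features` names and the file path are my own for illustration, not from the challenge code:

```python
import pickle
from pathlib import Path

def cache_features(data, cache_path="features.pkl"):
    """Serialize the extracted features so spacy only has to run once."""
    Path(cache_path).write_bytes(pickle.dumps(data))

def load_features(cache_path="features.pkl"):
    """Reload the cached features from disk."""
    return pickle.loads(Path(cache_path).read_bytes())
```

On subsequent runs we can check whether the cache file exists and skip the spacy pipeline entirely.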

import spacy

def get_training_format(data):
    """Training format generator. The format looks like:
        data = {
            "doc_id": {
                "tokens": [...],  # actual token
                "ner": [...],     # IOB tag of token
                "pos": [...],     # part of speech of token
                "dep": [...],     # dep. parsing tag of token
                "lemma": [...],   # lemma of token
                "sid": [...]      # token start index in document
            }
        }
    """
    new_data = defaultdict(dict)
    nlp = spacy.load("es_core_news_sm")
    for root in data.keys():
        if 'ann' in data[root]:  # True for training; test docs have no .ann files
            tokens, pos, ner, dep, lemma, sid = get_tokens_features(data[root]['text'], nlp,
                                                                    ner_tags=data[root]['ann'])
        else:
            tokens, pos, ner, dep, lemma, sid = get_tokens_features(data[root]['text'], nlp)
        new_data[root]['tokens'] = tokens
        new_data[root]['ner'] = ner
        new_data[root]['pos'] = pos
        new_data[root]['dep'] = dep
        new_data[root]['lemma'] = lemma
        new_data[root]['sid'] = sid
    return new_data

def get_tokens_features(text, nlp, ner_tags=None):
    """Uses spacy to extract features"""
    tokens, pos, ner, sid, dep, lemma = [], [], [], [], [], []
    doc = nlp(text)
    for token in doc:
        tokens.append(token.text)  # tokenizes with spacy
        pos.append(token.pos_)  # part-of-speech
        sid.append(token.idx)  # start index of the token
        dep.append(token.dep_)  # dep parsing
        lemma.append(token.lemma_)  # lemma of the token
    ner = ["O"]*len(tokens)  # In case there are no entities
    if ner_tags:  # Evaluates to true for training data
        for id_, token in enumerate(tokens):
            # For each NER tag we have, validate if the current word is an entity
            # using the start and the end indexes of the data
            for label in ner_tags:
                if (sid[id_] >= label['begin_idx']) and (sid[id_] < label['end_idx']):
                    # If the current token is between the entity limits, IOB-tag it!
                    ner[id_] = 'B-' + label['label'] if sid[id_] == label['begin_idx']\
                        else 'I-' + label['label']
    return tokens, pos, ner, dep, lemma, sid

I also hope the code here is clear in what it does. An interesting part is the last lines of get_tokens_features, where we use the indexes of the entities to produce IOB tags. The idea is that if a token falls between the start and end indexes of an entity, it should be IOB-tagged. If the first index of the word coincides with the first index of the entity, the word should also be tagged with the begin tag, e.g., B-PROFESION. This can become harder than it looks if the documents are not clean (several encodings) or if the tokenizer does not produce the same boundaries as the gold data. Luckily, I did not have to face these problems here.
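To make the index logic concrete, here is a self-contained sketch of the same tagging rule. The helper name and the surrounding token offsets are made up for illustration; only the entity span matches the cc_onco972 example above:

```python
def iob_tags(token_starts, entities):
    """Assign B-/I- tags to tokens based on their start offsets in the document."""
    tags = ["O"] * len(token_starts)
    for i, start in enumerate(token_starts):
        for ent in entities:
            # Token starts inside the entity span -> tag it
            if ent["begin_idx"] <= start < ent["end_idx"]:
                tags[i] = ("B-" if start == ent["begin_idx"] else "I-") + ent["label"]
    return tags

# "directora", "de", "banca" start at offsets 73, 83, 86 in the document
entity = {"label": "PROFESION", "begin_idx": 73, "end_idx": 91}
print(iob_tags([70, 73, 83, 86, 92], [entity]))
# → ['O', 'B-PROFESION', 'I-PROFESION', 'I-PROFESION', 'O']
```
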

With these functions in place, we can run:

data = get_training_format(df)  # df is the data we loaded with `get_raw_data` above

So at this point we have the data in a format we can use to train a Linear CRF. Here is the format:

data = {
    "doc_id": {
        "tokens": [...],  # actual token
        "ner": [...],     # IOB tag of token
        "pos": [...],     # part of speech of token
        "dep": [...],     # dep. parsing tag of token
        "lemma": [...],   # lemma of token
        "sid": [...]      # token start index in document
    }
}

At this point, we have the data pre-processed and we have extracted several useful linguistic features. I will cover training and evaluating a Linear CRF model in the next post.

The code is in my SpanishNER git repo. When publishing the code, I tried to follow some good engineering and reproducibility practices:

  • Using virtual environments
  • Writing unit tests and executing them with pytest. They can also serve as documentation of the functions above.
  • Having an argument parser so that the user can control aspects of the code

I plan to provide a bit more detail on why I believe these practices are important and why the extra effort and time is not wasted.
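As an example of such a test, a pytest-style check for transform_entity_file could look like this (a sketch; the function body is repeated here for self-containment, and the actual tests live in the repo):

```python
def transform_entity_file(labels):
    """Copied from the post above: parse an .ann file's contents into entity dicts."""
    document_entities = []
    for label in labels.split("\n"):
        if label:
            _, tag_and_indexes, passage = label.split("\t")
            tag, start_idx, end_idx = tag_and_indexes.split()
            document_entities.append({'label': tag,
                                      'begin_idx': int(start_idx),
                                      'end_idx': int(end_idx),
                                      'passage': passage})
    return document_entities

def test_transform_entity_file():
    # The annotation line from cc_onco972.ann shown earlier
    ann = "T2\tPROFESION 73 91\tdirectora de banca"
    expected = [{'label': 'PROFESION',
                 'begin_idx': 73,
                 'end_idx': 91,
                 'passage': 'directora de banca'}]
    assert transform_entity_file(ann) == expected
```

Such a test doubles as a worked example of the input and output formats.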