Evaluation of Different NLP Models for Parsing and Extraction of Clinical Data from Scientific Articles


Natural language processing models are able to automate the extraction of pertinent information from a vast array of scientific literature.

Image credit: AREE | stock.adobe.com

In medical research, extracting relevant clinical data from scientific articles is a critical yet challenging task. This study, presented by Therapyte, evaluates various natural language processing (NLP) models that automate the extraction of pertinent information about chronic obstructive pulmonary disease (COPD) from a vast array of scientific literature.


COPD is a group of illnesses, including emphysema and chronic bronchitis, that cause airflow obstruction and breathing difficulties. Approximately 36.5 million people in Europe have COPD, and millions more either have undiagnosed COPD or are not receiving treatment.

At the same time, the rapid growth of the medical literature has made efficient tools for extracting relevant clinical data a necessity. This study therefore focuses on identifying and evaluating different NLP tools for analyzing literature on COPD, aiming to streamline the data extraction process and enhance the accuracy of information retrieval (1,2).


The research involved multiple stages:

A. Extraction of Articles. During this stage, articles and data were filtered based on the classifications of diseases found in different sources, namely PubMed, Google Scholar, and clinicaltrials.gov.

B. Selection of Articles. Additional criteria were applied at this stage: treatment options, age, type of disease, gender, and stage of clinical trials.

C. Entity and Relationship Recognition. The out-of-the-box quality of the selected models was not high enough to meet the goals of the project. To improve it, the models were trained on specially prepared datasets, which were annotated in INCEpTION to mark entities and the relations between them. The collected parameters included spirometry, alpha-1 testing, oximetry, arterial blood gas, dosage of target medication, and drug type.

D. Fine-Tuning and Training. At this stage, models were trained on the prepared datasets and then fine-tuned to maximize their effectiveness. The following NLP models were tested in the project: ClinicalTransformer, SynSPERT, BioBERT, EHR, RoBERTa, ELECTRA, and GPT models.

E. Inference and Evaluation. Fine-tuned models were applied to all articles and the results were assessed. If the results were unsatisfactory, stages C and D were repeated.
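
As an illustration of stage B, the selection step can be sketched as a simple filter over article metadata. This is a minimal sketch and not the study's actual implementation: the `Article` fields, the `select_articles` function, and the sample records are all hypothetical, and the age and gender criteria are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Article:
    # All field names are hypothetical; the study does not publish its schema.
    article_id: str
    source: str                 # e.g. "PubMed", "Google Scholar", "clinicaltrials.gov"
    disease_type: str           # e.g. "emphysema", "chronic bronchitis"
    treatment: Optional[str]    # None when no treatment option is reported
    trial_phase: Optional[int]  # None when the article is not tied to a trial


def select_articles(articles, disease_types, min_phase=None, require_treatment=True):
    """Keep only the articles that satisfy every criterion that was supplied."""
    selected = []
    for article in articles:
        if article.disease_type not in disease_types:
            continue
        if require_treatment and article.treatment is None:
            continue
        if min_phase is not None and (
            article.trial_phase is None or article.trial_phase < min_phase
        ):
            continue
        selected.append(article)
    return selected


# Toy records, invented purely for illustration.
corpus = [
    Article("a1", "PubMed", "emphysema", "bronchodilator", 3),
    Article("a2", "Google Scholar", "asthma", "corticosteroid", 2),
    Article("a3", "clinicaltrials.gov", "chronic bronchitis", None, 2),
    Article("a4", "PubMed", "chronic bronchitis", "corticosteroid", 1),
]

selected = select_articles(corpus, {"emphysema", "chronic bronchitis"}, min_phase=2)
print([a.article_id for a in selected])
```

Keeping each criterion as an independent check makes it easy to relax or tighten the selection when stages C and D have to be repeated on a different subset of articles.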

Challenges Encountered

Our investigation faced several challenges:

A. Data Presentation in Tables: Automated parsing of table-formatted data proved difficult, necessitating manual intervention.

B. Clinical Event Detection: None of the models reliably distinguished clinical events from improvements.

C. Rare Parameters: The rarity of certain medical terms complicated the training process of the models.
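
One way to spot the rare-parameter problem before training is to count how often each entity type appears in the annotated data. The sketch below assumes annotations are available as (text, label) pairs; the function name `rare_entity_types`, the label names, the sample values, and the threshold are all illustrative assumptions rather than details from the study.

```python
from collections import Counter


def rare_entity_types(annotations, min_count):
    """Return entity types seen fewer than min_count times, sorted by name."""
    counts = Counter(label for _, label in annotations)
    return sorted(label for label, n in counts.items() if n < min_count)


# Toy (text, label) pairs, invented purely for illustration.
annotations = [
    ("FEV1 58% predicted", "spirometry"),
    ("FEV1/FVC 0.62", "spirometry"),
    ("FVC 3.1 L", "spirometry"),
    ("SpO2 91%", "oximetry"),
    ("SpO2 94%", "oximetry"),
    ("PaO2 60 mmHg", "arterial_blood_gas"),
    ("AAT 0.8 g/L", "alpha1_testing"),
]

print(rare_entity_types(annotations, min_count=2))
# -> ['alpha1_testing', 'arterial_blood_gas']
```

Entity types flagged this way are candidates for additional annotation effort before another training round.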


The project showed that each model performed better on particular types of tasks; no single model was best across all of them.

Nevertheless, the BioBERT model, pre-trained on medical domain data, showed the best results for parsing entities, with an average F1 score of 0.75. The quality of entity extraction depended heavily on how frequently an entity appeared in the articles, which in turn determined how much training data was available for it. For instance, the F1 score was much higher for frequent entities, reaching 0.9.
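
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it for entity extraction under an exact-match assumption, both overall and per entity type, which is how a dependence on entity frequency would become visible. The study does not describe its matching rules, so the (start, end, type) representation, the function names, and the toy data are all assumptions.

```python
def f1_score(gold, predicted):
    """Exact-match F1 between two collections of hashable annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def f1_by_type(gold, predicted):
    """F1 computed separately for each entity type (third tuple element)."""
    types = {e[2] for e in gold} | {e[2] for e in predicted}
    return {
        t: f1_score(
            {e for e in gold if e[2] == t},
            {e for e in predicted if e[2] == t},
        )
        for t in sorted(types)
    }


# Toy (start, end, type) spans, invented purely for illustration.
gold = {
    (0, 4, "spirometry"),
    (10, 14, "spirometry"),
    (20, 24, "oximetry"),
    (30, 36, "drug_type"),
}
predicted = {
    (0, 4, "spirometry"),
    (10, 14, "spirometry"),
    (20, 25, "oximetry"),   # boundary error: counts as a miss under exact match
    (30, 36, "drug_type"),
    (40, 44, "drug_type"),  # spurious prediction
}

print(round(f1_score(gold, predicted), 3))
print({t: round(v, 3) for t, v in f1_by_type(gold, predicted).items()})
```

Here three of five predictions are correct (precision 0.6) and three of four gold entities are found (recall 0.75), giving an overall F1 of about 0.667. The same set-based function applies to relations if each relation is represented as a hashable tuple, for example (head, relation type, tail).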

For parsing relations between frequently occurring entities, ClinicalTransformers performed best, attaining an F1 score of 0.83. It was complemented by SynSPERT, which performed best at recognizing unique relations, with an F1 score of 0.76.

Conclusion and Future Directions

When the project concluded, the BioBERT and ClinicalTransformers models showed the highest accuracy. GPT models (versions earlier than GPT-3) were used only for specific tasks in the study because they were not yet widely adopted at the time. Given how widely GPT models are now applied to various tasks, they appear highly promising for parsing clinical data from scientific articles.


1. Adam Benjafield, Daniela Tellez, Meredith Barrett, Rahul Gondalia, Carlos Nunez, Jadwiga Wedzicha, Atul Malhotra. European Respiratory Journal 2021; 58: OA2866. DOI: 10.1183/13993003.congress-2021.OA2866

2. Chronic Obstructive Pulmonary Disease (COPD). Centers for Disease Control and Prevention. Retrieved December 20, 2023, from https://www.cdc.gov/copd/index.html

About Therapyte

Therapyte offers a wide range of Real World Evidence (RWE) solutions with access to millions of electronic health records (EHRs). Therapyte is developing unique AI algorithms to collect curated data derived from these records. Its in-house, AI-powered tools are used to clean, curate, bridge, harmonize, and validate real-world data to generate evidence for comprehensive clinical research projects. To mine valuable datasets, the company is building a broad network of healthcare organizations and data providers from the EU, CIS, MENA, and APAC. More information can be found at its website: therapyte.com

© 2024 MJH Life Sciences

All rights reserved.