How NLP, OCR, and Social Graphs Accelerate Subject Recruitment for Clinical Trials


Explore how natural language processing and social graph techniques help to tackle the challenge of patient and investigator recruitment and raise the success rate of clinical trials.

Only 14% of all clinical trials are successful, state the authors of an article [1] published in Biostatistics in 2019. According to comparable studies [2], the chances of success for a compound entering trials are even lower. In the end, regardless of whether the truth is closer to 10% or 14%, the risk-reward calculus for clinical trials is dicey, and it is hard to find another major type of business operating under such a high failure rate.

A crucial challenge that many companies in this area regularly face is finding a sufficient number of patients who are willing to participate in clinical trials. This is reflected in a study [3] published by Grant D. Huang on ScienceDirect, which concludes that a staggering 86% of clinical trials do not reach their enrollment targets within the predefined time period. Consequently, researchers may never learn whether their drug is safe and effective. Qualified patients may not be identified within the narrow trial eligibility window and, as a result, miss the opportunity to try cutting-edge medications.

There are a number of challenges faced by contract research organizations (CROs) during the doctor and patient recruitment stage, such as:

  • Data disparity. A large share of patient health data is stored in different systems and document formats (electronic medical records (EMR), laboratory information systems (LIS), genomic data, PDFs from third-party labs, etc.) and is not integrated into a single database. Sometimes the data is scanned as an image and the attached document is not text-searchable, so investigators need extra time to open it and work out what it contains.
  • Site effort. Enrolling a single patient for a clinical trial may take 9 or more hours of screening time. Many clinical trial sites manage numerous clinical trials for various disease types at the same time, which means they administer data on thousands of patients simultaneously. With different inclusion and exclusion criteria for every clinical study, as well as complex clinical study designs (double-blind, cross-over, factorial, cohort studies, etc.), it becomes harder to find suitable patients who qualify for a trial within the selected time period.

    In addition, investigators often overestimate the number of patients that can realistically be recruited into a study, and many doctors overestimate how many patients will actually want to participate.
  • Patient effort. Many patients don't want to participate in a clinical trial and opt to receive a standard treatment, as they don't want to spend time testing a product that might not work or might even harm them. What's more, traveling to the medical site and completing the required online and offline paperwork are time-consuming and scare many potential candidates away.

Structuring the available data, making it easily searchable, and finding connections between principal investigators, patients, and clinical organizations is another clinical trial bottleneck, especially when the study is conducted across different geographical locations. Structured data enables clinical organizations to be proactive, anticipate their needs, and optimize business efficiency. For example, a multi-site investigator might not be able to see all the subjects going through different parts of the clinical trial enrollment process. Still, they have to know when each patient was enrolled and what observations have been recorded, in case the research team needs information about a specific patient. With the help of data science and NLP, investigators and CROs can structure and view all the information regarding patients and their conditions.

To quickly invite influential doctors who have written about a certain topic to participate in a corresponding clinical trial, CROs can use NLP techniques to identify the connections between different document authors and find out who is the most influential author on a specific topic. Furthermore, data science and NLP methods help to evaluate which patients fit the specified inclusion criteria.
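
Once patient data has been structured, checking patients against inclusion and exclusion criteria can be expressed as a simple rule-based filter. The following is a minimal sketch, not a production screening tool: the `Patient` and `Criteria` classes, their fields, and the sample values are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    # Hypothetical structured record, e.g. produced by an NLP pipeline.
    patient_id: str
    age: int
    diagnoses: set = field(default_factory=set)
    medications: set = field(default_factory=set)


@dataclass
class Criteria:
    # Hypothetical, simplified trial criteria.
    min_age: int
    max_age: int
    required_diagnoses: set
    excluded_medications: set


def is_eligible(patient, criteria):
    """Return True if the patient meets every inclusion rule and no exclusion rule."""
    if not (criteria.min_age <= patient.age <= criteria.max_age):
        return False
    # Inclusion: every required diagnosis must be present.
    if not criteria.required_diagnoses <= patient.diagnoses:
        return False
    # Exclusion: no excluded medication may be present.
    if patient.medications & criteria.excluded_medications:
        return False
    return True


def screen(patients, criteria):
    """Return the IDs of all patients who pass the criteria."""
    return [p.patient_id for p in patients if is_eligible(p, criteria)]
```

Real protocols contain far more nuanced criteria (lab value ranges, time windows, prior therapies), so a filter like this would only serve as a first pass before manual review.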

But what actually is NLP?

How NLP processes big data in clinical trials

NLP is an area of artificial intelligence and computational linguistics. It focuses on using computing power to analyze natural language, in particular text, in order to identify patterns, names, and other entities. NLP can process speech, cluster text by topic, extract relationships between objects, classify documents, and much more. With the aid of NLP, the data from disparate sources can be unified, labeled, and structured. Data scientists can integrate EMR, LIS, and lab test data into one database and process it with the help of NLP, so it delivers more value to clinical organizations.

There are six NLP techniques which are frequently used while working with unstructured medical data:

  • Named entity recognition extracts entities from text documents, such as the names of organic molecules used as components of a drug, doctors’ names, institutions, locations, and contact details. For instance, named entity extraction helps to detect the names of doctors conducting research on a certain medical topic.
  • Semantic parsing analyzes the structure of the text and its parts. It determines the meaning of each text part and the relationships between these parts.
  • Topic modeling identifies the likely topics covered in different text units. It helps to better understand, categorize, and cluster text units and to discover hidden patterns in the data.
  • Keyword extraction identifies the most frequently used keywords and key phrases within a dataset. These keywords serve as data points that help to make sense of voluminous amounts of text.
  • Document summarization produces short summaries of different texts and text parts. Its key purpose is to automatically identify which sentences represent the core of a document or of several documents.
  • Relationship extraction identifies the relationships between objects within a dataset. Based on relationships between doctors’ names, for example, a social graph of influential doctors can be created. Similarly, research that mentions the names of certain organic molecules can be used to create a map of possible solutions for certain diseases.
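
To make one of these techniques concrete, keyword extraction can be approximated in its simplest form by counting word frequencies after removing common filler words. The sketch below is only an illustration under that assumption: the stopword list is a tiny hypothetical sample, and real systems typically rely on larger vocabularies, weighting schemes such as TF-IDF, or dedicated NLP libraries.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "and", "for", "with", "are", "was", "were", "that", "this", "from"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword tokens in the text."""
    # Lowercase the text and keep alphabetic tokens, allowing internal hyphens.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter(t for t in tokens if len(t) > 2 and t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]
```

Applied to a set of abstracts, a function like this gives a rough first impression of which drugs, conditions, or procedures dominate the corpus.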

To get the most out of their everyday work, and thereby gain a competitive advantage, many leading life sciences companies have already implemented NLP techniques.

“In the clinical domain, researchers have used NLP systems to identify clinical syndromes and common biomedical concepts from radiology reports, discharge summaries, problem lists, nursing documentation, and medical education documents. Different NLP systems have been developed and utilized to extract events and clinical concepts from text (…). Success stories in applying these tools have been reported widely”, says an article [4] that evaluates the applications of clinical information extraction, issued in the Journal of Biomedical Informatics.

How OCR can help in clinical trials

A huge amount of medical data is still stored in a non-editable format, such as typewritten medical notes, text on images, or printed-out documents. Extracting this information is often an important aspect of clinical trials, as the data can provide valuable evidence regarding the drug that is tested.

However, it can be time-consuming and laborious for a doctor to go through and review individual records. Optical character recognition (OCR) techniques help to digitize printed texts, such as scanned PDFs, so they become electronically editable, searchable, and usable for further analysis.

After OCR, text mining and NLP techniques can be used to extract specific features or objects from the scanned documents. Thanks to advances over the years, automatic text mining today is not only less laborious but also more consistent and reliable, detecting 3%–14% more feature instances than manual checks, according to research [5] that assessed text-mining-assisted extraction of pathology features from scanned clinical records and was published in BMJ Open in May 2020.
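
As a rough illustration of the text mining step that follows OCR, the sketch below pulls a few fields out of already-recognized text with regular expressions. It is not the method used in the cited study: the feature names, the patterns, and the sample report wording are hypothetical, and the OCR step itself is assumed to have been done by a separate tool.

```python
import re

# Hypothetical patterns for three illustrative pathology fields.
PATTERNS = {
    "tumor_size_mm": re.compile(r"tumou?r\s+size\s*[:=]?\s*(\d+(?:\.\d+)?)\s*mm", re.I),
    "grade": re.compile(r"grade\s*[:=]?\s*([1-3])\b", re.I),
    "margin_status": re.compile(r"margins?\s*[:=]?\s*(positive|negative|clear|involved)", re.I),
}


def extract_features(ocr_text):
    """Return a dict of extracted values; a field is None when no match is found."""
    # OCR output often contains irregular line breaks and spacing, so collapse whitespace first.
    normalized = re.sub(r"\s+", " ", ocr_text)
    features = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(normalized)
        features[name] = match.group(1).lower() if match else None
    return features
```

Fields that come back as `None` can be routed to a human reviewer, which keeps the manual effort focused on the records the automated pass could not resolve.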

Overall, OCR and text mining facilitate prompt and accurate abstractions that can be used to speed up clinical research processes.

Speeding up patient recruitment using social graphs

To reduce the time needed to find appropriate investigators and patients for a specific clinical trial, CROs can fall back on an approach that all kinds of brands use when marketing their products: the social graph technique. The term refers to a method of data analysis that originated in the use of social networks to find influencers, that is, people engaging with the largest and most relevant audience on social media.

The most famous social graph is the one created by Facebook, connecting its 2.7 billion monthly users [6]. For the pharmaceutical industry, a social graph can be built to show the connections between different doctors who conduct research on a specific topic. On the graph, CROs and sponsors can easily see which investigators they have already invited to participate in a clinical trial and which have not yet been invited.

This approach is very helpful because influential principal investigators, or Key Opinion Leaders (KOLs), play a vital role in pharma research, development, and the marketing of new products. Fact-based identification of, and engagement with, the right KOLs can influence the quality of partnerships, pharma business objectives, and a medication’s overall life cycle.
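
One simple way to approximate such a graph is to link doctors who have co-authored a publication and then rank them by how many distinct co-authors they have (degree centrality). The sketch below assumes that author lists have already been extracted, for example via named entity recognition. The function names and sample data are hypothetical, and degree centrality is only one of several possible influence measures.

```python
from itertools import combinations


def build_coauthor_graph(papers):
    """Build an undirected graph: each author maps to the set of their co-authors."""
    graph = {}
    for authors in papers:
        unique = sorted(set(authors))
        for author in unique:
            graph.setdefault(author, set())
        # Link every pair of authors who appear on the same paper.
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Share of all other authors that each author is directly connected to."""
    n = len(graph)
    if n <= 1:
        return {author: 0.0 for author in graph}
    return {author: len(neighbors) / (n - 1) for author, neighbors in graph.items()}


def uninvited_by_influence(graph, invited):
    """List investigators not yet invited, most connected first (ties broken by name)."""
    centrality = degree_centrality(graph)
    candidates = [a for a in graph if a not in invited]
    return sorted(candidates, key=lambda a: (-centrality[a], a))
```

Combining this ranking with the list of investigators already contacted gives a prioritized outreach list for the next recruitment round.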

Identifying top-tier trial sites

One of the most important aspects of a clinical trial is selecting high-functioning investigator sites because they can dramatically affect product approval, study costs, and timelines. Too often, however, the identification system for sites is not very mature. As a consequence, the decision of whether a site is deemed suitable is often simply based on whether the necessary infrastructure and know-how to fulfill the activities specified in the clinical study protocol are available. That is why only one-third of all sites manage to attract enough patients, with many of them falling considerably short or not even enrolling a single participant. 

To improve the process of finding trial sites that perform well, it makes sense to include criteria such as an investigator’s expert status (e.g., how many articles they have published and how often these are cited) or prior experience in clinical trials with similar treatments. Other critical factors could, for instance, be the site’s location and its previous success rates in enlisting subjects, the proximity of comparable studies, or the epidemiological data of the specific patient population. Information like this can be gathered by combining targeted database population, electronic health records, insurance databases, prescriptions, and so on, and then using NLP techniques to make sense of the data’s semantic relationships.

A semantic query could, for instance, ask the solution for sites at which an advanced kind of brain surgery is performed. The system can then gather all relevant sites, and a site-scoring algorithm can automatically rank them according to parameters such as the frequency of this special operation, the expert status of the responsible doctor, the overall site experience with this procedure, or former enrollment rates. The value of this approach lies in the accurate prediction of how well a site matches and in the huge time savings for researchers, who do not have to do this work manually.
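
A site-scoring algorithm of this kind can be sketched as a weighted sum of normalized parameters. The example below is an illustration only: the feature names, the weights, and the min-max normalization are assumptions chosen for clarity, not a description of any specific vendor’s scoring model.

```python
# Hypothetical weights; in practice they would be tuned or learned from past trials.
WEIGHTS = {
    "procedures_per_year": 0.4,
    "investigator_citations": 0.2,
    "past_enrollment_rate": 0.4,
}


def min_max(values):
    """Scale values to the range [0, 1]; return zeros if all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rank_sites(sites, weights=WEIGHTS):
    """Return (site, score) pairs sorted from best to worst weighted score."""
    names = list(sites)
    scores = {name: 0.0 for name in names}
    for feature, weight in weights.items():
        # Normalize each feature across sites so that different units are comparable.
        normalized = min_max([sites[name][feature] for name in names])
        for name, value in zip(names, normalized):
            scores[name] += weight * value
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because each feature is normalized across the candidate sites, the resulting scores are relative: they rank the sites in a given shortlist but should not be compared between different shortlists.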

Addressing the fundamental challenge with NLP

At present, far too much of the data contained in medical records, health documents, questionnaires, publications, articles, and other documents, which could be used to improve clinical trials, remains untouched. However, when sorted, labeled, cleaned, and analyzed, this data can be used to gain trailblazing insights.

NLP techniques that can be applied to unstructured medical data include named entity recognition and topic clustering. These techniques help to identify the needed entities and automatically sort texts into predefined categories, which in turn can mean tremendous time savings for researchers.

The combination of NLP and social graphs helps to raise the success rate of clinical trials by addressing the fundamental challenge of investigator and patient recruitment. NLP techniques leverage the power of unstructured data to quickly match CROs with resourceful doctors and eligible patients. By utilizing the relationships between data items, the investigators who have researched a specific topic can be quickly identified. A similar technique can be used to identify top-tier trial sites. To take full advantage of NLP, social graph, and impact factor algorithms, clinical organizations can enlist the help of proficient product development outsourcers.

Igor Kruglyak is a Senior Advisor at the global IT service provider Avenga. Michael DePalma is the Founder and President of Pensare, LLC; Co-Founder of


© 2024 MJH Life Sciences

All rights reserved.