Mining for Words: The Case for NLP

April 1, 2011
Wayne Kubick

Applied Clinical Trials

Applied Clinical Trials, Volume 20, Issue 4

Natural language processing could change the way we interpret documents and data.

In the world of clinical trials, we spend a great deal of time designing case report forms (CRFs) to record the specific data points we want to collect for a protocol. It's important to design CRFs with the goal of populating a database that will need to be analyzed, so CRF designers try to collect categorical data values through value lists, Boolean responses, or numeric measurements as much as possible. As a general rule we eschew free-form text responses to questions, because these can't be automatically analyzed: you need a human being with some domain experience to read, interpret, and decide what to do with such responses. In cases when we do allow free text, as in the verbatim description of an adverse event (AE), we subsequently code this text with dictionary-derived standard values that can be systematically analyzed in the database.
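To make the coding step concrete, here is a minimal Python sketch of looking up a verbatim AE description in a coding dictionary. The dictionary entries and the function names (`normalize`, `code_verbatim`) are invented for illustration; a real study would rely on a licensed medical dictionary and a far more forgiving matching process, with human review of anything left uncoded.

```python
import re
from typing import Optional

# Hypothetical mini-dictionary mapping verbatim phrases to standard terms.
# These entries are illustrative only, not taken from any real dictionary.
CODING_DICTIONARY = {
    "headache": "Headache",
    "head ache": "Headache",
    "pounding head": "Headache",
    "nausea": "Nausea",
    "felt sick to stomach": "Nausea",
    "threw up": "Vomiting",
}


def normalize(verbatim: str) -> str:
    """Lowercase the text, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", verbatim.lower())
    return " ".join(cleaned.split())


def code_verbatim(verbatim: str) -> Optional[str]:
    """Return the standard term for a verbatim entry, or None if it is uncoded."""
    return CODING_DICTIONARY.get(normalize(verbatim))


if __name__ == "__main__":
    for text in ["Head-ache!", "Threw up", "dizzy spells"]:
        print(text, "->", code_verbatim(text))
```

Anything that comes back as `None` is exactly the kind of entry that still needs a person with domain experience to read and interpret it.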

Now in the paper CRF world, it's possible for an investigator to scribble additional free text anywhere on a page, and many FDA reviewers would scour CRFs for evidence of notes hidden in the margins that might contain insight, or even infer an adverse event that was never directly described on the AE page. EDC limits this ability to ad lib, but most systems still allow the use of comments so an investigator can record information that doesn't quite fit into the pre-defined slots on the electronic form. Since a comment can contain literally anything, these still need to be closely reviewed, just in case they involve a significant finding or inference. FDA reviewers have also expressed their intense interest in gaining access to other possible safety-relevant information that may be lurking in electronic health records (EHRs) but never made it to the CRF, though we still have many technological, legal, and political barriers inhibiting that. Much of the data in EHRs is currently recorded as free-form text in the form of notes jotted down by a physician (sometimes on scanned paper), not as structured data that would easily fit into a relational database for systematic analysis.

There's quite a lot of free-text information floating around these days: IDC has estimated that more than 1,000 exabytes of digital data are being produced around the globe each year. (For those of you who care about such things, 1,000 exabytes = 1 zettabyte = 10^21 bytes, or far more than myriad human minds can ever imagine, much less read.) Much of this is health related (an estimated 10 billion patient records have already been created), and some of that might even be crucial to health decisions, if only it were available when needed.

In the particular case of pharmacovigilance, currently only an estimated 10 percent of all drug-related adverse events are actively reported as cases to manufacturers or regulators. What about the others? Are there possibly patient notes in health records that never get transcribed into a MedWatch form? Are patients discussing their side effects in public web forums, blogs, specialty sites, or even tweets? How would we know? And if we did know, should manufacturers have to collect, report, and process these like other case reports? If so, how will they keep up?

Michael Ibara, a fellow evangelist for integrating EHRs with clinical research and safety, has predicted that the cost of acquiring digital safety information will drop significantly as the industry catches up with technology, following the evolutionary trend of information from atoms into bits that Nicholas Negroponte described in Being Digital. Moreover, as cost drops, volume typically increases. It's been said that FDA is already anticipating that it will have to begin monitoring such websites for potential safety information in the future. Yet the agency is already stretched, processing over one million AE reports each year. How would it cope if millions more suddenly began to appear?



Natural language processing

Which is why the topic of Natural Language Processing (NLP) comes to mind. NLP is loosely defined as computer understanding, analysis, manipulation, or even generation of natural language communication. Many of us saw IBM's Watson star on Jeopardy recently, when he (or at least his NLP male voice) adequately bested two breathing humans by understanding, interpreting, and responding to the Jeopardy answers with dazzling speed. NLP is HAL, or your cell phone's voice command feature, and a major piece of search engines like Google, which often have an uncanny ability to understand what you really meant, even if that wasn't exactly what you typed.

There are many places where we could better apply NLP in life sciences research, helping us master hoards of scanned paper or electronic free text and extract specific useful information, such as adverse events, symptoms, diagnoses, and drugs taken in observational data, or medical history, which is generally unstructured even in clinical trials.

Fortunately, there has been a great deal of academic research on how to do this. Even with traditional electronic case reports, crucial information may be lurking in the narrative that is not correctly represented in the structured data elements of the case report. It would be helpful to use NLP to extract and compare such information to help validate the consistency and completeness of a report.
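As a rough illustration of that extract-and-compare idea, the Python sketch below flags event terms that appear in a case narrative but are absent from the structured event list. The lexicon, the function names, and the sample report are all assumptions made up for this example. Simple keyword matching stands in for real NLP here: it knows nothing about negation ("denies nausea"), synonyms, or misspellings, all of which a production tool would have to handle.

```python
import re
from typing import Iterable, Set

# Hypothetical lexicon of event terms to look for; illustrative only.
EVENT_LEXICON = {"rash", "nausea", "headache", "dizziness"}


def events_in_narrative(narrative: str) -> Set[str]:
    """Return lexicon terms that occur as whole words in the narrative."""
    tokens = set(re.findall(r"[a-z]+", narrative.lower()))
    return EVENT_LEXICON & tokens


def unreported_events(narrative: str, structured_events: Iterable[str]) -> Set[str]:
    """Return terms mentioned in the narrative but missing from the structured fields."""
    coded = {event.strip().lower() for event in structured_events}
    return events_in_narrative(narrative) - coded


if __name__ == "__main__":
    narrative = "Patient reported mild nausea and a rash on day 3."
    structured = ["Nausea"]
    # Terms found only in the narrative are candidates for human review.
    print(sorted(unreported_events(narrative, structured)))
```

The output is a list of candidates for review, not a verdict: the point is to direct a human reader to reports where narrative and structure may disagree.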

NLP research

I recently attended a lecture at FDA by Professor Carol Friedman of Columbia University, whose research uses NLP to review published literature, clinical study reports, and EHR text to identify possible contraindications, symptoms, diagnoses, and adverse events to support safety surveillance.

A separate lecture by University of Illinois Professor Catherine Blake described a method using semantic web-like triplet structures to perform meta-analysis by extracting product claims from published literature, including the proper handling of modifiers (so as to distinguish, for example, claims from studies performed on rats vs. humans).
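For readers unfamiliar with triplet structures, the following Python sketch shows one way a claim could be stored as a subject-predicate-object triplet with attached modifiers, so that claims about rats are not pooled with claims about humans. This is not Professor Blake's method; the `Claim` class, the `group_claims` function, the "population" modifier, and the sample claims are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Claim:
    """A subject-predicate-object triplet plus free-form modifiers."""
    subject: str
    predicate: str
    obj: str
    modifiers: Dict[str, str] = field(default_factory=dict)


def group_claims(claims: List[Claim], modifier: str) -> Dict[str, List[Claim]]:
    """Group claims by the value of one modifier, e.g. the study population."""
    groups: Dict[str, List[Claim]] = defaultdict(list)
    for claim in claims:
        groups[claim.modifiers.get(modifier, "unspecified")].append(claim)
    return dict(groups)


if __name__ == "__main__":
    # Invented example claims; "drug X" is a placeholder, not a real product.
    claims = [
        Claim("drug X", "reduces", "tumor size", {"population": "rats"}),
        Claim("drug X", "reduces", "tumor size", {"population": "humans"}),
        Claim("drug X", "increases", "liver enzymes", {"population": "rats"}),
    ]
    for population, group in group_claims(claims, "population").items():
        print(population, len(group))
```

Keeping the modifier alongside the triplet is what lets a meta-analysis compare like with like instead of treating every extracted claim as equivalent.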

Another example was a paper presented by Martijn J. Schuemie at last year's ISPE conference titled "Automated Classification of Free Text Electronic Health Records for Epidemiological Studies," an EU-ADR funded project that used text data mining of narratives to extract diagnoses and adverse events from medical records.

Some of my CDISC friends have been involved in the Strategic Health IT Advanced Research Projects Area program (SHARPn, which is funded by ONC) to develop tools to influence and extend secondary uses of healthcare data. Among other things, SHARPn has created an NLP annotation tool called cTAKES. cTAKES is already adept at identifying drug utilization information in text and is currently being evaluated to discern smoking status and side effects. The NLP tools used by some of the researchers I've mentioned include METIS, Anni 2.0, and MedLEE, among many others.

Commercial solutions are also becoming available. Along with the usual big players are intriguing new companies like First Life Research, which has developed technology to understand patient language and transform it into actionable content.

Other uses for NLP

There are many other relevant life sciences applications for NLP beyond safety. For example, while we are finally getting closer to creating a standard for structured protocols, our past and current clinical trial protocols are simply free-text documents. NLP could be used to extract structured data elements from published protocol documents, to support meta-analysis and provide context for the clinical data that could be further explored retrospectively (as with the FDA's current comparative effectiveness project using a new clinical data repository). Stanford University, whose NLP group has been a leading research group for years, developed its protocol disambiguator tool to look for inconsistencies between narratives and structure using NLP.

Of course, interpreting the meaning of language by computer is still somewhat of an art as well as a science; see, for instance, the amusing but unprintable website devoted to archiving classic missteps of the iPhone autocorrect text feature. Computers can't always tell when they interpret something that doesn't make sense, or might even cause offense. But the research has shown accuracy of up to 90 percent in some cases, possibly as good as or better than tired, overworked humans. Despite a couple of embarrassing goofs, Watson comported himself pretty well on Jeopardy too.

NLP is not a new concept. Indeed, it dates back to Alan Turing and his famous test in 1950. There have been many funded research efforts in the decades since. But there's a time to move beyond research and begin to incorporate such science into applied use in the real world. In life sciences, we have many cases where we could employ NLP to help us interpret mountains of documents and extract structured information to increase our range of knowledge. It's about time we did so.


Wayne R. Kubick is Senior Director, Life Sciences Product Strategy at Oracle Health Sciences. He can be reached at
