Digital Data, the Semantic Web, and Research

August 1, 2011

Applied Clinical Trials

Volume 20, Issue 8

The positives of "going digital" are becoming more and more apparent for clinical research.

My newspaper didn't arrive this morning, so I had to view it on my iPad. As a result, I read it much more quickly, since it was easier to focus only on what I found truly interesting. The print version has been wasting away over the past year anyway, a mere shadow of its former self, denuded of most of the ads that used to comprise much of the printed bulk. This week's printed Time Magazine was a tad more than a pamphlet, though it appears much more substantial in the iPad app, which doesn't readily betray how thin the printed content would be.

Wayne R. Kubick

We've been paying our bills online for some time now, though a number of our creditors still insist on sending us a paper invoice. I can't remember the last time I saw someone use a leather-bound day-timer either—unless it included a pocket for their smartphone. Then there are those bulky yellow pages books—which we threw right into the recycling last time since it's so much easier to look up phone numbers on Google.

My employer has an automated app for just about every administrative business process from training, to performance reviews, to expense reports. In fact, expense reports are pretty much the only things that still involve paper—for receipts, which I need to scan. I much prefer getting e-receipts sent right to my e-mail address, since they save me part of that effort.

Then there's my toxic wasteland of a desk. Whenever I search for anything on my computer, odds are I'll find what I want. But I can't tell you how many scraps of paper on my desk I've lost. I have a friend whose desk is always clean. Every time he gets a scrap of paper, he either moves it to disk or just chucks it. I've never searched through his computer, but his office is as pristine as an operating room table.

Now, let's segue to clinical research. Thanks to HITECH and other global initiatives, more and more physicians are implementing electronic health record systems, and entering patient data as well as prescriptions directly into the system. Lab tests are conducted by devices, with reports output electronically. Regulatory-relevant documents are almost always created on a computer, and submitted to regulators in PDF and XML electronic formats. Yet CRF data is still typically first recorded on some sort of paper source document before being entered into a system. So why exactly do we still feel a need to use so much paper in clinical research?

Paper's more secure, some say. Well, except in the rubble of an unexpected disaster, or when you just lose it. And what about those disturbingly confidential patient lab reports that are somehow mistakenly sent to my home fax several times a week?

Paper's better for audits, others feel. Yet how difficult is it to edit a document and print out as many variations of it as you need to address various tiny issues? Even before the days of image enhancement tools, a touch of Wite-out and a copy machine were able to do wonders to tweak a paper document.

Paper's necessary to meet regulatory requirements, many persist in believing. But 21 CFR Part 11 is over a decade old, and the ICH E6 Guideline for Good Clinical Practice doesn't include a single reference to paper, though there are some 40 references to "records," which might be electronic or not.

So do source data and regulated information really have to be on paper? Or do they just have to meet some of the same requirements as paper documents in a credible fashion? In other words, can we reproduce the digital version of such information on demand for review, and can we verify who authored it, when, and that it wasn't changed, all without a paper crutch?
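One way to picture the "who, when, and unchanged" test is a hash-chained audit log. The sketch below is only an illustration using Python's standard library: the function names, field names, and sample entries are hypothetical, and it is not a 21 CFR Part 11 implementation. In particular, a hash chain only makes later edits detectable; proving authorship would additionally require controlled logins or digital signatures.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(body: Dict[str, str]) -> str:
    """SHA-256 digest of an entry, serialized with sorted keys so it is stable."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_entry(log: List[Dict[str, str]], author: str,
                 timestamp: str, content: str) -> None:
    """Add an entry that records author, time, content, and the previous hash."""
    prev_hash = log[-1]["hash"] if log else ""
    body = {
        "author": author,
        "timestamp": timestamp,
        "content": content,
        "prev_hash": prev_hash,
    }
    log.append({**body, "hash": fingerprint(body)})


def verify(log: List[Dict[str, str]]) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry fails."""
    prev_hash = ""
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev_hash or fingerprint(body) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    # Hypothetical source-data entries, for illustration only.
    log: List[Dict[str, str]] = []
    append_entry(log, "investigator_a", "2011-08-01T09:00:00Z", "SBP 120 mmHg")
    append_entry(log, "investigator_a", "2011-08-01T09:05:00Z", "SBP corrected to 122 mmHg")
    print(verify(log))               # the untouched log checks out
    log[0]["content"] = "SBP 110 mmHg"
    print(verify(log))               # the silent edit is detected
```

Because each entry carries the hash of the one before it, a correction has to be appended as a new entry rather than written over the old one, which is the electronic counterpart of a single-line strikethrough on paper.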

To find paper, we need to index and file it. But a physical piece of paper can only be filed in one place at a time, whereas a digital piece of information can be linked in many ways to many different catalogs, or even concepts. Digital information can also be broken down to items, elements, and things that can be linked with many other things.

From digital info to semantic web

Which brings us to the semantic web and its role in clinical research. The semantic web is already all around us: the number of available data sources in the semantic web has been doubling every 10 months since 2007. We just don't always notice it when we reach it through a web browser. The World Wide Web is a way to link and access documents (which we characterize as containing unstructured data). That is fine for many purposes, but not usually sufficient for working in depth with structured data. Relational databases are one way to work with structured data, but they don't handle the unstructured stuff so well. They also tend to be confined in organizational silos and esoteric in design unless they are represented in common formats, organized around common models, and populated with common vocabularies that are consistently understood.

The semantic web, which is sometimes referred to as Web 3.0 or the Meaningful Web, is a way to deal with this. It makes it possible for both computers and people to find things or objects on the web and know what they mean. In some ways it's reminiscent of artificial intelligence: once objects can describe themselves with rich context and relationships, it's possible that computers will begin to develop understanding heretofore limited to the human brain. As such, it's one practical path toward what Ray Kurzweil describes as the Singularity.

The fundamentals of the semantic web seem simple enough: data objects are represented as triples in subject-predicate-object form, and the elements of each triple are tied to uniform resource identifiers (URIs), which can be looked up on the web and linked (hence becoming linked data or hyperdata). These triples are often described as graph objects, much like a diagrammed sentence. A particularly useful link is the sameAs link (owl:sameAs in the Web Ontology Language), which allows equivalent concepts or objects to be connected. But it quickly gets more complicated after that, and far beyond this author's level of understanding. There are plenty of resources on the web to consult for the details. One well-known example of the semantic web in action is the DBpedia project, a community effort to extract information from Wikipedia and make it available through the semantic web. DBpedia illustrates how you can access the collective knowledge of the web about a particular object of interest instead of relying on an individual database.
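To make the idea of triples and sameAs links a little more concrete, here is a minimal sketch in plain Python. It is not how DBpedia or any real triple store is built; production systems use RDF libraries and SPARQL endpoints. The two W3C property URIs are real, but every example.org URI, the sample statements, and the helper names are invented for illustration.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Real W3C property URIs.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical URIs and statements, for illustration only.
SPONSOR_DRUG = "http://example.org/sponsor/drug/123"
REGISTRY_DRUG = "http://example.org/registry/drug/x"
STUDIED_IN = "http://example.org/vocab/studiedIn"

TRIPLES: List[Triple] = [
    (SPONSOR_DRUG, RDFS_LABEL, "Study Drug X"),
    (SPONSOR_DRUG, OWL_SAME_AS, REGISTRY_DRUG),
    (REGISTRY_DRUG, STUDIED_IN, "http://example.org/registry/trial-001"),
]


def match(triples: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def equivalents(triples: List[Triple], uri: str) -> Set[str]:
    """Collect uri plus everything connected to it by sameAs, in either direction."""
    seen = {uri}
    frontier = [uri]
    while frontier:
        current = frontier.pop()
        for s, _, o in match(triples, p=OWL_SAME_AS):
            for a, b in ((s, o), (o, s)):
                if a == current and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


def describe(triples: List[Triple], uri: str) -> List[Triple]:
    """All statements about uri or any of its sameAs equivalents."""
    ids = equivalents(triples, uri)
    return [t for t in triples if t[0] in ids]


if __name__ == "__main__":
    for triple in describe(TRIPLES, SPONSOR_DRUG):
        print(triple)
```

Asking about the sponsor's identifier also returns the statement recorded under the registry's identifier, because the sameAs link declares the two to be the same thing. That is the sense in which linked data lets a query draw on more than one database.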

But maybe we don't need to worry about the details of the semantic web ourselves—we should first concentrate on the information that belongs there.

It's the content, stupid

Now the semantic web may not be the best solution for everything, but it's ideally suited for persistent data, and especially metadata, that are likely to be reused or referenced by the broader research community as an ongoing source of knowledge via the web. In the world of clinical research, this would apply to research and clinical concepts that are translated into data elements in protocols or on CRFs. Ideally every instance of every trial would access a single, gold standard repository to describe each aspect and procedure of a protocol, formulate each question on a CRF, and represent each data item in the same way, so that it would finally be possible to combine, compare, and study the vast wealth of information we collect in a consistent and scientifically meaningful way. Representing metadata and vocabularies this way will matter even more once we as an industry begin to embrace cloud computing, with the promise of all information stores being accessible over the net (to properly authorized users, of course).

We've used standard vocabularies to describe some of the concepts used in healthcare such as general clinical terminology (SNOMED), adverse events (MedDRA, and before that COSTART), diagnoses and procedures (ICD-9, CPT), and many others. Unfortunately, different vocabularies are used for different purposes by different organizations, and it's not always easy to translate one into another. What's more, the complex information that makes up healthcare can't be fully expressed in a manner that allows for an unambiguous computer understanding. There's a lot of hard work ahead to identify and define concepts, organize and map across different terminologies and ontologies, and specify data elements and values.
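The translation problem can be pictured with a toy crosswalk between two vocabularies. This is a sketch under stated assumptions: the "A:" and "B:" codes are made-up placeholders, not entries from SNOMED, MedDRA, ICD-9, or any real terminology, and the `translate` helper is hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical crosswalk: "A:" and "B:" stand in for two real terminologies.
CROSSWALK: Dict[str, List[str]] = {
    "A:term-1": ["B:1001"],            # one clean equivalent
    "A:term-2": ["B:2001", "B:2002"],  # source term is broader than any one target
}


def translate(code: str) -> Tuple[str, List[str]]:
    """Return a mapping status and the candidate target codes."""
    targets = CROSSWALK.get(code, [])
    if not targets:
        return ("unmapped", [])
    if len(targets) == 1:
        return ("exact", targets)
    # More than one candidate: a person has to choose, using clinical context.
    return ("needs_review", targets)


if __name__ == "__main__":
    for code in ("A:term-1", "A:term-2", "A:term-3"):
        print(code, translate(code))
```

Only the first case can be handled by a computer alone. The other two are where the hard work of defining concepts and mapping across terminologies comes in.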

Now the medical informatics community has been discussing these matters for years, and many initiatives are in progress to help define medical and clinical concepts so that computers can understand them. These include the ONC SHARP Area 4 program, CDISC SHARE, the HL7 Fresh Look initiative, the ISO/CEN Detailed Clinical Models (DCM), the European Innovative Medicines Initiative knowledge management programs, and others. Of course, it would be helpful if all of these activities converged to form a single international gold standard of concepts that could provide services to all in a semantic web infrastructure, but this might be expecting too much all at once.

We can all contribute to becoming digital by vowing to forgo paper, treating electronic information with control, and using available standards, vocabularies, and metadata before inventing our own. And we should begin to explore the potential of the semantic web for supporting the kind of universal knowledge repository that we'll all need eventually.

Wayne R. Kubick is Senior Director, Life Sciences Product Strategy at Oracle Health Sciences. He can be reached at
