We collect too much data, burdening investigators, monitors, and data managers.
Spring arrived early this year, and then went away just as quickly. But while the temperatures re-chilled in April, the trees had already begun to leaf. Along the edge of my property lies a stretch of woods, which has formed a definite border for our yard and a convenient place to dump yard waste and woody debris from past winters. Yet in that early spring, as we enjoyed a few weeks in the unseasonably warm sun, we noticed that jutting up behind the ring of emerging, spindly buckthorn that dominated the woods were oaks and maples—trees we'd barely noticed in past years.
So it was time to clear the buckthorn. European buckthorn (Rhamnus cathartica) is an invasive, woody plant that will dominate any north-central landscape if left unchecked. A single plant can sometimes be trained into an interesting specimen with careful pruning, but you can't train it to remain in solitude. One plant soon becomes a grove, which soon becomes the woods, which provide shielding from wind and sight but are suitable only as a screen to dump behind. And even that soon becomes a challenge, as the buckthorn will form a forest of thorns that even Prince Phillip's sword would find formidable to penetrate.
So we engaged a local contractor to clear-cut a substantial section of the woods. This exposed the trees, which were indeed mighty but, as it turned out, a bit emaciated—having also been cramped by the pushy buckthorn despite their greater girth. The clear-cutting was only a temporary solution. We could reclaim more land for the moment, but hundreds of hidden roots and threatening shoots remained, and, if left to their own devices for another season or two, a new wall of buckthorn would appear all over again. You see, you can't just cut it off—you have to painstakingly dig out the roots, and then watch for and eradicate new shoots over and over again.
We have to clear the buckthorn in our business lives as well. I recently changed jobs and, rather than just copy my old files from one PC to another, I decided to make it an opportunity to switch to a Mac and start all over. But this wasn't so easy, because buried in the many gigabytes of past documents and e-mails were things that I occasionally needed to refer to again. It wasn't as easy as simply digging out the roots—I had to keep a small side garden for these old files, just in case.
Alas, there's a lot of buckthorn in our clinical data.
We collect too much data, and we make too little use of what we collect. Each data item we collect places more burden on investigators, monitors, and data managers, since our quality standards require us to ensure that case report form (CRF) data are accurate and complete. And cluttering up the essential data creates noise that obscures more effective data reuse downstream.
As with buckthorn, the only way to avoid infestation is to keep the weeds out in the first place. In research, this can be accomplished in the protocol, which specifies the data to be collected at a high level. But when the high-level data collection plan in the protocol is translated into CRFs and a data collection system, some of the weeds—the legacy of past, similar studies—manage to sneak their way back in.
In fact, the problem begins even further upstream. In the clinical development plan, a decision is made to develop a drug as a treatment for a particular indication. But there is already a body of published knowledge about the indication and about what data or biomarkers are deemed important or necessary to collect during clinical studies testing treatments for it. This knowledge has been accumulated from the prior work of researchers in the scientific and academic community, as well as in the pharmaceutical and biotechnology industry. Typically, it is expressed as common data elements, or variables collected on CRFs. Some pharmaceutical companies and research institutions have tens of thousands of such elements, many of them more or less equivalent to one another, though perhaps with slight variations in meaning, labels, format, or attributes. Past experience tends to result in an accumulation of clinical concepts, since it is usually easier to create a new concept than to figure out which existing one to use when none seems quite right for the purpose at hand—and this assumes it's easy enough to find the ones already defined and to understand them sufficiently to recognize the equivalence. Meanwhile, these concepts get translated onto CRFs, even when they are not necessary or directly pertinent to the protocol at hand. We need to be wary of the weeds.
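To make the proliferation concrete, here is a small illustrative sketch (the element names, labels, and heuristic are hypothetical, not drawn from any actual company library): two data elements that describe the same measurement but were created independently in different legacy studies, along with a naive check for likely equivalence.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataElement:
    """A simplified, hypothetical CRF data element definition."""
    name: str      # variable name used on the CRF or in the database
    label: str     # human-readable label shown to site staff
    datatype: str  # e.g. "float", "text"
    units: str     # units of measure, if any

# Two elements that capture the same measurement, accumulated
# independently in two legacy study libraries (illustrative only).
sysbp_a = DataElement("SYSBP", "Systolic Blood Pressure", "float", "mmHg")
sysbp_b = DataElement("BPSYS", "Systolic BP (mm Hg)", "float", "mmHg")

def likely_equivalent(a: DataElement, b: DataElement) -> bool:
    """Naive heuristic: same datatype and units, overlapping label words.

    A real repository would need curated mappings, not string matching;
    this only shows why near-duplicates are hard to spot automatically.
    """
    words_a = set(a.label.lower().replace("(", " ").replace(")", " ").split())
    words_b = set(b.label.lower().replace("(", " ").replace(")", " ").split())
    return a.datatype == b.datatype and a.units == b.units and bool(words_a & words_b)
```

Even this toy heuristic shows the difficulty: the two definitions differ in name and label yet mean the same thing, and nothing in the data itself declares that equivalence.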
And often the same medical concepts get expressed differently to different audiences, which adds to the confusion. Some medical researchers speak the languages of healthcare service delivery, encoded in systems such as SNOMED-CT and ICD-9, while industry researchers tend to think in terms of research dictionaries like MedDRA. So the same medical concept may be expressed in different vocabularies—which multiplies the clutter, since the differing structures of the coding systems make it hard to translate one into another.
And science marches on, creating yet more variation even when describing the same essential things. This is the nature of the scientific method—we conduct experiments, learn from the experience, and then apply that learning to the next study by creating a new protocol with a different design and modified data points we want to examine. But are the medical observations made during routine patient care really the same as measurements taken on a precise schedule according to a research protocol? They may be called the same thing in some cases (though often not), but the parameters, conditions, and constraints specified may vary with each new trial. Yet there's value in knowing when two observations are the same, when they're similar except for certain attributes, and when they're entirely different.
Identifying and understanding the essential core set of data elements that comprise research in a specific disease or therapeutic area is not an easy task. Clinical experts tend to have strong opinions about what's important, and imprecise language can be deceptive. The process of identifying this core set requires extensive interaction among a broad user community, along with visual tools such as mind maps and domain analysis models that describe and express the concepts in a way the broader scientific community can understand—and buy into.
If we could get a better handle on the essential concepts we need for research in each therapeutic area, we might also begin to learn useful metrics: which data elements are most often used for which types of studies, on which kinds of drugs, for the exploration of which medical conditions. Which ones are most typically used as primary or secondary endpoints? How often are they used? What types of information are needed for a complete safety profile for certain indications? Which are not particularly useful? In other words, which of those green shoots springing up from the soil are more likely to be cherished annuals and perennials, and which are the weeds we want to pull out? These are some of the things we should expect from the learning healthcare system that the Institute of Medicine has envisioned.
Such a problem should be addressed on a global basis, through a shared knowledge base. All of our scientific research concepts and data elements should be stored in a comprehensive metadata repository where they can be accessed and applied to new studies by the entire global research community. Such a repository would identify essential, known clinical data elements; explain what they mean and how and where they are used; bind them to common value lists of controlled terminology; describe how they relate to various data standards and information models; express relationships on how elements fit together on a CRF, a data file, or other packages for use in research; and identify equivalent and similar concepts used elsewhere in healthcare as well as research. The repository could even be used to facilitate mapping data to standard formats, though that is less desirable than defining the correct elements up front.
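As a rough sketch of what one repository entry might hold—field names, placeholder codes, and the lookup helper are all illustrative assumptions, not an actual repository schema—the definition, terminology binding, standards mappings, and healthcare cross-references could travel together:

```python
from typing import Optional

# A hypothetical repository entry for a single clinical data element.
# Real entries would carry actual standard identifiers; the "..." values
# below are deliberate placeholders, not real codes.
element = {
    "name": "SYSBP",
    "definition": "Systolic blood pressure of the subject",
    "datatype": "float",
    "units": "mmHg",
    "value_list": None,  # numeric result, so no controlled code list
    "standards": {"CDASH": "...", "SDTM": "..."},  # bindings to data standards
    "healthcare_equivalents": {"LOINC": "...", "SNOMED-CT": "..."},
    "used_in": ["vital signs CRF"],  # where the element appears in study packages
}

def find_equivalent(elem: dict, vocabulary: str) -> Optional[str]:
    """Look up the equivalent concept in another coding system, if one is known."""
    return elem["healthcare_equivalents"].get(vocabulary)
```

The point of the sketch is the bundling: when the meaning, the terminology binding, and the cross-vocabulary links live in one governed record, reuse and traceability become lookups rather than archaeology.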
Of course, it may be too utopian a vision to accomplish all that right now, but why shouldn't we know at least as much about the fundamental data elements and concepts essential to conduct high quality research as we do about, say, shopping for groceries? Shouldn't healthcare and research be at least as much of a learning system as the retail marketplace?
We need to identify the core clinical data elements that are pertinent to each therapeutic area, as well as the most important elements necessary to evaluate safety. We need to trace many of these back to the same concepts used between doctor and patient at the point of care. We need to understand how these all relate to one another—how they are discussed in the physician-patient meeting; encoded in healthcare systems for orders and billing; collected on CRFs; represented in databases; and analyzed for clinical study reports and, later, aggregate analyses. And we need rich, comprehensive metadata that ensures we understand the nuances of these concepts and can trace that flow effectively from end to end. We need to concentrate on these specimens for our garden of knowledge—the ones that we really want to see. Maybe then we can finally start to understand the wonders that are presently obscured among all the buckthorn.
Wayne R. Kubick is Chief Technology Officer for the Clinical Data Interchange Standards Consortium (CDISC). He resides near Chicago, IL, and can be reached at firstname.lastname@example.org.