Big Data, Information and Meaning


February 1, 2012

By Wayne Kubick

Applied Clinical Trials

Volume 21

Issue 2

Vendors have developed systems for massive databases, but are we data ready?

I often wonder how people with jobs like mine find time to read books anymore. By the time I've gotten through a day of meetings, e-mail, and business reading, there's not much time to absorb anything other than a newspaper or a favorite periodical. My solution has been to listen to audio books during my early morning workouts. I usually confine myself to non-fiction business, science, and history books to help maintain a semblance of self-improvement, even though listening to a book does not quite achieve the same level of understanding as reading by eye.

Once in a while, I pick up a book that is so compelling or complex that I find myself having to go back and purchase a print edition afterwards so I can read it the traditional way. This past holiday, such a book was "The Information: A History, a Theory, a Flood" by James Gleick. Gleick captivated me as he traced the history of communication, writing, messaging, cryptography and, ultimately, computing from ancient times through pioneers like Charles Babbage, Claude Shannon, and Alan Turing. I was surprised to learn that the term "information" did not really exist prior to the middle of the 20th century (when people probably had the time to not only read real books, but also write letters and converse at leisure), and to see how critical information theory was in leading the way to computers, the Internet, and today's wireless computing world.

As Gleick portrayed the transition from the spoken to written word, and from paper to bits, I found myself wondering how we're progressing as we handle information in the world of clinical research and development.

One new trend to consider in this light is "big data," which refers to databases, measured in terabytes and above, that are too large and complex to be used effectively on conventional systems. Big data has attracted big vendors who have developed powerful new systems that combine massively parallel hardware and software to quickly process and retrieve information from such immense databases. In our world, big data solutions have been mostly employed to date in bench-research applications. In such cases, scientists have already gained experience in how to represent molecular, genomic, proteomic, and other complex, voluminous data types well enough so that they can benefit directly from the speed of processing and retrieval of big data appliances.

But it seems that such systems would also be very useful for examining large observational healthcare databases of millions of patients to try to identify and explore safety signals. Yet it is extremely challenging to meaningfully merge and combine such data into a single research database, because the content, context, and structure of data from different sources are so heterogeneous. Continual movement toward electronic healthcare records, together with advancements in standards and systems, may get us closer eventually, but the path will likely continue to be long and tortuous. And the current business model of either tapping into data providers' systems one at a time or downloading local copies of each data source, coupled with the risks of maintaining privacy and compliance, compounds the problem.

Projects like OMOP, Sentinel, and EU-ADR are raising interest in exploring healthcare data, and available data sources and tools are improving all the time. For example, the UK government has recently announced its intention to make National Health Service data available to the R&D community. Yet while projects like OMOP reflect cross-industry, cooperative efforts to better understand methods and develop open-source tools, sponsors are still locked into creating their own local copy of a research database by contracting with individual data providers one at a time and building their own data repository.

It would be much more logical and efficient if the industry could work together to make such data of common interest available to all—as a public research "big data commons in the cloud," which would eliminate the need for everyone to set up their own local environment populated by the same external data sources over and over again. Of course it would certainly be challenging to establish an effective cooperative legal/business model and infrastructure to serve the interests of many different stakeholders, including pharmaceutical manufacturers, researchers, regulators, and even healthcare providers and payers.

As daunting as this seems, it's even more of a stretch to create such a big data commons for non-clinical and clinical study data from products in development, which have traditionally been treated as highly proprietary intellectual property assets that sponsors guard closely. In this case the pooling of data is confounded not only by the lack of standardization in data structures and terminologies, but also by the need to understand differences in study context, which is recorded in the clinical study protocol, still typically an unstructured text document. So it's difficult to justify a big data scenario, because such data are neither readily available (although FDA continues to try to build its own clinical trials repository in the latest iteration of the Janus program) nor easy to pool. CDISC has helped here by providing standard representations for many commonly used data domains, but these standards did not originally extend to the specific data elements and efficacy outcomes associated with each individual therapeutic area.

This too is changing, as has been demonstrated by the Critical Path Institute (CPath), which has put together a research study database of Alzheimer's treatments in CDISC format contributed by many different sponsors. The model has two parts: core CDISC metadata standards, extended with disease-specific data elements to define a single way to represent study data in a given therapeutic area, and a database of actual study data made available to the research community. It is now being extended to many more diseases. CPath thus offers a prototype of a commons, though hardly a big data scenario yet.

Defined metadata standards for such projects describe the concepts, elements, terminologies, and relationships between them for research in a specific therapeutic area. In a sense, a structured protocol consists of metadata about a trial, some of which may be recorded in a registry like ClinicalTrials.gov. Such information, along with the more complex aspects of a protocol such as the treatment plan and the procedural and data collection times and events, must also be captured and bound to the actual data so that researchers can understand the nuances of each trial that might affect the interpretation of data pulled from a big data repository.

The concept of metadata can be extended to address data provenance as well. In clinical research, provenance involves the ability to trace the differing states and chain of custody as data moves from patient through investigator, CRO, sponsor, and regulator, that is, from origin through analysis. The NIH-sponsored National Children's Study (NCS), a large-scale, prospective observational study to examine factors that can affect the health of our nation's children as they develop into adulthood, is taking a particularly ambitious and innovative approach to recording provenance. This project recognizes that scientific concepts and data standards will evolve over time, so its metadata repository seeks not just to describe the metadata about study elements, but also to describe the transformation of concepts and research practices over time. It will be interesting to see how this project itself learns and evolves over its expected lifecycle of more than two decades, during which it will probably need to accommodate many innovative changes in technology.
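For readers who think in code, the chain-of-custody idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how NCS or any real system records provenance: the class names, the `hand_off` and `trace` methods, and the state labels are all hypothetical. The only thing taken from the text is the order of custodians.

```python
from dataclasses import dataclass, field
from typing import List

# Custodians in the order the article lists them, from origin to analysis.
CUSTODY_CHAIN = ["patient", "investigator", "CRO", "sponsor", "regulator"]


@dataclass
class ProvenanceEvent:
    custodian: str  # who held the data at this step
    state: str      # hypothetical label for the data's state, e.g. "recorded"


@dataclass
class DataPoint:
    name: str
    value: str
    history: List[ProvenanceEvent] = field(default_factory=list)

    def hand_off(self, custodian: str, state: str) -> None:
        """Record a change of custody; reject moves backwards along the chain."""
        if custodian not in CUSTODY_CHAIN:
            raise ValueError(f"unknown custodian: {custodian}")
        if self.history:
            last = CUSTODY_CHAIN.index(self.history[-1].custodian)
            if CUSTODY_CHAIN.index(custodian) < last:
                raise ValueError("custody cannot move backwards in the chain")
        self.history.append(ProvenanceEvent(custodian, state))

    def trace(self) -> List[str]:
        """Return the recorded chain of custody, oldest step first."""
        return [f"{event.custodian}: {event.state}" for event in self.history]


# Usage: follow one (invented) measurement from patient to sponsor.
reading = DataPoint("systolic_bp", "128")
reading.hand_off("patient", "measured")
reading.hand_off("investigator", "recorded")
reading.hand_off("sponsor", "cleaned")
print(reading.trace())
```

The point of the sketch is simply that provenance travels with the data: each value carries its own history, so a researcher pulling it from a repository can see every hand it passed through.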

Now, much of the information NCS (or any clinical trial) needs is the same information that would logically be present in observational data sources. Wouldn't it be exciting if our hypothetical data commons for observational healthcare data could be used side by side with our standardized repository of non-clinical and clinical trial data as well? To do this we need common semantics and common ways of modeling patient data between these two worlds. This reverie is unlikely to be achieved by any single system of global standards, so we must find a way to equate various clinical models and map terminologies to at least enable the possibility of interoperability. One great hope for achieving this vision is the Clinical Information Modeling Initiative (CIMI), a global cooperative effort being led by Stan Huff, MD, of Intermountain Healthcare. CIMI is working to define a common reference model and base set of terminologies for representing health information content that will enable health information to flow from one standard representation to another, both within the world of healthcare and onward to clinical research.
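To make the idea of mapping terminologies a little more concrete, here is a minimal Python sketch of pooling records from two sources into one shared vocabulary. It is an illustration only and does not represent CIMI's reference model or any real standard: the source names, local codes, concept names, and the `to_common` and `pool` functions are all invented for this example.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]

# Hypothetical local-code-to-common-concept maps, one per data source.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "ehr": {"A-101": "systolic_blood_pressure", "A-102": "heart_rate"},
    "trial": {"T-SBP": "systolic_blood_pressure", "T-HR": "heart_rate"},
}


def to_common(record: Record) -> Optional[Record]:
    """Translate one source record into the common vocabulary, or None if unmapped."""
    concept = MAPPINGS.get(record["source"], {}).get(record["code"])
    if concept is None:
        return None
    return {"concept": concept, "value": record["value"], "source": record["source"]}


def pool(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split records into those expressed in common terms and those left unmapped."""
    pooled: List[Record] = []
    unmapped: List[Record] = []
    for record in records:
        common = to_common(record)
        if common is None:
            unmapped.append(record)  # keep the original so nothing is silently lost
        else:
            pooled.append(common)
    return pooled, unmapped


# Usage: two sources describe the same concept with different local codes.
records = [
    {"source": "ehr", "code": "A-101", "value": "128"},
    {"source": "trial", "code": "T-SBP", "value": "131"},
    {"source": "ehr", "code": "A-999", "value": "7"},
]
pooled, unmapped = pool(records)
print(pooled)
print(unmapped)
```

Keeping the unmapped records visible, rather than dropping them, reflects the reality described above: mapping between models will be partial, and researchers need to know what could not be reconciled.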

We need big data, big metadata, and big ideas. To achieve this we need to bring together all available data in a way that expresses a common meaning, irrespective of its original form. Because in the end, knowledge comes from consistent information, and, as Gleick recognizes, it should not matter whether it is spoken, or written, or transmitted in many different forms—the meaning of the message must be one if we are to effectively learn from it.

Wayne R. Kubick is Chief Technology Officer for the Clinical Data Interchange Standards Consortium (CDISC). He resides near Chicago, IL, and can be reached at wkubick@cdisc.org.
