Big Data, Information and Meaning

Vendors have developed systems for massive databases, but are we data ready?
Feb 01, 2012
Volume 21, Issue 2

Wayne R. Kubick
I often wonder how people with jobs like mine find time to read books anymore. By the time I've gotten through a day of meetings, e-mail, and business reading, there's not much time to absorb anything else other than a newspaper or a favorite periodical. My solution has been to listen to audio books during my early morning workouts. I usually confine myself to non-fiction business, science, and history books to help maintain a semblance of self-improvement, even though listening to a book does not quite achieve the same level of understanding as reading by eye.

Once in a while, I pick up a book that is so compelling or complex that I find myself having to go back and purchase a print edition afterwards so I can read it the traditional way. This past holiday, such a book was "The Information: A History, a Theory, a Flood" by James Gleick. Gleick captivated me as he traced the history of communication, writing, messaging, cryptography and, ultimately, computing from ancient times through pioneers like Charles Babbage, Claude Shannon, and Alan Turing. It was surprising to me that the term "information" did not really exist prior to the middle of the 20th century—when people probably had the time to not only read real books, but also write letters and converse at leisure—and how critical information theory was in leading the way to computers, the Internet, and today's wireless computing world.

As Gleick portrayed the transition from the spoken to written word, and from paper to bits, I found myself wondering how we're progressing as we handle information in the world of clinical research and development.

One new trend to consider in this light is "big data," which refers to databases, measured in terabytes and above, that are too large and complex to be used effectively on conventional systems. Big data has attracted big vendors who have developed powerful new systems that combine massively parallel hardware and software to quickly process and retrieve information from such immense databases. In our world, big data solutions have been mostly employed to date in bench-research applications. In such cases, scientists have already gained experience in how to represent molecular, genomic, proteomic, and other complex, voluminous data types well enough so that they can benefit directly from the speed of processing and retrieval of big data appliances.

But it seems that such systems would also be very useful for examining large observational healthcare databases of millions of patients to try to identify and explore safety signals. Yet it is extremely challenging to meaningfully merge and combine such data into a single research database, because the content, context, and structure of such data from different sources are so heterogeneous. Continual movement toward electronic healthcare records, together with advancements in standards and systems, may get us closer eventually, but the path will likely continue to be long and tortuous. The current business model, in which sponsors either tap into each data provider's system one at a time or download local copies of each data source, compounds the problem, as do the risks of maintaining privacy and compliance.

Projects like OMOP, Sentinel, and EU-ADR are raising interest in exploring healthcare data, and available data sources and tools are improving all the time. For example, the UK government has recently announced its intention to make National Health Service data available to the R&D community. Yet while projects like OMOP reflect cross-industry, cooperative efforts to better understand methods and develop open source tools, sponsors are still locked into creating their own local copy of a research database by contracting with individual data providers one at a time and building their own data repositories.

It would be much more logical and efficient if the industry could work together to make such data of common interest available to all—as a public research "big data commons in the cloud," which would eliminate the need for everyone to set up their own local environment populated by the same external data sources over and over again. Of course, it would be challenging to establish an effective cooperative legal and business model and infrastructure to serve the interests of many different stakeholders, including pharmaceutical manufacturers, researchers, regulators, and even healthcare providers and payers.