The Power and Pitfalls of Aggregate Data


Applied Clinical Trials

Applied Clinical Trials-10-01-2011
Volume 20
Issue 10

Public to private switch gives CROs better chance to "fix" and transform themselves.

During a long getaway to northern Italy this summer, I was reminded again of modern life in this age of ubiquitous computing: staying connected didn't even merit an afterthought. As we drove between mountaintop villages where cell phone coverage was often spotty, the satellite GPS beacon was always pulsing to guide our path through circuitous, pitched slaloms of narrow country roads. Wi-Fi access greeted my tablet at every hotel (including the remote agriturismo that housed us in an Arcadian vineyard) and at almost every bar, restaurant, and café. While much of the old country seemed to be virtually unchanged from its likely appearance a century or more ago, these public meeting places previously reserved for social encounters or clandestine trysts now were filled with solitary intercourse between human and laptop, coffee and wine.

The only things I couldn't keep up with were those related to data stored on my left-behind laptop, a pretty convincing argument for me to move further into the cloud. Just give me that remote data access plus a tablet with a browser and I should get by pretty well.

Of course, my data was in the form of business documents. Had I needed to query clinical trials data, I would have quickly exceeded the capabilities of my tablet. But the analysis of the results of a clinical study is not something to be done on vacation anyway, as if the analyst would have been allowed to escape in the first place. Instead, the study team waits with bated breath to see what the data confesses once the blind has been broken. These results are reported, studied, debated, and inserted into the common technical document, and then it is on to the next study. Too often in research, the lifecycle of the knowledge gained by the study mostly ends there.

What a waste. While the industry today is investing many resources toward exploiting secondary uses of observational healthcare data, the knowledge we gain through the blood, sweat, and tears of clinical trials seems to remain mostly single-purpose: it supports a regulatory submission rather than serving as a persistent knowledge asset, combined with previous experience and used for multiple analytical and exploratory purposes by a larger research community. Combining the data into a pooled resource to support extended exploratory analysis certainly sounds like a good idea, but it's simply too hard.

Not that the concept of aggregating data from a pool of clinical trials is entirely alien, since sponsors have long had to prepare an Integrated Summary of Safety (ISS) database as part of their submissions. An ISS includes a subset of the safety information from multiple controlled clinical trials (including demographics, adverse events, laboratory results, and other relevant safety domains) using a common schema and coding dictionaries. This is typically created by the statistics organization, but might not be made available for other purposes to other researchers within a drug sponsor, much less those who are external.

Then what about the FDA, which receives a copy of these same databases? They still lack a common repository to provide easy access for reviewing such data, which thus generally limits the scope of their review to the submission at hand. The Janus data warehouse initiative of the past decade offered a taste of what might be done, but the project was handicapped because most incoming data still does not conform consistently to the CDISC SDTM standard, and seldom includes standard terminologies other than MedDRA. So it was quite difficult to get data into this repository, much less pull it out, and so Janus has never gone into production use. FDA is now working on designing a new Clinical Trials Repository which will pick up where the Janus project left off, and they have indicated their intentions to publish the data model they're developing.

So defining a consistent repository for these data is one important step. But populating it with data ready for scientifically sound analysis is quite another. The inherent complexity of clinical data, combined with the one-study-at-a-time attitude, means that the data just can't align easily for pooling, even within one sponsor's drug program, much less across many others. For example, a critical clinical endpoint such as "bleeding" may vary in its description, attributes, and thresholds from study to study (not to mention drug to drug) while using the same label. The nuances of these definitions are generally buried within the protocol document, rather than captured in metadata. While the CDISC SDTM provides a structural model for representing data, it does not really ensure consistency of meaning, especially when most users currently map data retrospectively to SDTM, which is rather like sorting your recycling for pickup into paper and plastic. That may be better than dumping it all into a single bin, but just barely, since we're just deferring the more difficult sorting by type, color, etc. farther downstream (and hopefully not just to a landfill). A particularly striking example of a semantic web approach to removing ambiguity from language can be found at



So it's not enough just to address the structure. To handle this most challenging of problems, we have to have a much more comprehensive way to fully describe and represent data using rich metadata, complex datatypes, patterns, and models. One approach, offered by Thomas Beale as part of the openEHR Foundation, is to represent scientific observations as a triplet (there's that semantic web concept again) of data, state, and protocol. The data contain the value of a precisely defined measurement itself, say a systolic blood pressure value. The state indicates the patient condition at the time: was this at rest or after exertion? Was the patient standing, sitting, or supine? The protocol describes the method: what size of cuff, which arm was measured, what type of measuring device, and did it occur after some other set of mandated procedures (such as a blood draw, which may have created anxiety in the patient)?
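
To make the triplet idea concrete, here is a minimal sketch in Python. It is not the openEHR reference model; the class and field names (State, Protocol, Observation, posture, cuff_size, and so on) are hypothetical, chosen only to mirror the blood pressure example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class State:
    """Patient condition at the time of measurement (hypothetical fields)."""
    posture: str   # e.g. "standing", "sitting", or "supine"
    exertion: str  # e.g. "rest" or "post-exertion"


@dataclass(frozen=True)
class Protocol:
    """How the measurement was taken (hypothetical fields)."""
    cuff_size: str
    arm: str
    device: str
    preceding_procedure: Optional[str] = None  # e.g. a blood draw


@dataclass(frozen=True)
class Observation:
    """Data plus the state and protocol context needed to interpret it."""
    name: str
    value: float
    unit: str
    state: State
    protocol: Protocol


# One systolic blood pressure reading with its full context attached.
reading = Observation(
    name="systolic_blood_pressure",
    value=128.0,
    unit="mmHg",
    state=State(posture="sitting", exertion="rest"),
    protocol=Protocol(
        cuff_size="adult",
        arm="left",
        device="oscillometric",
        preceding_procedure="blood draw",
    ),
)
```

The point of the structure is simply that the number 128 never travels alone: whatever tool receives it also receives the posture, the device, and the fact that a blood draw came first.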

Ideally, all of these aspects of a clinical observation would be specified in a machine-readable protocol that uses information objects fully characterized in a metadata repository, and carried through data collection and analysis as metadata, so that advanced clinical query tools in an aggregate data repository would be smart enough to recognize which particular measurements are scientifically consistent, and which do not line up sufficiently to allow them to be aggregated for a meaningful pooled analysis.
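
As a rough illustration of what "smart enough" might mean at its very simplest, the sketch below groups measurements by their context metadata so that only like is pooled with like. The record layout, the field names, and the exact-match rule are all assumptions made for this example; a real query tool would need far more nuanced rules than strict equality.

```python
from collections import defaultdict

# Hypothetical context fields that must agree before values are pooled.
CONTEXT_FIELDS = ("measurement", "unit", "posture", "exertion", "method")


def context_key(record, fields=CONTEXT_FIELDS):
    """Build a hashable key from the metadata that defines comparability."""
    return tuple(record.get(field) for field in fields)


def group_poolable(records, fields=CONTEXT_FIELDS):
    """Group records whose context metadata match exactly."""
    groups = defaultdict(list)
    for record in records:
        groups[context_key(record, fields)].append(record)
    return dict(groups)


# Illustrative records only, not real study data.
records = [
    {"study": "A", "measurement": "SBP", "unit": "mmHg", "posture": "sitting",
     "exertion": "rest", "method": "cuff", "value": 128},
    {"study": "B", "measurement": "SBP", "unit": "mmHg", "posture": "sitting",
     "exertion": "rest", "method": "cuff", "value": 134},
    {"study": "C", "measurement": "SBP", "unit": "mmHg", "posture": "supine",
     "exertion": "rest", "method": "cuff", "value": 119},
]

groups = group_poolable(records)
for key, members in groups.items():
    print(key, [m["study"] for m in members])
```

Here the readings from studies A and B land in one group, while the supine reading from study C is kept apart rather than silently averaged in with the others.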

And even that does not get us all the way there: we have to be certain we understand the clinical subject population as well. Clinical trials are extremely choosy about their subjects, who must meet what can often be a very complex set of eligibility criteria. These criteria are adjusted from study to study, so joining records from one study to those of another can create an unbalanced mixture of separate homogeneous populations, which bears little resemblance to the real world. Here again, it should be possible for new aggregate data analysis tools to understand the eligibility criteria so they could identify studies where the populations are sufficiently similar to allow meaningful pooling of data once the characteristics of the study population are well described.
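
One crude way to picture such a comparison is sketched below, under the strong assumption that eligibility criteria can be reduced to numeric ranges. Real criteria are far richer than that, and the criterion names, the scoring rule, and the example ranges here are purely illustrative.

```python
def interval_overlap(a, b):
    """Fraction of the combined span of two (low, high) ranges that both cover."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    overlap = max(0.0, high - low)
    span = max(a[1], b[1]) - min(a[0], b[0])
    return overlap / span if span > 0 else 1.0


def population_similarity(criteria_a, criteria_b):
    """Score two studies by their least-similar shared criterion (0.0 to 1.0)."""
    shared = criteria_a.keys() & criteria_b.keys()
    if not shared:
        return 0.0  # nothing comparable, so assume the populations differ
    return min(interval_overlap(criteria_a[k], criteria_b[k]) for k in shared)


# Hypothetical eligibility ranges for two studies.
study_a = {"age_years": (18, 65), "bmi": (18.5, 30.0)}
study_b = {"age_years": (40, 80), "bmi": (18.5, 35.0)}

similarity = population_similarity(study_a, study_b)
print(f"population similarity: {similarity:.2f}")
```

Taking the minimum across criteria is a deliberately conservative choice: two studies that agree on body mass index but barely overlap on age should not be treated as drawing from the same population.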

A further degree of complexity is introduced by the design of the protocols themselves, such as the scheduling of events, order and types of procedures, and definition of treatment groups (especially those that involve changes in treatment during the course of the study, such as crossover designs). Then there are the many assumptions that may have been made by the biostatisticians when analyzing the data, and the additional data elements they derive to facilitate analysis. It's nearly impossible for a human being to absorb and retain enough information to keep track of all of these variations, especially when delving into an ocean of possibilities. These have to be represented within the data repository directly before they can be interpreted by query tools.

Making this transition requires adoption of a uniform information model, consistent data types, and common semantics by all studies in the repository, a level of harmonization which we haven't come close to achieving yet. Even those who are trying to solve this problem often break into separate competing camps with different approaches and perspectives, not unlike our political environment in the United States. But at least the range of variation is slowly receding, and the different camps are beginning to huddle more around the same campfires to continue the conversation in forums such as the Clinical Information Modeling Activity being led by Stan Huff of Intermountain Healthcare. So while we are still a great distance from achieving a lingua franca for medical research, we may have gotten to the stage where those who speak different languages are at least able to communicate basic concepts in rudimentary ways. This situation will only improve over time, as long as we provide a learning environment to capture the knowledge as it accumulates, and the proper infrastructure to share it, at its fundamental atomic, elemental, and molecular levels, with all researchers. The aim is not just to inform but to drive the design of new studies, so that the data uncovered by that research can persist for many other secondary purposes well into the foreseeable future.

So to use aggregated data with scientific prudence, we'll need much more metadata than we have today to supply the relevant context and parameters, systems that can keep track of all that, and tools that are smart enough to process it all with sufficient understanding to guide us toward making meaningful and defensible analyses.

Again, conducting analysis of research data is not exactly something to be done on the road, on an iPad, when one should really be on vacation, rather than crunching against a database. Unless, of course, crunching numbers in a database is your idea of a vacation. Maybe someday it will all be so easy, it will seem like one.


Wayne R. Kubick is Senior Director, Life Sciences Product Strategy at Oracle Health Sciences. He can be reached at

© 2024 MJH Life Sciences

All rights reserved.