Multi-Trial Data Integration

Applied Clinical Trials

Supplements-02-02-2005, Volume 0, Issue 0

The benefits offered by the iCDW begin with a single trial by accelerating the statistical analysis plan.

Drug and medical device developers face an important fork in the road concerning data management for clinical trials. For years they have dropped information gathered by case report forms (CRFs) into data "silos"—database applications designed for one trial only. Their focus has been strictly on completing the trial's analysis so they can submit their new products for FDA approval as quickly as possible. There has been little concern for the future usefulness of the raw data.

After less than a decade of such purely tactical data management, the typical clinical data repository appears as disorganized as most people's attics. However, life sciences companies have begun to realize the enormous, even strategic, value of clinical data integrated across trials. When properly collated, this collection of raw data becomes an asset worth billions in cost savings and additional revenues. Yet before one can turn this data into dollars, someone must go through the attic and "organize the boxes." This means placing each trial's data into an integrated "clinical warehouse" designed to eliminate the trial boundary separating valuable pools of data, so it can be innovatively accessed and mined.

Though database technology makes building such a repository possible, it is still an effort expensive enough to require a careful business case. This article outlines that business case, sketching the major action steps to creating the integrated clinical data warehouse (iCDW). More importantly, it details the large rewards that make such an investment unquestionably worthwhile.

Clinical data value chain

For life sciences companies, data management plays a central role in converting the expense of clinical trials into large revenue streams. Once drug and device developers have identified a new product, they must still gain regulatory approval. The FDA grants its coveted approval only when a company can soundly argue that data from multiple clinical trials prove its new product is both safe and effective. The cost of the necessary series of clinical trials has grown steadily; bringing a new drug to approval now requires between $350 million and $800 million. But that drug is often worth billions in revenue when it is finally released to the market.

Figure 1 depicts the clinical data "value chain" that begins with the clinical trial capturing raw data collected on CRFs. To turn the CRF data into a regulatory approval filing, the life sciences company must execute a "statistical analysis plan" (SAP) in which biostatisticians assemble the raw numbers into analytical data sets and feed them to statistical programs, which generate hopefully compelling evidence of efficacy and acceptable side effects. The product developers submit the resulting analyses and supporting raw data to the FDA along with the design of the SAP, so that the agency can validate the developer's claim. Life sciences companies must repeat this process several times as the potential product progresses through multiple Phase I, II, and III clinical trials, and again for Phase IV trials whenever they want to re-label existing products for additional uses or to meet regulatory requirements for postmarketing studies.

Figure 1. Clinical data value chain and X-trial integration benefits.

As we will see, the iCDW will have several impacts upon this value chain, beginning with increased speed of execution for the SAP. Efficiencies gained in the SAP save time and expense, but, more importantly, the iCDW works "upstream" to abbreviate trial designs. Protocols involving integrated data require smaller patient populations, causing the trials to require less calendar time. Ultimately, combining data from multiple trials can increase its statistical power. That allows firms to reveal and win approval for additional applications of their products, eliminating the need for additional clinical trials designed solely to gain FDA approval for such expanded indications.

Second generation clinical data warehouses

Before detailing the value of cross-trial integration, let us first demonstrate that it is possible. The standard clinical data handling tools today are a mainstream database, such as Oracle, and an analytical engine, typically SAS. With structures and routines custom-developed for a given trial, this combination very capably transforms a single trial's data (once captured) into the analytical datasets and aggregated results required for product approval. However, a "first generation" approach stovepipes the data from CRF straight to the submission report. Although the trials may be co-located in something called a "clinical warehouse," the data structures and programming for each trial are uniquely tailored to the particularities of the trial's design. Stovepiped designs can be reused only if the next trial is identical to the first. Yet for most companies, even trials within the same therapeutic area often experience 20% to 50% variation in their CRF elements alone, as the science underlying their designs evolves.

(Variability of 20% to 50% between trials was the planning target adopted by the CRF Reference Library team at a leading medical device company at which the author consulted as a clinical data warehouse architect. The target arose from an analysis of CRF elements shared by six recent trials in the same product line. It lay within the same range as the targets adopted by a major pharmaceutical firm, where the author also consulted.)

Working with a clinical data warehouse designed for trial integration and re-use obviates the trial-by-trial development of data schemas, table joins, aggregation logic, and output formats—all inordinately expensive and time consuming to continually re-invent and validate. Understandably, building a second generation iCDW is not just a software challenge, but also requires efforts on several fronts. First, a company will need to provide some codification of its research data. Leading pharma and device companies have already started developing reference libraries that guide their researchers toward standard phrasing and recording for the CRF items they use repeatedly in their trial protocols. We see similar industry initiatives such as MedDRA for clinically validated medical terminology. Recent CDISC formats now standardize clinical data interchanges and will soon be the lingua franca for submissions to the FDA.1

Secondly, iCDWs will require a codification of the context of observations within the study's design. Luckily, other industries have already pioneered the notion of "universal data models" that express in a generally accepted and reusable fashion the interrelation of business components with their applications. Life sciences can achieve this also, because with few exceptions every trial involves the same links between such notions as protocol, patient, treatment group, and visit. Indeed, data modelers in the pharma industry have begun publishing suggested schemas, hoping to obviate senseless variation in basic data designs between trials.2-3

With this backdrop of standardization, the only missing foundational component for the iCDW is a repository for trial data flexible enough to accommodate the inevitable variation in CRF elements. Luckily, data modelers have also devised the means for abstracting a research observation for universal storage. At the cost of some additional processing, mostly upon retrieval, the iCDW converts each item in a CRF into a "name-value pair" stored independently in the database, as depicted in Figure 2.

Figure 2. Abstracting CRF columns into iCDW rows.

As a result of this data abstraction, what were columns in a first generation repository become rows in the second generation iCDW. Formerly "short and fat" CRF tables become "tall and skinny." Observations that were once stored in the DEMOGRAPHICS table for Trial A and PHYSICAL CHARACTERISTICS table for Trial B are now both stored in a single table called simply OBSERVATIONS. The name of the source trial and CRF is stored alongside as an attribute of each observation, supporting trial-specific retrieval. However, the CRF attribute can be ignored or more accurately re-assigned upon cross-trial retrieval so that results can be collated in any manner that makes sense. Astonishingly, this architectural solution achieves cross-trial integration of CRF data almost as an afterthought.4-5
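The column-to-row abstraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the schema of any particular iCDW product: the Observation layout, the key columns (patient_id, visit), and the sample CRF items are all hypothetical.

```python
from typing import Any, Dict, Iterable, List, NamedTuple


class Observation(NamedTuple):
    """One row of the tall, skinny OBSERVATIONS table (hypothetical layout)."""
    trial: str
    crf: str
    patient_id: str
    visit: str
    item_name: str
    item_value: Any


# Columns that identify the context of an observation rather than a CRF item.
KEY_COLUMNS = ("patient_id", "visit")


def abstract_crf_rows(
    trial: str, crf: str, rows: Iterable[Dict[str, Any]]
) -> List[Observation]:
    """Turn each non-key column of a wide CRF row into one name-value row."""
    observations = []
    for row in rows:
        for name, value in row.items():
            if name in KEY_COLUMNS or value is None:
                continue
            observations.append(
                Observation(trial, crf, row["patient_id"], row["visit"], name, value)
            )
    return observations


# Two trials whose CRF tables have different names and different columns.
trial_a_demographics = [
    {"patient_id": "A-001", "visit": "screening", "age": 54, "sex": "F", "weight_kg": 70.5},
]
trial_b_physical = [
    {"patient_id": "B-017", "visit": "baseline", "age": 61, "weight_kg": 82.0, "height_cm": 178},
]

observations = (
    abstract_crf_rows("TRIAL_A", "DEMOGRAPHICS", trial_a_demographics)
    + abstract_crf_rows("TRIAL_B", "PHYSICAL_CHARACTERISTICS", trial_b_physical)
)

# Cross-trial retrieval: ignore the trial and CRF attributes, select by item name.
ages = [(o.patient_id, o.item_value) for o in observations if o.item_name == "age"]
```

Both trials land in the same list despite their differing CRF layouts, and the cross-trial query needs no knowledge of which table an item originally came from.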

Single and multitrial advantages

The enormous benefits offered by the iCDW begin with a single trial by accelerating the execution of the statistical analysis plan. The biostatisticians within the company who are entrusted with the SAP are typically adroit at statistical inference, but extremely vexed by the drudgery of marshalling together the many tables that comprise a trial's raw data. Within the iCDW, the elements that make trials unique have been abstracted so that their variation affects only the value of a few attributes stored with each observation. The physical column in which all observations reside now remains the same from trial to trial. Granted, the firm must develop a de-abstraction routine that swings the desired observation out from the tall and skinny table into the desired short and wide analysis dataset, but this routine is re-usable, a crucial note.
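A de-abstraction routine of the kind described here might look like the following minimal sketch. The row layout (trial, patient, visit, item name, item value) and the item names are assumptions made for illustration; a production routine would also handle data types, units, and repeated measurements.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Sequence, Tuple

# Assumed row layout: (trial, patient_id, visit, item_name, item_value)
ObservationRow = Tuple[str, str, str, str, Any]


def deabstract(
    observations: Iterable[ObservationRow], wanted_items: Sequence[str]
) -> List[Dict[str, Any]]:
    """Pivot tall name-value rows into one wide record per trial/patient/visit."""
    wide: Dict[Tuple[str, str, str], Dict[str, Any]] = defaultdict(dict)
    for trial, patient_id, visit, item_name, item_value in observations:
        if item_name in wanted_items:
            wide[(trial, patient_id, visit)][item_name] = item_value

    dataset = []
    for (trial, patient_id, visit), items in sorted(wide.items()):
        record = {"trial": trial, "patient_id": patient_id, "visit": visit}
        for name in wanted_items:
            # Items a trial never collected stay empty instead of breaking the layout.
            record[name] = items.get(name)
        dataset.append(record)
    return dataset


observations = [
    ("TRIAL_A", "A-001", "screening", "age", 54),
    ("TRIAL_A", "A-001", "screening", "sex", "F"),
    ("TRIAL_A", "A-001", "screening", "weight_kg", 70.5),
    ("TRIAL_B", "B-017", "baseline", "age", 61),
    ("TRIAL_B", "B-017", "baseline", "height_cm", 178),
]

analysis_dataset = deabstract(observations, ["age", "weight_kg"])
```

The same function serves every trial: only the list of wanted items changes, which is the re-use the text emphasizes.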

Once given a spreadsheet-like interface to make its usage straightforward for biostatisticians (or the IS team supporting them), the de-abstraction routine needs only to be rerun for subsequent trials, obviating the inordinate development and time-consuming revalidation work required for each trial by the first generation approach. One medical device manufacturer estimated such a reusable clinical data management platform would reduce the "data gymnastics" required for each trial by 75%, and that the SAP for a typical trial would consequently complete in one week rather than one month. In a marketplace where the top 40 drugs average $500 million per quarter, streamlining the SAPs for just the final analysis of these compounds would yield between $50 and $350 million.6 With vendor-supplied iCDW implementations running under $5 million (based on initial list price quotes for software and implementation services received from four major vendors in August 2003), the projected savings over multiple trials suggest that even midsize R&D companies and clinical research organizations can expect a 10-fold return on investment with an iCDW when applied to trial-specific analysis alone.

Yet these rewards pale in comparison to the cost savings, reduced time-to-market, and improved R&D impacts that arise once the iCDW can be used to combine clinical data across trials. Consider how cross-trial integration can make products more competitive. Any given clinical trial is very focused by design upon the safety and efficacy of a single product. By combining data across trials, biostatisticians can mine for correlations between efficacy and product characteristics outside the narrow focus of a single compound or device, revealing new advantages to existing products and identifying subpopulations that could be more effectively served. Such insights will point the way to improved product formulations and designs.

As a hypothetical example from the world of medical devices, a recent paper in a medical journal suggested that strut diameters among cardiac stents are directly related to the reoccurrence of vascular blockage or "restenosis."7 A stent maker equipped with an iCDW could readily assemble data from five of its past stent trials, each of which focused on a different strut dimension. Quickly, it could statistically prove that thinner struts yield better outcomes, and simultaneously point out that it offers stents with some of the thinnest struts on the market. This type of compelling argument is exactly what marketing executives want and expect from their research departments.

As a second hypothetical, consider a Fortune 100 drug maker that had gained approval of a compound for reducing ocular edema. The drug was approved for prescription by ophthalmologists, of which there are approximately 80,000 in the United States, generating a market slightly under $100 million per year. Considered separately, the data of each trial only hinted at improved vision among diabetics as one of the secondary benefits of the drug. By assembling the data of multiple prior trials, a sample population of diabetics was created with enough statistical power to infer with greater than 95% confidence that the compound did in fact improve visual acuity in diabetics. Upon resubmittal of the compound to the FDA for an expanded indication, this compound was approved for prescription by general practitioners, of which there are several hundred thousand in the United States alone. The relatively simple act of combining data across trials enabled a five-fold increase in the number of practitioners that could prescribe the medication, expanding the market for the drug to where it generated several billion dollars of revenue over a five-year period.8
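Why pooling raises statistical power can be illustrated with a standard normal-approximation power calculation for comparing two group means. The effect size, subgroup size, and trial count below are invented for illustration and are not taken from the example above; the sketch also treats the pooled trials as exchangeable, which a real analysis would have to justify.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(effect_size: float, n_per_arm: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardized difference between arms (Cohen's d).
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_arm / 2)
    # Probability of landing in either rejection region under the alternative.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


# Hypothetical figures: a modest effect, 40 diabetic subjects per arm in each trial.
EFFECT_SIZE = 0.3
DIABETICS_PER_ARM_PER_TRIAL = 40
NUMBER_OF_TRIALS = 5

single_trial_power = approx_power(EFFECT_SIZE, DIABETICS_PER_ARM_PER_TRIAL)
pooled_power = approx_power(
    EFFECT_SIZE, DIABETICS_PER_ARM_PER_TRIAL * NUMBER_OF_TRIALS
)

print(f"single trial: {single_trial_power:.2f}, pooled: {pooled_power:.2f}")
```

Under these assumptions a single trial's subgroup would detect the effect well under half the time, while the pooled subgroup exceeds the conventional 80% power target. That is the sense in which each trial alone "only hinted" at a benefit the combined data could establish.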

If cross-trial data integration is so valuable, it is natural to ask whether it can be achieved by manual means rather than requiring the implementation of a second generation CDW. Collating data across trials that have not been formatted for integration turns out to be an inordinately labor-intensive task. The device manufacturer cited earlier discovered that the manual collation of only four trials required three person-months as the biostatisticians carefully knitted together 600 items scattered among 50 data tables with missing or incomplete metadata. Not only could the professional time consumed by this painstaking effort have been better invested in designing new trials rather than rehashing old ones, but the results were also less compelling than they could have been: six other trials' worth of data were excluded because the budget for the exercise was exhausted.

The three-prong endeavor of implementing the iCDW—reference library, standardized data model, and abstract CRF item storage—pre-processes much of the integration work for such cross-trial analyses. The task is reduced to simply selecting which elements the warehouse should combine in one output column versus separating into individual fields of the analysis dataset. Furthermore, the data mining empowered by an iCDW will allow life sciences companies to improve their trial designs. By increasing the statistical power of the data, biostatisticians will be better able to predict such things as the concomitant medications that will impact a given class of compounds and the physical characteristics linked to more frequent follow-up events. Such insights will allow protocol designers to better target their inclusion and exclusion criteria, as well as adapt their visit schedules, thus avoiding contingencies that prolong and undermine the quality of research.

Shortening the length of trials

The final benefit of iCDWs that we will consider may be the most important of all: support for Bayesian inference, a recent innovation in biostatistics that promises to regularly reduce trial size and duration within a given therapeutic area. Whereas old-school, frequentist statisticians believe that each trial must observe for itself outcomes across the entire domain reported for each treatment group defined, Bayesian statisticians argue that inferences made in other trials with other patients can be carefully applied to a new study's data analysis. Based upon the theories of conditional probability promulgated by the 18th century mathematician Thomas Bayes, Bayesian biostatisticians employ prior knowledge to increase the predictive power of a given set of trial data. This approach is rapidly gaining adherents and acceptance, as Figure 3 implies, including a steady stream of presentations among the advisory councils of the FDA.


Figure 3. Growing occurrence of Bayesian inference topics in medical journals.

Applying the concept to life sciences, consider the occurrence of anemia in the subjects of an antiviral study. The goal is to state with 99% confidence that this adverse event is not causally linked to the drug being studied. A frequentist would design the trial to gather observations across a wide range of dosages, leading to a large number of subjects. The Bayesian design, however, would require a much smaller population because its analysis would include the probability that a 500-mg daily dose is safe, based upon good scientific reasoning supported by the ADME and pharmacodynamics documented in previous trials.

The controversy over this technique centers upon appropriateness of the prior knowledge that gets "pipelined" into a current study. Use of Bayesian inference must include a careful disposition on how the raw data from previous trials was identified and screened to remove significant "covariates." To make it work, the analysis team must be able to start with data of as many prior trials as possible stored in a repository that does not impede their efforts with trial boundaries. When it works, the trial requires fewer data points to statistically prove a product's safety and efficacy, thus allowing smaller trials to achieve more.10-11
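How prior data can shrink a trial can be sketched with the simplest Bayesian model for an event rate, a beta-binomial update. This is an illustrative stand-in, not the design method of any cited study: the 5% acceptable rate, the 2% observed rate, and the 300 prior subjects with 6 events are all invented, and the prior data are taken at full weight, whereas a real design would screen and usually discount them.

```python
from math import comb
from typing import Optional


def beta_cdf(x: float, a: int, b: int) -> float:
    """P(rate < x) under a Beta(a, b) distribution with integer a and b.

    Uses the exact identity
    P(Beta(a, b) >= x) = P(Binomial(a + b - 1, x) <= a - 1).
    """
    n = a + b - 1
    upper_tail = sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a))
    return 1.0 - upper_tail


def smallest_trial(
    prior_events: int,
    prior_subjects: int,
    threshold: float = 0.05,      # hypothetical acceptable adverse-event rate
    target: float = 0.99,         # required posterior probability
    subjects_per_event: int = 50, # assume the new trial observes a 2% event rate
    step: int = 50,
    max_subjects: int = 2000,
) -> Optional[int]:
    """Smallest enrollment at which P(rate < threshold | data) reaches target."""
    for n in range(step, max_subjects + 1, step):
        events = n // subjects_per_event
        # Beta(1, 1) is a flat prior; earlier trial data are added on top of it.
        a = 1 + prior_events + events
        b = 1 + (prior_subjects - prior_events) + (n - events)
        if beta_cdf(threshold, a, b) >= target:
            return n
    return None


n_flat = smallest_trial(prior_events=0, prior_subjects=0)
n_informative = smallest_trial(prior_events=6, prior_subjects=300)

print(f"flat prior: {n_flat} subjects, informative prior: {n_informative} subjects")
```

With the flat prior the new trial must supply all of the evidence itself; with the informative prior most of it is already in hand, so far fewer new subjects are needed. The size of that gap depends entirely on how much weight the earlier data deserve, which is exactly the controversy described above.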

Re-basing entire research programs on smaller trials improves the economics of clinical research, and even the life sciences company's revenue stream. Clinical trials entail high per-subject costs: $2,500 per patient in drug trials and $10,000 per patient for devices are commonly cited figures. If even one Phase III trial can be brought down from 5,000 patients to 3,000 patients, the Bayesian inference enabled by the iCDW would save the sponsor roughly $5 million (at drug-trial rates) to $20 million (at device-trial rates) on its first use.
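The savings follow directly from the per-patient figures quoted above. A short worked calculation, using only those figures:

```python
def enrollment_savings(
    patients_before: int, patients_after: int, cost_per_patient: int
) -> int:
    """Direct cost avoided by enrolling fewer patients."""
    return (patients_before - patients_after) * cost_per_patient


# Per-patient costs and trial sizes as quoted in the text.
drug_savings = enrollment_savings(5_000, 3_000, 2_500)
device_savings = enrollment_savings(5_000, 3_000, 10_000)

print(f"drug trial:   ${drug_savings:,}")    # $5,000,000
print(f"device trial: ${device_savings:,}")  # $20,000,000
```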

Yet far more dramatic is the impact that reduced patient populations will have on an organization's revenue stream. For obvious reasons, life sciences companies pursue their trials using physicians and medical centers they already know well, a practice that imposes upon them a finite capacity for completing a study, often expressed in a maximum number of patients per week. By using Bayesian inference to reduce the number of subjects within each trial, a company will shorten the time required to process the necessary number of patients through its existing research network. Combining this effect of Bayesian inference with the accelerated execution of the SAP and other benefits explored earlier across the multiple study phases required to approve a new product, drug and device developers investing in an iCDW can shorten a new product introduction by a year or more. With new blockbuster drugs averaging revenues of $500 million per quarter, the first-year value of improved time-to-market offered by this single technology alone translates to $2 billion of additional sales per major product supported.
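The link between enrollment capacity and revenue can be made concrete with a rough calculation. The $500 million per quarter and the 5,000 versus 3,000 patient counts come from the text; the capacity of 50 patients per week is a purely hypothetical figure chosen for illustration.

```python
from math import ceil

QUARTERLY_REVENUE = 500_000_000  # blockbuster average quoted in the text
WEEKS_PER_QUARTER = 13
PATIENTS_PER_WEEK = 50           # hypothetical capacity of the research network


def weeks_to_enroll(patients: int, patients_per_week: int) -> int:
    """Calendar weeks needed to process a trial's patients at fixed capacity."""
    return ceil(patients / patients_per_week)


weeks_saved = weeks_to_enroll(5_000, PATIENTS_PER_WEEK) - weeks_to_enroll(
    3_000, PATIENTS_PER_WEEK
)
revenue_per_week = QUARTERLY_REVENUE / WEEKS_PER_QUARTER
revenue_pulled_forward = weeks_saved * revenue_per_week

# A full year gained corresponds to four quarters of revenue.
first_year_revenue = 4 * QUARTERLY_REVENUE

print(f"weeks saved on one trial: {weeks_saved}")
print(f"revenue pulled forward:   ${revenue_pulled_forward:,.0f}")
print(f"value of a full year:     ${first_year_revenue:,}")
```

Under the assumed capacity, the single trial reduction already recovers most of a year; the $2 billion figure corresponds to a full year of sales at the quoted quarterly average.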


This article detailed multiple means by which cross-trial data integration can reduce the cost, size, and number of trials required to gain FDA approval for a new product. We explored how this streamlining of the clinical research process generates cost savings and revenue enhancements that will grow beyond the $100 million level for even a midrange life sciences company. Once a company has re-engineered its clinical data repository so that it can accommodate multiple trials of widely varying designs, the financial benefits will far outweigh the expense required to build the iCDW. In the highly competitive environment that recent achievements in basic science have created for drug and medical device developers, a tool costing only tens of millions of dollars to deploy that enables them to reduce the drug and device approval process by months or years will seem an easy investment to make.


1. Formats for "submission data model" and discussions of XML format for such data interchanges, available at


2. R. Hughes, "Towards a Universal Data Model for Pharmaceutical Clinical Trial Data," DM Review (February 21, 2003), available at

3. L. Silverston, The Data Model Resource Book, vols. 1, 2 (New York: John Wiley & Sons, 2001).

4. P. Nadkarni, The EAV/CR Model of Data Representation, accessed at

5. R. Chen et al., "Exploring Performance Issues for a Clinical Database Organized Using an Entity-Attribute-Value Representation," Journal of the American Medical Informatics Association, 7 (5) (September/October 2000).

6. "Lipitor, Zocor Top 2Q Sales Charts," Contract Pharma (September 2001), available at

7. J. Pache et al., "Intracoronary Stenting and Angiographic Results: Strut Thickness Effect on Restenosis Outcome (ISAR-STEREO-2) Trial," Journal of the American College of Cardiology, 41 (8) 1283-1288 (April 16, 2003).

8. E. Zander et al., "Maculopathy in Patients with Diabetes Mellitus Type 1 and Type 2," British Journal of Ophthalmology, 84, 871-876 (August 2000), available at

9. "Panel Examines Moving CDER's Thinking Forward to 'Desired State,'" News Along the Pike, 10 (3) (August 2004), Center for Drug Evaluation and Research, accessed at

10. R. Weiss, "Bayesian Sample Size Calculations for Hypothesis Testing," Journal of the Royal Statistical Society (Series D): The Statistician, 46 (2) 185-191 (July 1997).

11. S. Senn, "Consensus and Controversy in Pharmaceutical Statistics," Journal of the Royal Statistical Society (Series D): The Statistician, 49 (2) 135 (July 2000).

Ralph Hughes, MA, PMP, is managing architect with Ceregenics, Inc., a data warehouse consulting association in Denver, CO, (720) 951-2100, and is a founding member of the World Wide Institute of Software Architects.