Using Public and Private Data for Clinical Operations


Applied Clinical Trials

Applied Clinical Trials-12-01-2016
Volume 25
Issue 12

Comparing mean vs. median to uncover the full data picture of site-level performance.

Leveraging data to support evidence-based enrollment planning and site identification for clinical studies is a hot topic among companies looking to streamline study start-up. With trial costs continuing to escalate, and start-up for sites costing approximately $20,000 to $30,000 per site,1 pharmaceutical companies are heavily invested in ensuring the study will deliver according to time and budget predictions and that the countries and sites that are selected recruit successfully. 

Accessing accurate metrics to support study planning and to predict successful site recruitment is critical to this evidence-based approach. Most companies are using actual site-level data on historical performance from their own company’s clinical trial management systems (CTMS). Because these data are collected at the site level, a number of metrics can be calculated for each protocol such as: mean, median, quartiles, standard deviation, min/max and % of sites enrolling zero or one patient.
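As a minimal sketch of how these per-protocol metrics can be derived from site-level data, the example below uses Python's standard `statistics` module. The function name `summarize_site_enrollment` and the enrollment counts are hypothetical and purely illustrative; they are not taken from any CTMS.

```python
from statistics import mean, median, quantiles, stdev

def summarize_site_enrollment(patients_per_site):
    """Summary metrics for one protocol from a list of per-site enrollment counts."""
    q1, q2, q3 = quantiles(patients_per_site, n=4)  # quartile cut points
    low_enrollers = sum(1 for p in patients_per_site if p <= 1)
    return {
        "mean": mean(patients_per_site),
        "median": median(patients_per_site),
        "quartiles": (q1, q2, q3),
        "std_dev": stdev(patients_per_site),  # sample standard deviation
        "min": min(patients_per_site),
        "max": max(patients_per_site),
        # share of sites that enrolled zero or one patient
        "pct_sites_zero_or_one": 100 * low_enrollers / len(patients_per_site),
    }

# Hypothetical enrollment counts for eight sites on one protocol.
example_sites = [0, 1, 1, 3, 4, 6, 8, 12]
summary = summarize_site_enrollment(example_sites)
print(summary)
```

None of these metrics except the mean can be recovered when only the study-level totals (number of sites and total enrollment) are known.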

Outside of a company’s CTMS, the other primary source of performance metrics is publicly available information from registries, including ClinicalTrials.gov, and other sources such as publications or press releases. In contrast to the actual site-level information available for each protocol through CTMS, the public data may or may not contain a site or investigator name. Further, performance information is typically only available at the study level for a few key metrics (e.g., number of sites, total enrollment, and study open/close date). Because metrics are only available at the study level, the public data are typically limited to calculations of the mean (i.e., it is not possible to calculate median, standard deviation, quartiles, etc.).

When sample data is normally distributed, the mean, median and mode will all be the same. However, when data is skewed, mean loses the ability to provide an accurate representation of the mid-point of the data and median becomes a better estimate of the mid-point, as it is not strongly influenced by skewed values.2 Because clinical trial enrollment is rarely normally distributed (often having a spike for zero enrollers and a long tail), median is the metric preferred by most companies trying to predict enrollment through evidence-based methods.
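To illustrate the point, here is a small sketch with made-up numbers (not data from this analysis): ten hypothetical sites, two of which enroll no patients and one of which enrolls far more than the rest.

```python
from statistics import mean, median

# Hypothetical patients enrolled at ten sites in one study (illustrative values only):
# two zero enrollers and one very high enroller forming the long tail.
site_enrollment = [0, 0, 1, 2, 2, 3, 3, 4, 5, 40]

study_mean = mean(site_enrollment)      # total patients / number of sites
study_median = median(site_enrollment)  # midpoint of the sorted site values

print("mean:", study_mean, "median:", study_median)
```

In this example the mean is 6 patients per site while the median is 2.5; the single high-enrolling site pulls the mean well above what a typical site delivered.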

In this paper, we investigate the potential differences between using mean calculated from public data only available at the study-level and median calculated from site-level data from CTMS.


To evaluate mean study-level enrollment metrics from public data vs. median site-level information from CTMS, we set out to compare protocols available in both ClinicalTrials.gov and the Investigator Databank, a collaboration sharing CTMS data from five major pharmaceutical companies. (Note: DrugDev technology facilitates the Investigator Databank and TransCelerate’s Investigator Registry.)

Protocol Selection

To begin the analysis, we identified protocols in ClinicalTrials.gov that fulfilled the following criteria:

  • Sponsor company was one of the five pharmaceutical companies participating in the Investigator Databank

  • Phase II or Phase III clinical trial

  • Interventional studies

  • Primary completion date during 2014

Once the set of protocols and the associated “NCT” numbers were identified, we used this information to find the matching studies in the Investigator Databank database. 

Protocols were excluded from the analysis set for any of the following reasons:

  • No sites listed in ClinicalTrials.gov (e.g., “Japan only” listed) or Investigator Databank

  • Stopped studies in ClinicalTrials.gov and terminated studies in Investigator Databank (would generate incomplete metrics)

  • Studies showing illogical dates in CTMS (i.e., would generate inaccurate metrics)

  • One protocol with repeated records from the same PI and site in ClinicalTrials.gov

We did not exclude protocols with only one site (n=3), because in all instances, the number of sites matched between ClinicalTrials.gov and CTMS.


We extracted the number of sites, number of patients enrolled, and enrollment months from ClinicalTrials.gov using the following approach:

  • Number of sites: manually captured from the site list (i.e., a count of sites listed on ClinicalTrials.gov)

  • Number of patients enrolled: downloadable variable

  • Enrollment months: calculated from the date of status “Recruiting” to the date that the status changed to “Active-not recruiting”

Based on these variables from the public data, we calculated mean patients per site and mean patients per site per month for each study.
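The study-level calculation can be sketched as follows. The function name `study_level_means` and the example inputs are hypothetical. The article does not state how days were converted to months for the public data, so the 30.417 days-per-month factor used for the CTMS data is assumed here as well.

```python
from datetime import date

# Assumption: same day-to-month conversion that the article applies to CTMS data.
DAYS_PER_MONTH = 30.417

def study_level_means(n_sites, n_enrolled, recruiting_date, not_recruiting_date):
    """Mean patients per site and per site per month from study-level totals."""
    enrollment_months = (not_recruiting_date - recruiting_date).days / DAYS_PER_MONTH
    patients_per_site = n_enrolled / n_sites
    patients_per_site_per_month = patients_per_site / enrollment_months
    return patients_per_site, patients_per_site_per_month

# Hypothetical study: 20 sites, 240 patients, recruiting for one calendar year.
pps, ppsm = study_level_means(20, 240, date(2013, 1, 1), date(2014, 1, 1))
print(pps, round(ppsm, 2))
```

Note that this treats every site as open for the full study-level recruitment window, which is the only option when site-level dates are not published.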




For the same set of studies, we used the CTMS data to calculate median patients per site and median patients per site per month based on actual site-level metrics for the number of patients enrolled, site open date, and last subject enrolled (LSE) date for each study. Enrollment months was calculated as the time in days from the site open date to LSE at the country level (or last subject closed [LSC] if LSE was not available), divided by 30.417 (the average number of days in a month). LSE/LSC at the country level was used because it more accurately reflects the full enrollment period available to a site.
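A minimal sketch of the site-level calculation is shown below. The site records, the country-level LSE date, and the helper `site_rate` are hypothetical, and the sketch assumes that a rate is computed for each site first and the median is then taken across sites; the article does not spell out that order of operations.

```python
from datetime import date
from statistics import median

DAYS_PER_MONTH = 30.417  # conversion factor stated in the article

# Hypothetical site records for one study in one country:
# (patients enrolled, site open date)
sites = [
    (0, date(2013, 1, 1)),
    (4, date(2013, 3, 1)),
    (6, date(2013, 1, 1)),
    (10, date(2013, 7, 1)),
    (2, date(2013, 5, 1)),
]
country_lse = date(2014, 1, 1)  # last subject enrolled at the country level

def site_rate(patients, open_date, lse):
    """Patients per month for one site, from site open date to country-level LSE."""
    enrollment_months = (lse - open_date).days / DAYS_PER_MONTH
    return patients / enrollment_months

median_patients_per_site = median([p for p, _ in sites])
median_rate = median([site_rate(p, opened, country_lse) for p, opened in sites])

print(median_patients_per_site, round(median_rate, 2))
```

Because each site's enrollment window starts at its own open date, late-opening sites are not penalized the way they are in a study-level calculation.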


In total, we identified 222 studies in ClinicalTrials.gov, which yielded a total of 97 trials in our analysis sample. Table 1 shows the breakdown of the number of studies that were included/excluded in the analysis.

Across the 97 trials in the sample, the average study-level patients per site calculated from public means from ClinicalTrials.gov was 11.31, compared with the site-level value of 9.40 calculated from CTMS medians (see Figure 1A below). In Figure 1, each point represents a study, with the mean plotted against the median. The diagonal red line is the line of equality. Figure 1A shows that, for the majority of studies, the study-level mean was greater than the site-level median.

As displayed in Figure 1C and 1D, the average magnitude of the difference was ±38%, and in 75% of protocols tested, the study-level patients per site from the public data (mean) was greater than the site-level patients per site from CTMS (median).

For patients per site per month, the opposite was true. Here, the average study-level patients per site per month calculated from mean data available in ClinicalTrials.gov was 0.84, as compared to the site-level value of 1.03 calculated using median data from CTMS (Figure 1B). Figure 1B shows that, for the majority of studies, the study-level mean was less than the site-level median.

The average magnitude of the difference was ±86%, and in 80% of the protocols tested, the study-level patients per site per month from the public data (mean) was less than the site-level patients per site per month from CTMS (median). See Figure 1C and 1D. 
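The two summary figures reported here (average magnitude of the difference, and the share of protocols in which one value exceeds the other) can be sketched as below. The article does not give the exact formula, so this assumes the percent difference is taken relative to the site-level median and that the magnitude is the average of the absolute values. The (mean, median) pairs are hypothetical.

```python
# Hypothetical (study-level mean, site-level median) pairs, one per protocol.
pairs = [(12.0, 10.0), (9.0, 6.0), (4.0, 5.0), (15.0, 10.0)]

def percent_difference(study_mean, site_median):
    """Signed percent difference of the mean relative to the median (assumed definition)."""
    return 100 * (study_mean - site_median) / site_median

differences = [percent_difference(m, d) for m, d in pairs]

# Average magnitude: mean of the absolute percent differences.
average_magnitude = sum(abs(x) for x in differences) / len(differences)

# Direction: share of protocols where the public mean exceeds the CTMS median.
share_mean_greater = sum(1 for x in differences if x > 0) / len(differences)

print(average_magnitude, share_mean_greater)
```

For these illustrative pairs the average magnitude is 35% and the mean exceeds the median in 75% of protocols; reporting both numbers separates how far apart the two sources are from which way they tend to differ.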

We then investigated whether these findings differed across therapeutic areas by examining results based on medical subject headings (MeSH),4 a standard definition of diseases maintained by the National Library of Medicine and used by ClinicalTrials.gov and other public data sources (e.g., PubMed). Specifically, we targeted the therapeutic area analysis to the three largest therapeutic areas in our analysis set: endocrinology (n=11), cardiovascular (n=10), and neoplasms (i.e., oncology; n=9).



The results of the analysis are displayed in Figure 2 and Figure 3 (see below). As seen in these figures, the differences between average study-level means from public data and average site-level medians from CTMS data were consistent in direction across the cardiovascular and oncology therapy areas for both patients per site and patients per site per month (study-level greater than site-level for patients per site, and study-level less than site-level for patients per site per month). For endocrinology, the difference was reversed for patients per site per month. For both variables, the differences between study-level and site-level data were most marked for the neoplasms (oncology) therapy area.

The average magnitude of the difference between study-level means calculated from ClinicalTrials.gov and site-level medians calculated from CTMS is displayed in Table 2 below. Here, the average percent difference in patients per site ranged between ±20% and ±71% (compared to ±38% overall). Similarly, the average percent difference in patients per site per month ranged between ±38% and ±76% (compared to ±86% overall). In both cases, the greatest variance was observed for neoplasms studies.


First of all, the sample size of 97 protocols was small, especially when broken down by therapeutic area, with only nine to 11 studies per analysis group. The reason behind this small sample size was our approach to minimizing selection bias in the protocols selected by controlling for a number of factors including: phase, study type (interventional) and study close date (2014). It was notable, however, that even with the small sample sizes, the differences between study-level mean public data and site-level median data were consistent across all therapeutic areas investigated for the patients per site metrics, and across two of the three therapy areas for patients per site per month. 



Another potential limitation is the fact that the analysis is based on studies from only a few specific large pharmaceutical companies. Nevertheless, given the size of these organizations, we believe that the analysis represents approximately 20% of global clinical trials. While we believe that the results are representative of at least large pharma, it is possible that the experience of smaller biotech companies could differ. 


Mean (or average) and median are statistical terms that are used to understand a distribution: mean refers to the arithmetic average and median represents the observation at the midpoint or 50th percentile of the sample. Because the mean can be largely influenced by outliers (i.e., any single value too high or too low compared to the rest of the sample), it is best used for normal distributions. In contrast, the median is often taken as a better measure of a midpoint in describing skewed distributions.2  

Given differences across and even within a country (in time to contract execution, IRB/ethics review, drug supply availability, site initiation visits, and patient numbers/prevalence), clinical trial recruitment distributions are rarely normal and, thus, the median is likely more representative of the midpoint for study, country and site enrollment projections. The relatively high percentage of zero-enrolling sites (estimated at 10% to 20% by the Tufts Center for the Study of Drug Development1) also contributes to skew in the enrollment distribution (i.e., it is not normal).

Our results show that mean patients per site calculated from study-level data available from public sources was greater than median patients per site calculated from CTMS data available at the site level. However, the relationship for patients per site per month was reversed, in that mean study-level patients per site per month from public data was less than median patients per site per month calculated from CTMS. This appears to result from the less accurate overall study-level dates available on ClinicalTrials.gov for enrollment duration (calculated from the date of status “Recruiting,” assuming all sites are open, to the date that the status changed to “Active-not recruiting”) compared to the more detailed actual site-level start-up dates and enrollment dates available from CTMS.

In our analysis, using the study-level mean from public data to predict enrollment at the site level could produce an estimate that differs by ±38% from the actual enrollment experienced at the site, and by ±86% from actual patients per site per month.

With variation in the magnitude and direction of metrics at the therapeutic and protocol level, we think it will be quite difficult to identify an algorithm to create a reliable adjustment factor to bring the public data means in line with the median calculable from CTMS data at the site level.

In addition to the impact on study planning, lack of access to accurate site-level performance metrics can also affect site identification. Selecting sites without objective evidence is a key contributor to 10% to 20% of investigative sites failing to enroll a single patient. Furthermore, an additional 37% of sites under-enroll.1

The combination of inaccurate study planning and poor site identification/site selection leads to a large proportion of studies requiring the addition of “rescue sites” (sites that are added in order to address a shortfall in patient enrollment). Requiring rescue sites to complete a trial has negative implications for costs and trial timelines. The cost associated with initiating a site is estimated at $30,000, and the estimated timeline to move from pre-visit through to site initiation is eight months.5,6

Finally, moving beyond the study sponsor perspective, unrealistic enrollment targets based on public data sources could also have a significant impact on investigator satisfaction, leading to turnover, which is a recognized issue in clinical development. 

From our analyses, we conclude that robust evidence-based study planning and site identification require access to accurate site-level information, which is not available from public sources (i.e., public sources contain information only at the study level). Site-level data allow calculation of not only the median, but also the mean, min/max and standard deviation, all of which can help to round out the full picture of data needed for study planning and site identification.

When the first version of ClinicalTrials.gov was made available to the public in February 2000, it was aimed at standardizing reporting of key trial features to allow searching and to support good reporting practices. As can be seen by the fact that the database now includes over 216,000 registered studies, ClinicalTrials.gov has greatly expanded access to trial information for both the public and industry. When initially conceived, however, ClinicalTrials.gov was not designed for in-depth analyses of clinical study operations metrics.

Recognizing the value of actual site-level data, pharmaceutical companies have begun to look for alternatives to study-level mean data from public sources. With the emergence of cross-pharma company collaborations, such as Investigator Databank and the TransCelerate Investigator Registry, sharing of site-level CTMS data has become an attractive option for increasing the pool of evidence available to support decision-making on enrollment projections, country selection and site identification as part of a multi-factorial approach. Experience from companies participating in these collaborations suggests that they are not only benefiting by decreasing the costs and timelines for their studies through the need for fewer rescue sites, but also helping to sustain the investigator pool by decreasing site administrative burden.7


Claire Sears, PhD, is Communications Director, DrugDev Data Solutions; Elisa Cascade, MBA, is President, DrugDev Data Solutions.

*The authors would like to acknowledge Ken Getz for helpful discussions on the manuscript.



1. Tufts Center for the Study of Drug Development Impact Report, Volume 15, Number 1, January/February 2013


3. Thoelke KR. A Data-driven approach: Are emerging markets the only answer to oncology clinical trial recruitment? Applied Clinical Trials. 21(5).


5. START Study Tufts CSDD, 2012. Ken Getz’s presentation entitled: Uncovering the drivers of R&D costs.

6. Lamberti MJ, Brothers C, Manak D, Getz K. Benchmarking the study initiation process. Therapeutic Innovation & Regulatory Science. 2013;47(1):101-9.

7. Sears C, Cascade E, Klein T. To Share or Not to Share? Exploring the Benefits from Cross-Company Data Sharing for Study Planning and Investigator Selection. Applied Clinical Trials. 25(8).
