The Quality of Clinical Trials

January 13, 2015
Peter Malamis, Michael J. Howley
Applied Clinical Trials

This article is the first in a series of articles that will review the development and findings of the first scientific measurement of quality in clinical trials.

Abstract

Despite the scientific nature of the clinical trial industry, the quality measures commonly used to measure trial quality lack a scientific foundation. This article is the first in a series of articles that will review the development and findings of the first scientific measurement of quality in clinical trials. The methods utilized have been cited in 16,565 papers and successfully applied across many different industries, but have not been previously used in the clinical trials industry. In this paper, we begin by noting that measuring service quality requires very different techniques from measuring product quality and describe the different dimensions of quality.

The overall quality score for clinical trials averaged 6.2 on a 1 to 10 scale, with considerable variation around the mean. The overall performance score was 6.3 across the sales & contracting, study startup, conduct, and closeout stages. Both performance and quality varied by the number of subjects, phase, number of sites, and patients/site.

While the relatively low quality and performance scores are concerning, the considerable variation in the scores is most worrisome. Given the influence of contextual factors on quality and performance, it is particularly important to use multivariate predictive modeling when comparing performance across clinical trials. Benchmarking to averages will be biased. In future papers, we will drill down on the specific drivers of quality and performance in clinical trials.

Clinical trials are a $71 billion industry that offers services critical to the business success of their sponsors and the health and welfare of patients.1 Despite this importance, there are no widely accepted, scientifically validated measures to track the quality of clinical trials. In a review of the literature and survey of management practice, we have been unable to identify any clinical trial quality measures that meet standard criteria of validity and reliability. This has created an uncomfortable situation: trial sponsors and managers must depend on indirect indicators of clinical trial quality (e.g. meetings to see how the trial is progressing or operational metrics). We see the multiple quality consortia (e.g. the MCC, Avoca Quality Consortium, or TransCelerate), the Quality by Design (QbD) movement, risk-based management, and increasing regulatory oversight as earnest responses to the need to measure quality. We do not believe, however, that they will succeed because they fail to meet current scientific standards for quality measurement.

This lack of scientific quality measurement contributes to the industry’s struggles to address problems like cost overruns and adherence to timelines. Effective trial management and quality improvement depend on scientific measurement.2 Valid and reliable quality measures allow managers to identify and adopt best practices to control costs and timelines. Appropriate clinical trial management can only happen if there are scientific quality metrics to improve clinical trial processes. From a practical standpoint, scientific quality measurement enables unbiased benchmarking, key driver identification, and predictive analytics.

The purpose of this article is to describe the scientific basis for quality measurement and describe the top-line results of our research using this methodology. The methods described in this article are based on widely accepted academic research and successfully applied across multiple industries. This article also serves as an introduction to a series of papers on clinical trial performance and quality using these methods. In subsequent papers, we will focus on the specific drivers of performance and quality.3

Service Quality   

The first step in quality measurement is to properly define what you are measuring.4 The quality construct has an ethereal nature which can make it difficult to define.5 Juran, in his seminal book on Quality by Design, got so frustrated with the proliferation of shoddy quality definitions that he suggested an international panel to develop a definition for quality.6, pgs. 10-11 But we don’t need an international panel. Academic researchers also noticed the problems of measuring and defining quality and created a conceptual foundation for defining and measuring quality. After a decade of debate, refinement, and 16,565 citations, their work still stands as the standard for defining and measuring quality.7,8,9

It is important to distinguish service quality from product (i.e. manufactured goods) quality.10 Services are fundamentally different from products, which means measuring service quality is also very different from measuring product quality. So while the quality of a manufactured pill can be assessed with the usual operational metrics, the quality of conducting a clinical trial requires a different approach to measurement.11

The primary way in which services are different from products is that they are intangible and heterogeneous.12 First, services are intangible: they lack objective attributes that can be directly observed. In measuring clinical trials, then, we need to depend on evaluations from specific expert witnesses, meaning a move to purposive instead of random sampling. Second, services are heterogeneous. Each clinical trial is different from previous trials; even when the same protocol is replicated, there will be contextual and management challenges that lead to heterogeneity. This creates a couple of problems: If every trial is different from all other trials, how do you create standards for performance, and who gets to decide if they meet the standards? The customer (in the case of clinical trials, the sponsor13 or patients) has the prerogative of evaluating the clinical trial because the customer is the one in a position to judge the value created and adjust expectations to the context. So while there is an intuitive appeal to adopting manufacturing approaches to measuring quality and performance, clinical trials are services and not products. Since services and products are fundamentally different, applying manufacturing measurement to a clinical trial service is not appropriate.

An example will help illustrate the differences between goods and services. Imagine, for example, that you drive your car to a doctor’s appointment. Evaluating the quality attributes for the car is straightforward. As a manufactured good, the car has tangible attributes that can be directly observed and measured. But evaluating the quality of the doctor’s visit is very different. In this service, the doctor is applying his knowledge and skill for your benefit,14 much of which is not directly observable. There will also be different standards by which you evaluate your doctor based on the context. If the doctor is working you into the schedule in the midst of a flu epidemic, for example, your standards for the amount of time and explanations your doctor gives you will be much different than for a long-scheduled visit for a routine physical. You, as the patient, retain the prerogative of evaluating the quality of the service and will automatically update your expectations for the different context.15

Given this conceptual foundation, quality is the sponsors’ judgment about the overall excellence or superiority of a clinical trial.16 There are five dimensions of service quality: reliability, assurance, tangibles, empathy, and responsiveness (i.e. RATER dimensions). Reliability means that the service provider is knowledgeable, skilled, and will deliver on the promises made by their sales team. Assurance refers to the ability of the service provider to inspire trust and confidence. Tangibles refer to the physical manifestations of the service such as the case report form or marketing materials in a clinical trial. Empathy means the caring and individualized attention from the provider. In a clinical trial, empathy means that the study team conveys a sense of the importance of the trial to the sponsor. Responsiveness means that the provider is willing to adapt to the needs of the client or recover from service failures. They respond quickly to questions or problems as they arise.

Quality is different from performance or satisfaction. Performance refers to the specific activities that make up the clinical trial process. Since a service is the application of knowledge and skills for the benefit of another,17 the sponsor must evaluate the observable physical actions, mental activities, knowledge, and skills of the study team performance.18 Quality is also different from satisfaction, or the summary and affective fulfillment response from a service.19 Satisfaction is an outcome of quality. The relationships between performance, quality, and satisfaction are shown in Figure 1.

Methods

In order to fully explore performance and quality in clinical trials, we adopted an exploratory posture in this research. All of the research was conducted in association with CRO Analytics. Applied Clinical Trials collaborated on the data collection. The research was conducted in two phases. Phase I was a qualitative methodology in which we interviewed industry experts on performance and quality drivers of clinical trials.20 Our research subjects intuitively broke down performance activities into four distinct stages: sales & contracting, study startup, conduct, and closeout. Phase II of our research was a quantitative study in which we purposively sampled experienced industry executives in an online survey solicited through our contacts, industry association appeals, and outreach to Applied Clinical Trials subscribers.

In this paper, we will focus on the global assessments of performance and quality. The respondents (n = 300) evaluated the overall performance of an individual stage of a trial in which they had recently participated on a 1 to 10 scale (“Please evaluate the overall performance of the study team on … sales and contracting… study startup… conduct… or closeout”). We also asked a subset of respondents (n = 52) with full-trial responsibility about the overall quality of a clinical trial for which they were responsible (“Please rate the overall quality of the service provided by the study team in this trial”). They also completed a SERVQUAL instrument, shown in Table 1, to assess the quality dimensions in clinical trials, which was analyzed with path modeling using SmartPLS 2.0. Based on the feedback of respondents in Phase I of our research, we did not include the tangibles dimension. The quality and performance evaluations were not necessarily linked to the same trial. All respondents were also asked about the phase of the trial, therapeutic area, and regions of the world in which the trial took place.

Findings

Most of the studies recruited in North America (42%) and Europe (28%), with Russia (10%), India (6%), China (3%), Japan (4%), other Asia (7%), and South/Central America (1%) having less activity. The average number of subjects was 1,068 per trial (sd = 3,289) and the average number of sites was 97 per trial (sd = 232). For both sites and subjects, there was a long right tail on the histogram. With regard to phase, 39% of the studies were from Phase II, 51% were from Phase III, and 10% were from Phase IV.

Results: Quality

There was considerable dispersion of the overall quality ratings, as illustrated by the histogram in Figure 2. The wide variation of overall quality scores pulls the mean (average) down to 6.2 and the median to 7. Not surprisingly, the standard deviation (2.3) and coefficient of variation (37%) are high.
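
The summary statistics quoted above can be reproduced with a few lines of Python. This is a minimal sketch rather than the authors' analysis: the `ratings` list is hypothetical, invented only to show how a long left tail pulls the mean below the median and the mode. The final lines apply the coefficient of variation formula (standard deviation divided by mean) to the two figures reported in the text.

```python
import statistics

# Hypothetical 1-to-10 quality ratings, invented for illustration only.
# These are NOT the study's data.
ratings = [8, 8, 8, 7, 9, 7, 6, 8, 5, 3, 2, 10, 4, 7, 1]

mean = statistics.mean(ratings)
median = statistics.median(ratings)
mode = statistics.mode(ratings)
sd = statistics.stdev(ratings)   # sample standard deviation
cv = sd / mean                   # coefficient of variation

print(f"mean={mean:.1f} median={median} mode={mode} sd={sd:.1f} cv={cv:.0%}")

# The same ratio applied to the summary figures reported in the text.
reported_cv = 2.3 / 6.2
print(f"reported cv={reported_cv:.0%}")  # prints "reported cv=37%"
```

A few very low ratings are enough to separate the mean from the mode, which is why the histogram and the dispersion measures matter as much as the average.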

The effects of the RATER dimensions on quality are shown in Figure 3. There was convergent and discriminant validity within and between the quality dimensions. The four dimensions we used explained a substantial proportion of the variance (R² = .85) in the overall quality construct. The reliability coefficient (0.73) was much more substantial than is typically seen in other industries.21 Assurance (0.15) estimate of what is typically seen across industries. Empathy (0.06) was slightly higher than typically seen (0.02 – 0.04). Responsiveness (0.01), on the other hand, was much lower than is typically seen in other industries.

Quality varies by phase of the trial. The average quality for Phase II (μ = 7.0, sd = 2.1) was higher than for Phase III (μ = 5.6, sd = 2.4) and Phase IV (μ = 5.4, sd = 2.2). So while Phase II had the highest quality, it also had the most consistent (sd = 2.1) quality ratings. Phase IV (μ = 5.4) had the lowest quality ratings and Phase III had the greatest amount of variation (sd = 2.4) in its ratings.

Quality varied by the number of subjects (Figure 4) and investigative sites (Figure 5). All of the scatterplots were smoothed with a 50% Epanechnikov kernel, and a dotted line marks the average quality or performance level. Quality generally declines as the number of subjects increases, except for a bump upward above the average between about 600 and 1,000 subjects. Overall quality has a different, U-shaped relationship to the number of investigative sites: quality decreases as the number of sites increases until about 75 sites, and then trends upward.
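
For readers unfamiliar with kernel smoothing, the sketch below shows one common way to draw a smoothed line through a scatterplot with an Epanechnikov kernel. It rests on assumptions: the article does not describe the software settings, so reading "50%" as "use the nearest half of the points around each location" is a guess, and the `sites` and `quality` values are hypothetical, not the study's data.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) for |u| <= 1, otherwise 0."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_smooth(xs, ys, x0, span=0.5):
    """Kernel-weighted mean of ys at x0, using the nearest `span` share of points."""
    distances = sorted(abs(x - x0) for x in xs)
    k = max(2, round(span * len(xs)))
    # Bandwidth = distance to the k-th nearest point (assumed reading of "50%").
    # Falls back to 1.0 when that distance is zero.
    h = distances[k - 1] or 1.0
    weights = [epanechnikov((x - x0) / h) for x in xs]
    total = sum(weights)
    if total == 0:
        # Every point in the window sits exactly on its edge: use the plain mean.
        return sum(ys) / len(ys)
    return sum(w * y for w, y in zip(weights, ys)) / total


# Hypothetical data with a U-shaped pattern, invented for illustration only.
sites = [10, 20, 40, 60, 75, 90, 120, 150, 200, 300]
quality = [7.5, 7.0, 6.2, 5.5, 5.0, 5.4, 6.0, 6.4, 6.8, 7.2]

smoothed = [kernel_smooth(sites, quality, s) for s in sites]
for s, q in zip(sites, smoothed):
    print(f"sites={s:>3} smoothed quality={q:.2f}")
```

The smoother weights nearby trials most heavily and ignores distant ones, so it can reveal a U-shape that a single straight regression line would miss.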

Quality also varied by site intensity, or the number of patients per site (Figure 6). Quality increases as intensity increases up to about 15 subjects per site, then declines, dropping at about 20 subjects per site before returning to the average.

Performance

The performance scores broken out by stage of the trial are in Table 2. Perhaps not surprisingly, study startup (μ= 5.6) had the lowest performance ratings while conduct (μ= 6.5) and closeout (μ= 6.6) had the highest ratings. There was considerable variation in the performance evaluations, shown in the histogram in Figure 7. The average (μ = 6.3) and standard deviation (sd =2.2) were similar to the quality variable.

The number of subjects in the trial affected stage performance. We found differences by stage of trial, so they are illustrated separately in Figure 8. In study startup and sales and contracting, performance declines as the number of subjects increases. For conduct and closeout, performance drops until about 300 to 400 subjects and then increases as the number of subjects increases. The number of sites had a negative impact on performance, as shown in Figure 9. Intensity, shown in Figure 10, had generally beneficial effects on performance, peaking at about 50 patients per site.

Conclusion

The results of this research raise concerns about clinical trial quality and trial stage performance. At the very least there is room for substantial improvement in a critical area of an industry at the crossroads of margin and pricing pressures. If you just look at the modal score (8 out of 10) on the histogram in Figure 2, then quality is reasonable: a solid B in educational terms. But the average of these scores was 6.2, or a D. Clearly there is a need for further insight into the factors that drive clinical quality across all stages and how to improve quality. Our research addressed this and in future papers we will drill down on these specific drivers of performance and quality.

The conceptual basis for service quality measurement has important implications for the clinical trials industry. First, it is important that stakeholders accept and incorporate the maxim that to measure quality one must in fact measure quality and not rely on metrics simply because they are available, traditional, or endorsed by opinion leaders. All quality measures should meet basic standards for validity and reliability. Second, the idea of accounting for context and allowing the “customer”, whether sponsor or patient, the prerogative of evaluating the quality of the service links to the patient centricity movement making inroads into clinical trials. The weights attached to the quality dimensions in the clinical trials industry are also interesting. The emphasis on reliability and empathy (i.e. the importance of knowledge of that specific trial) makes sense, but the minimal impact of responsiveness as a driver of quality is surprising. It may be perceived within the industry that it is more important to adhere to scientific and regulatory standards than to be responsive to sponsor wants and needs. This thought is speculative, but we will track these proportions as we accumulate data.

The variation in the performance and quality scores is important. Both scores had a ‘wide base’ on their histograms and elevated standard deviations and coefficients of variation. This finding emphasizes the importance of understanding risk, or the probability of low quality or performance scores, in clinical trials. In future papers in this series, we will drill down on the specific drivers of negative outcomes.

We found that a variety of factors influence quality, including the phase of the trial, the number of subjects, and the number of sites. The implication of these findings is that comparisons of performance and quality between trials must be adjusted for factors like phase and number of subjects and sites before they can rightfully be made. Methodologically, this means that quality measurement must be analyzed with multivariate (typically regression) models. In practice, these findings should put an end to benchmarking to simple averages. Adjusting quality and performance scores for these multiple factors demands multivariate models. Simple (i.e. univariate) benchmarking is prone to bias, so it is unfair to compare trials based on averages. This should especially be a concern for those firms engaged in risk-based contracting or performance incentives in their contracts.
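
To make the contrast between benchmarking to an average and benchmarking to a context-adjusted expectation concrete, here is a minimal regression sketch. It is an illustration under stated assumptions, not the model used in this research: the six `trials` are hypothetical, the two predictors (a Phase III indicator and the number of sites in hundreds) are chosen only for the example, and `solve` and `fit_ols` are small helper functions written for this sketch.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    x = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical trials, invented for illustration only (NOT the study's data):
# (Phase III indicator, number of sites in hundreds, overall quality rating)
trials = [
    (0, 0.5, 7.45), (0, 1.0, 6.60), (0, 1.5, 6.95),
    (1, 1.0, 6.30), (1, 2.0, 4.90), (1, 3.0, 5.30),
]
predictors = [(p, s) for p, s, _ in trials]
quality = [q for _, _, q in trials]

coef = fit_ols(predictors, quality)
grand_mean = sum(quality) / len(quality)

# Score each trial against the simple average and against its own context.
adjusted = []
for (p, s), q in zip(predictors, quality):
    expected = coef[0] + coef[1] * p + coef[2] * s
    adjusted.append(q - expected)
    print(f"phase_III={p} sites={s * 100:.0f} quality={q:.2f} "
          f"vs_average={q - grand_mean:+.2f} vs_expected={q - expected:+.2f}")
```

In this made-up data set, the Phase II trial with 100 sites scores above the simple average yet below what the model expects for a trial of its kind, while the Phase III trial with 300 sites scores below the average yet above its expectation. That reversal is the bias that benchmarking to averages introduces.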

We understand that some readers may be uncomfortable with this approach to measuring performance and quality. In presenting these data at conferences, we encounter people who prefer operational data because they are ‘more objective.’ In their view, the evaluations of quality and performance described here seem too subjective: ‘just somebody’s opinion.’ As we have shown, attempts to use operational data as a service quality measure are fundamentally and logically flawed because they fail to account for the differences between manufactured goods and services. Secondly, objective data, especially operational data, typically lack validity as quality indicators. Take, for example, the number of days it takes to recruit patients. Days is a measure of time, not quality. Whether or not 76 days to recruit patients is high-performing recruiting depends on the individual trial, so it lacks validity as a quality measure. Thirdly, these assessments are evaluations, not opinions. Remember that we used purposive sampling. We are identifying the individuals who witnessed certain aspects of the trial. They are experts in that aspect of a clinical trial. We simply ask them to evaluate the performance or quality that they witnessed. Finally, the performance and quality measures we describe here all meet the statistical standards for validity and reliability. We are not aware of any operational metrics that can meet these basic scientific standards.

In conclusion, we have identified a remarkable situation in the clinical trials industry: despite the scientific basis of the industry, there are no scientifically valid measures of the quality of clinical trials. In this article, we have described some of the conceptual foundation and reviewed some of the high-level quality and performance results of our research. In future papers in this series, we will drill down on the specific drivers of quality and performance in clinical trials.

References

  1. A. Schafer, “CRO Market Size: Growth in a Flat R&D World,” Applied Clinical Trials, February 03, 2014, http://www.appliedclinicaltrialsonline.com/cro-market-size-growth-flat-rd-world (accessed December 28, 2014).
  2. This idea is often misquoted as ‘you can’t manage what you don’t measure’ from Peter Drucker or W. Edwards Deming (http://www.behindthatquote.com/what-get-measured-get-managed/). For a more nuanced discussion, see D.R. Spitzer, “Performance Measurement Promotes Effective Management,” Transforming Performance Measurement, (American Management Association, New York, 2007), page 13.
  3. An earlier paper describing the drivers of study startup performance can be found at M.J. Howley and P. Malamis, “High Performing Study Startups,” Applied Clinical Trials, 23 (6/7) 20-28. Released online on February 20, 2014 at http://www.appliedclinicaltrialsonline.com/high-performing-study-startups (accessed November 20, 2014).
  4. M.R. Furr, Scale Construction and Psychometrics for Social and Personality Psychology, (Sage, Los Angeles; London, 2011).
  5. The American philosopher Robert Pirsig tells the story of a professor who attempts to define quality, but ends up losing his sanity. We all feel Pirsig’s frustration as we attempt to define quality, but it is an essential step. R. M. Pirsig, Zen and the Art of Motorcycle Maintenance: An Inquiry into Values. (Random House, Chicago, 1974).
  6. J.M. Juran, Quality by Design: The New Steps for Planning Quality into Goods and Services, (Simon and Schuster, New York, 1992).
  7. A. Parasuraman, V.A. Zeithaml, and L. L. Berry, "SERVQUAL," Journal of Retailing 64(1) 12-40 (1988).
  8. A. Parasuraman, V.A. Zeithaml, and L. L. Berry. "A Conceptual Model of Service Quality and its Implications for Future Research." Journal of Marketing 49(Fall) 41-50 (1985).
  9. The citation count was taken from Google Scholar on December 30, 2014.
  10. A. Parasuraman, V.A. Zeithaml, and L. L. Berry. "A Conceptual Model of Service Quality and its Implications for Future Research." Journal of Marketing 49(Fall) 41-50 (1985).
  11. The principles of measuring service quality apply equally to situations in which the trial is conducted internally and those where the trial is outsourced to a functional or full-service CRO.
  12. We will only illustrate 2 of the differences here to conserve space. For a complete discussion, see V.A. Zeithaml, M.J. Bitner, and D.D. Gremler, Services Marketing, (McGraw-Hill Irwin, New York, 2013).
  13. Again, the sponsor may be internal or external to the organization actually conducting the research.
  14. This is the definition of a service and can be found at S.L. Vargo and R.F. Lusch. "Evolving to a New Dominant Logic for Marketing." Journal of Marketing 68(1) 1-17 (2004).
  15. W. Boulding, A. Kalra, R. Staelin, and V.A. Zeithaml, “A Dynamic Process Model of Service Quality: From Expectations to Behavioral Intentions,” Journal of Marketing Research 30(1) 7-27 (1993).
  16. V.A. Zeithaml, "Consumer Perceptions of Price, Quality, and Value: a Means-End Model and Synthesis of Evidence," Journal of Marketing 52(July) 2-22 (1988).
  17. S.L. Vargo and R.F. Lusch. "Evolving to a New Dominant Logic for Marketing." Journal of Marketing 68(1) 1-17 (2004).
  18. L.L. Berry and N. Bendapudi. "Clueing in Customers." Harvard Business Review 81(2) 100-106 (2003).
  19. R.L. Oliver, "Measurement and Evaluation of Satisfaction Processes in Retail Settings." Journal of Retailing 57(3) 25-48 (1981).
  20. The details of the methodology can be found in M.J. Howley and P. Malamis, “High Performing Study Startups,” Applied Clinical Trials, 23 (6/7), 20-28. Released online on February 20, 2014 at http://www.appliedclinicaltrialsonline.com/high-performing-study-startups (accessed December 12, 2014).
  21. The multi-industry comparisons are taken from A. Parasuraman, V.A. Zeithaml, and L. L. Berry, "SERVQUAL," Journal of Retailing 64(1) 12-40 (1988).