Harnessing the Power of Scientific Surveillance

Applied Clinical Trials, Applied Clinical Trials-04-01-2023, Volume 32, Issue 4

The statistical methods this technique applies within centralized monitoring.

Although most clinical trial data quality benefits from centralized statistical monitoring, critical data, processes, and associated risks identified via risk assessment following the Clinical Trials Transformation Initiative (CTTI) framework1 should drive the overall monitoring approach. Including scientific surveillance techniques as part of centralized monitoring2 enhances our ability to monitor critical and non-critical data; these techniques are particularly relevant, and recommended, for trials at higher risk of measurement error, such as those with patient-reported outcomes (PROs) and clinician-reported outcomes (ClinROs).

Methods employed by scientific surveillance that enhance our ability to monitor and mitigate risks are well aligned with regulatory guidance document recommendations on including statistical surveillance as a key component of centralized monitoring. These include FDA guidance on risk-based monitoring3; a European Medicines Agency reflection paper from 20134; FDA’s Q9 (R1) draft from 20225; International Council for Harmonisation (ICH) E6 (R2)6 and E8 (R1)7; and guidance from the National Medical Products Administration Center for Drug Evaluation (CDE). Some guidance documents call out specific statistical monitoring methods (e.g., the 2022 FDA Q9 (R1) and CDE guidance documents), many of which are included in scientific surveillance.

Application of scientific surveillance

Many factors can degrade data quality8-11 in clinical trials and result in scientifically incompatible data. These may include:

  • Heterogeneity in assessments arising from the subjectivity of PROs
  • Rater changes, rater’s “drift” (changes in rater behavior across different test administrations), and fatigue
  • Expectation biases among trial investigators and subjects
  • Uses of prohibited medications and changes in background medications that affect study endpoints, directly or through interactions with study treatment
  • Clinician/site personnel changes
  • Insufficient subject engagement
  • Fraud
  • Increasingly complex protocols, such as adaptive designs or treatment regimens specific to certain geographic regions or to subjects with specific demographic characteristics

These factors arise for myriad reasons, such as carelessness, insufficient training, or lack of engagement, and manifest as too much or too little variability, inconsistencies between related scales, “too-good-to-be-true” values, or implausible temporal shifts in outcomes, causing systematic irregularities and increasing measurement error. In placebo-controlled studies, variability can also be introduced by the subject, caregiver, or investigator, and placebo response rates can influence outcomes profoundly. Additionally, predicting key study-level risks, such as premature treatment discontinuation or loss to follow-up, facilitates taking early actions to minimize these risks.

Scientific surveillance includes methods that fall into three main categories:

  • Detecting scientific inconsistencies (or incompatibility)
  • Risk prediction
  • Detection of inflated placebo response on blinded data

Detecting scientific incompatibility

For many types of endpoints, no systematic differences are expected between sites, so statistical validity is enhanced when sites’ measurements are similar in various respects. For instance, within a given scale (such as a depression rating scale), item scores are expected to exhibit similar correlations at every site; showing that these correlations are in fact similar across sites supports validity.

Detection of scientific incompatibility begins with the calculation of site-level correlations (among item scores within a single scale or among scores from different scales) and the corresponding overall study correlations. The hypothesis that only random variation exists between the outcomes of interest is tested. For a site to be declared unusual, it must lie atypically far from a central point calculated using data from all sites. Distance metrics and simulation-based statistical tests are applied, the false discovery rate is controlled, and sites of concern are then identified.12,13
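As a rough illustration (not the authors' implementation), the two steps above can be sketched in Python. The helper names `site_correlation_distances` and `benjamini_hochberg` are hypothetical, the distance metric shown (Frobenius norm on the upper triangle) is one of several reasonable choices, and in practice the per-site p-values would come from the simulation-based tests described above rather than being supplied directly:

```python
import numpy as np

def site_correlation_distances(scores_by_site):
    """For each site, distance between its item-score correlation matrix
    and the matrix pooled over all sites (Frobenius norm on the upper
    triangle). Illustrative distance metric; not the authors' exact method."""
    pooled = np.corrcoef(np.vstack(list(scores_by_site.values())), rowvar=False)
    iu = np.triu_indices_from(pooled, k=1)
    distances = {}
    for site, scores in scores_by_site.items():
        if scores.shape[0] < 3:          # method requires >= 3 subjects per site
            continue
        site_corr = np.corrcoef(scores, rowvar=False)
        distances[site] = np.linalg.norm(site_corr[iu] - pooled[iu])
    return distances

def benjamini_hochberg(pvals, q=0.05):
    """Return indices of hypotheses rejected at false discovery rate q."""
    pvals = np.asarray(pvals, dtype=float)
    order = np.argsort(pvals)
    m = len(pvals)
    thresh = q * np.arange(1, m + 1) / m    # BH step-up thresholds
    below = pvals[order] <= thresh
    if not below.any():
        return np.array([], dtype=int)
    k = np.nonzero(below)[0].max()           # largest k with p_(k) <= k*q/m
    return order[: k + 1]                    # reject all smaller p-values too
```

Sites whose distances have small simulation-based p-values surviving the false discovery rate control would be the "sites of concern."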


Subject-level correlations among item scores within a single scale or among scores from different scales and the corresponding overall study correlations also are calculated. Distances between each subject-level correlation and the overall study correlation are calculated and standardized by dividing by the number of non-missing elements in the subject’s correlation matrix.14

Figure 1 below shows two sites flagged (dark blue) based on site-level correlations for absolute values for all items within a PRO questionnaire (e.g., EQ-5D-5L) and four sites flagged for change from baseline values for the same questionnaire compared to all other sites after controlling the false discovery rate.

Subject-level multivariate means (calculated using all relevant outcomes of interest) and correlations are screened for inliers and outliers by applying the standard boxplot rule to distance measures between each subject and the overall study: observations lying at least 1.5 times the interquartile range (IQR) above the upper quartile or below the lower quartile are classed as unusual.
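The boxplot rule described above is straightforward to express in code. A minimal sketch, with the hypothetical helper name `tukey_outliers`, applied to the per-subject distance measures:

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Flag values lying more than k times the interquartile range (IQR)
    above the upper quartile or below the lower quartile (Tukey's rule)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.flatnonzero((values < lower) | (values > upper))
```

Here `values` would be the per-subject distances from the overall study; the returned indices identify the unusual subjects (both inliers, with suspiciously small distances, and outliers).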

Figure 2 below depicts three subjects (shown as bubbles below the lower dotted control line) with a multivariate mean (using two PRO scales) below the lower quartile compared to all other subjects in the study.

Control charts, recommended as statistical monitoring tools in FDA's 2021 Q9(R1) Quality Risk Management guidance, are used to compare a site-specific summary measure to the study reference. Outlying sites, identified as having means or variances two or three standard deviations from the study reference, are flagged for further investigation (subject to limitations discussed later in this article).
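A simplified sketch of such a control-chart comparison for site means (the same idea extends to variances) follows; the function name `control_chart_flags` is hypothetical, and estimating the study reference from the pooled data, with limits that narrow as site size grows, is an assumption of this sketch:

```python
import numpy as np

def control_chart_flags(site_values, n_sigma=2.0):
    """Compare each site's mean to the study reference (here, the pooled
    mean), flagging sites outside n_sigma standard errors of the reference."""
    pooled = np.concatenate(list(site_values.values()))
    ref_mean, ref_sd = pooled.mean(), pooled.std(ddof=1)
    flagged = []
    for site, vals in site_values.items():
        se = ref_sd / np.sqrt(len(vals))     # control limits shrink with site size
        if abs(vals.mean() - ref_mean) > n_sigma * se:
            flagged.append(site)
    return flagged
```

Setting `n_sigma` to 2 or 3 reproduces the two- or three-standard-deviation limits mentioned above.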

Figure 3 below shows two sites at least two standard deviations below the study reference (shown as bubbles below the lower dotted control line) when comparing variance for an outcome of interest, pointing to unusually low variability at these sites. Bubble size indicates the number of subjects at a site, with larger bubbles corresponding to more subjects relative to other sites.

Statistical summaries for outcomes of interest (such as those also mentioned later in this article) are visually inspected for total score/result inconsistencies, especially for sites and subjects that were flagged using multivariate inliers, outliers, and correlations.15 Subject-level outcomes within flagged sites are then visually inspected to identify the source of the inconsistencies. For example, item scores within a PRO questionnaire, such as EQ-5D-5L, might be seen as repeated (i.e., response propagation) for multiple subjects across multiple visits at a site flagged based on site-level correlations.

Control charts also are used to compare site-specific responder rates based on count-type data (e.g., a 30% reduction in monthly migraine days by a defined timepoint) to the study reference value. Poisson regression with an offset is used.
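To illustrate the idea, the sketch below compares each site's responder count to the study-wide rate using a Poisson normal approximation; this is a simplified stand-in for the full Poisson regression with a log-exposure offset mentioned above, and `responder_rate_flags` is a hypothetical name:

```python
import math

def responder_rate_flags(site_counts, site_exposure, n_sigma=3.0):
    """Flag sites whose responder counts deviate from the study-wide rate.
    Uses a normal approximation to the Poisson (variance = mean), a
    simplified stand-in for Poisson regression with an offset."""
    study_rate = sum(site_counts.values()) / sum(site_exposure.values())
    flagged = {}
    for site, count in site_counts.items():
        expected = study_rate * site_exposure[site]   # offset: site exposure
        z = (count - expected) / math.sqrt(expected)  # Poisson z-score
        if abs(z) > n_sigma:
            flagged[site] = round(z, 2)
    return flagged
```

Exposure here could be subject-months of follow-up per site; the regression formulation additionally allows covariate adjustment.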

Indication-specific consistency checks are implemented as applicable. For example, the primary endpoint for schizophrenia protocols is often the Positive and Negative Syndrome Scale (PANSS) total score. The International Society for CNS Clinical Trials and Methodology convened an expert working group that established consistency/inconsistency flags for the PANSS in 2017. The general strategy was to define irregular (e.g., excessively variable or insufficiently variable) scoring patterns, as well as incompatibility in scoring among items within the scale (i.e., cross-sectionally) and between assessments (i.e., longitudinally). The working group identified and classified 24 flags based on the extent to which they suggest error (possibly, probably, very probably, or definitely) within PANSS items or across repeated assessments. Scientific surveillance includes analyzing these flags for consistency to drive targeted actions.
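As one illustrative longitudinal pattern of the kind such flags target (not a verbatim reproduction of any of the 24 published flags), a check for assessments whose item scores are all identical between consecutive visits might look like this; `identical_visit_flags` is a hypothetical name:

```python
def identical_visit_flags(visits):
    """Flag pairs of consecutive assessments whose item scores are all
    identical. Zero change on every item of a long rating scale between
    visits is an implausible 'possible error' pattern; each visit is a
    list of item scores in a fixed order."""
    flags = []
    for i in range(1, len(visits)):
        if visits[i] == visits[i - 1]:
            flags.append((i - 1, i))   # (earlier visit, later visit)
    return flags
```

In production, each flag would carry its published error classification (possibly, probably, very probably, or definitely) to prioritize follow-up.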

The statistical monitoring methods summarized previously are applicable to a broad array of trials, spanning different therapeutic areas and clinical indications such as neuroscience, dermatology, immunology, oncology, cardiovascular outcomes, and others. Table 1 below provides examples of outcomes monitored through scientific surveillance across various therapeutic areas and indications. Key findings and recommended actions are then summarized at the study and site levels, based on these analyses.

Predicting risks

Ideally, risks are predicted early in studies before they materialize. Early prediction facilitates adjustments to processes and retraining of involved clinical personnel so that problems do not worsen. If, for example, a misleading statement in a case report form completion guideline leads study sites to conduct an assessment at the wrong time in subjects’ visits, then early identification of the incorrect assessment times, using early on-study data, can facilitate correcting the misleading statement before many visits are affected.

Risks are predicted using Bayesian statistical analysis, which, unlike many traditional analyses, can generate probabilities of future (or otherwise unknown) events.

For instance, Bayesian prediction can generate the probability that, once the last subject completes the last visit, the to-be-observed mean deviations per subject will exceed a pre-specified threshold corresponding to some minimally acceptable level of trial quality.

In scientific surveillance, Bayesian predictions about variables of interest consist of probabilities that future, to-be-observed proportions of subjects, or means over subjects, will meet given criteria, assuming a specific number of future subjects will be observed. The criteria usually consist of upper and/or lower thresholds.

While the choices of thresholds may vary, they often correspond to one and two times the 95% upper one-sided asymptotic limits based on the trial sample size and the prior expected value selected.
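For binary variables, the Bayesian prediction described above can be sketched with a Beta-Binomial posterior predictive simulation. This is a generic illustration under an assumed conjugate Beta prior, not the authors' specific model, and `prob_future_proportion_exceeds` is a hypothetical name:

```python
import numpy as np

def prob_future_proportion_exceeds(successes, n_obs, n_future, threshold,
                                   a=1.0, b=1.0, draws=100_000, seed=0):
    """Posterior predictive probability that the final observed proportion
    (over n_obs + n_future subjects) will exceed a quality threshold,
    using a Beta(a, b) prior on the underlying rate."""
    rng = np.random.default_rng(seed)
    # Posterior for the rate after observing `successes` out of `n_obs`
    p = rng.beta(a + successes, b + n_obs - successes, size=draws)
    # Predictive draws for the remaining subjects
    future = rng.binomial(n_future, p)
    final_prop = (successes + future) / (n_obs + n_future)
    return float(np.mean(final_prop > threshold))
```

For example, with `successes` counting subjects who have at least one protocol deviation so far, the result is the probability that the end-of-study deviation proportion will exceed the pre-specified quality threshold.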

Assessment of placebo response inflation

In efficacy analyses of clinical trials of investigational drugs, placebo response refers to a tendency of placebo-treated subjects’ efficacy endpoints to improve due to the psychological effects of trial participation rather than any pharmacological effects.

While trial sponsors usually account for expected placebo response levels when designing trials, the actual level in any trial may exceed those expected levels, jeopardizing assay sensitivity, i.e., the ability to distinguish an effective treatment from a less effective or ineffective treatment (as defined in ICH E10 §1.5). Therefore, methods following Hartley (2012) for continuous endpoints16 and Hartley (2015) for binary endpoints17 have been developed and implemented for measuring, without unblinding treatment codes, the Bayesian probabilities that the placebo response exceeds expectations, and that it exceeds expectations by some clinically important amount or more.
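To convey the flavor of a blinded Bayesian assessment for a binary endpoint, the sketch below computes, from pooled (blinded) responder counts in a 1:1 two-arm trial, the posterior probability that the placebo response rate exceeds its expected level. It is a highly simplified grid-approximation sketch in the spirit of, not a reproduction of, the referenced methods: it assumes a known treatment effect `delta` and a uniform prior over the grid, and `prob_placebo_response_exceeds` is a hypothetical name:

```python
import numpy as np

def prob_placebo_response_exceeds(responders, n, delta, expected, grid=999):
    """Posterior probability that the placebo response rate exceeds
    `expected`, from blinded pooled binary data in a 1:1 two-arm trial.
    Assumes the treatment rate is placebo rate + delta (a known, assumed
    effect) and a uniform prior on the placebo rate over the grid."""
    p_pbo = np.linspace(0.001, 0.999 - delta, grid)   # candidate placebo rates
    p_mix = 0.5 * p_pbo + 0.5 * (p_pbo + delta)       # blinded pooled rate
    # Binomial log-likelihood of the pooled responder count
    log_lik = responders * np.log(p_mix) + (n - responders) * np.log1p(-p_mix)
    post = np.exp(log_lik - log_lik.max())            # unnormalized posterior
    post /= post.sum()
    return float(post[p_pbo > expected].sum())
```

In practice, uncertainty about the treatment effect itself would also be modeled, which is part of what the referenced methods address.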


Limitations

The site correlation method described earlier includes only sites with a minimum of three subjects; it cannot incorporate smaller sites.

Similarly, site performance on efficacy measures can be effectively assessed using the methods described in this article only once a minimum number of subjects have completed assessments for the relevant post-baseline visit. Once this is the case, potential actions can be put forward.

The subject-level methods apply only to subjects with data from at least two post-baseline visits. Given the limitations of blinded data, results are interpreted with caution and overinterpretation of subject-level responses is avoided.

The Bayesian placebo response assessments apply only to binary and continuous (normally distributed) data in two-group blinded parallel-group trials.

Process and actions

Scientific surveillance is well integrated into the layered approach to protecting data integrity within centralized monitoring, as shown in Figure 4 below.

As with all risks, the process includes documenting risks that are detected via scientific surveillance in the Risk Assessment and Categorization Tool (RACT).18

Each analysis is documented in the centralized monitoring plan, along with a brief description, and is applied with the appropriate cadence. This is typically in alignment with centralized monitoring reviews after an adequate number of randomized subjects have primary endpoint data available, as per the limitations noted, such that:

  • All relevant data supporting critical endpoints are used in scientific surveillance
  • Scientific surveillance is performed after sufficient accumulation of data on the primary endpoint
  • Results or findings from scientific surveillance are included as part of the full centralized monitoring report
  • The findings from scientific surveillance are discussed in a cross-functional centralized monitoring meeting, including review of all centralized monitoring findings and actions
  • Trends identified as risks are shared and discussed as a part of ongoing risk assessment meetings, which may warrant adding new risks to the RACT
  • Findings and actions from scientific surveillance related to ratings (PROs, ClinROs) generally are passed on to those responsible for quality reviews of that data, which can drive decisions to implement remediation protocols or put certain raters on a watch list
  • The clinical trial manager coordinates most site-level actions with the clinical research associate, especially if findings call for the investigation of site processes
  • Study-level risks, such as training gaps or the need for a protocol amendment or protocol clarification letter for all sites, may be identified to drive actions at the sponsor or program levels

Actions might target trial conduct, such as evaluating site processes or improving subject-caregiver engagement, or might target the analysis plan, such as updating the analysis set definitions to exclude subjects from sites with serious concerns or adding sensitivity analyses to stress-test the impact on the analyses of primary and key secondary endpoints. Data correction may be performed where appropriate.

Conclusions and outlook

Incorporating advanced statistical monitoring methods in centralized monitoring significantly improves scientific integrity in clinical trials, especially those with PROs or ClinROs. Implementation of scientific surveillance within the layered framework of centralized monitoring facilitates risk identification from multiple angles, ultimately contributing to a holistic risk detection and mitigation process resulting in tighter quality control. Trials of the future need to be more resilient to environmental disruptions and evolve with technological advancements.

Implementing scientific surveillance can be quite effective in detecting data errors that carry the highest potential to jeopardize study integrity. These advanced statistical monitoring methods are particularly useful as the clinical trial landscape shifts toward decentralization, coupled with continuous technological advancements in how we collect subject data.

The authors would like to acknowledge John Hamlet, Amy Kroeplin, Dorota Nieciecka, Christopher Perkins, and Timothy Peters-Strickland for the assistance they provided in developing this article.


  1. Quality By Design - Overview. Clinical Trials Transformation Initiative. https://ctti-clinicaltrials.org/our-work/quality/quality-by-design/
  2. Agrafiotis, D.K.; Lobanov, V.S.; Farnum, M.A., et al. Risk-Based Monitoring of Clinical Trials: An Integrative Approach. Clin Therapeutics. 2018, 40 (7), 1204-1212. https://pubmed.ncbi.nlm.nih.gov/30100201/
  3. FDA, Oversight of Clinical Investigations – A Risk-Based Approach to Monitoring (August 2013). https://www.fda.gov/media/116754/download
  4. European Medicines Agency, Reflection Paper on Risk-Based Quality Management in Clinical Trials (November 2013). http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2013/11/WC500155491.pdf
  5. FDA, Q9 (R1) Quality Risk Management (November 2021). https://www.fda.gov/media/159218/download
  6. European Medicines Agency, International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) E6 (R2) Guideline for Good Clinical Practice (December 2016). https://www.ema.europa.eu/en/documents/scientific-guideline/ich-guideline-good-clinical-practice-e6r2-step-5_en.pdf
  7. European Medicines Agency, International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) E8 (R1) General Considerations for Clinical Studies (October 2021). https://www.ema.europa.eu/en/documents/scientific-guideline/draft-ich-guideline-e8-r1-general-considerations-clinical-studies-step-2b_en.pdf
  8. Baigent, C.; Harrell, F.E.; Buyse M.; et al. Ensuring Trial Validity by Data Quality Assurance and Diversification of Monitoring methods. Clin Trials. 2008, 5 (1), 49-55. https://pubmed.ncbi.nlm.nih.gov/18283080/
  9. Venet, D.; Doffagne, E.; Burzykowski, T.; et al. A Statistical Approach to Central Monitoring of Data Quality in Clinical Trials. Clin Trials. 2012, 9 (6), 705–713. https://pubmed.ncbi.nlm.nih.gov/22684241/
  10. George, S.L.; Buyse, M. Data Fraud in Clinical Trials. Clin Investig (Lond). 2015, 5 (2),161–173. https://pubmed.ncbi.nlm.nih.gov/25729561/
  11. Herson, J. Strategies for Dealing with Fraud in Clinical Trials. Int J Clin Oncol. 2016, 21 (1), 22–27. https://pubmed.ncbi.nlm.nih.gov/26194810/
  12. Kirkwood, A.A.; Cox, T.; Hackshaw, A. Application of Methods for Central Statistical Monitoring in Clinical Trials. Clin Trials. 2013, 10 (5), 783–806. https://pubmed.ncbi.nlm.nih.gov/24130202/
  13. Taylor, R.N.; McEntergart, D.J.; Stillman, E.C. Statistical Techniques to Detect Fraud and Other Data Irregularities in Clinical Questionnaire Data. Drug Information Journal. 2002, 36, 115-125. https://link.springer.com/article/10.1177/009286150203600115
  14. Kelly M.O. Using Statistical Techniques to Detect Fraud: A Test Case. Pharmaceutical Statistics. 2004, 3 (4), 237-246. https://doi.org/10.1002/pst.137
  15. Knepper, D.; Lindblad, A.S.; Sharma, G. Statistical Monitoring in Clinical Trials: Best Practices for Detecting Data Anomalies Suggestive of Fabrication or Misconduct. Ther Innov Regul Sci. 2016, 50 (2), 144- 154. https://pubmed.ncbi.nlm.nih.gov/30227005/
  16. Hartley, A.M. Adaptive Blinded Sample Size Adjustment for Comparing Two Normal Means—a Mostly Bayesian Approach. Pharmaceutical Statistics. 2012, 11 (3), 230-240. https://onlinelibrary.wiley.com/doi/abs/10.1002/pst.538
  17. Hartley, A.M. A Bayesian Adaptive Blinded Sample Size Adjustment Method for Risk Differences. Pharmaceutical Statistics. 2015, 14 (6), 488-514. https://pubmed.ncbi.nlm.nih.gov/26403845/
  18. Risk-Based Monitoring Solutions – RACT Template. TransCelerate BioPharma. September 2020, https://www.transceleratebiopharmainc.com/wp-content/uploads/2020/09/8J_RACT-for-Activities_13Mar2015.xlsm