Interexpert Agreement on Adverse Events’ Evaluation

April 21, 2016
Maxim Belotserkovskiy, MD

Maxim Kosov, MD, PhD

Alexey Maximovich

Maria Cecilia Dignani, MD

John Riefler, MD

Eric Batson, MD, PhD

Drug safety surveillance, a core focus of clinical trials, can be influenced by subjective judgement, as shown by this analysis of differing expert assessments of adverse drug reactions and the reasons behind them.

Safety surveillance is one of the core objectives of every clinical trial. It is based on registration and assessment of adverse drug reactions (ADRs), defined as noxious and unintended responses to a medicinal product related to any dose.1 Not all ADRs are attributed to the study drug, and one of the key characteristics of every ADR is its causal relationship, or causality: whether the event in question is obviously related to the study drug, or at least such a relationship cannot be ruled out. All available information about ADRs observed with the drug is captured in the Investigator’s Brochure and, once the drug is approved for use, in the package insert. This makes assessment of causal relationship a key factor, as underestimation may subject potential patients to an unexpected and unnecessary risk, while overestimation might narrow indications and therapeutic use, or lead to dose modification (e.g., decreased dosage), which may in turn decrease efficacy.

The WHO-UMC assessment scale is meant as a practical tool for the assessment of case reports. It is a combined assessment taking into account the clinical-pharmacological aspects of the case history and the quality of the documentation of the observation.2 Other algorithms are more complicated. Such scales have proved their validity, but they share a general disadvantage: they are based on individuals’ medical judgment and are thus subjective. Experts may evaluate similar ADRs differently. Such disagreements have been reported previously and ranged from 51% to 95%.3,4,5 These reports mainly compared judgments made by physicians or experts without analyzing possible reasons for disagreement. Herein, we compared the assessments of 50 ADRs made by four experts from a clinical research organization (CRO medical monitors) and by clinical trial investigators, and analyzed the reasons for differing evaluations.



Four drug safety experts, who were not medical oncologists but had experience monitoring oncology trials, independently analyzed 50 consecutive (in terms of time of registration) blinded serious adverse events (SAEs) from three Phase II-III clinical trials conducted in 2014-2015. The studies involved patients with locally advanced (inoperable) or metastatic non-small cell lung cancer (NSCLC). One of the three trials was placebo-controlled and two were open-label. All patients, including those from the placebo arm, received one or two chemotherapeutic agents. All patients from the open-label studies received an investigational drug as part of their chemotherapeutic regimen. Among the 37 cases from the placebo-controlled trial, 17 of the patients (46%) received the study drug and a chemotherapeutic agent, and 20 (54%) received placebo and a chemotherapeutic agent. The information for the SAEs was summarized in narratives, which included demographic data, medical history, concomitant medications, study treatment with dates (the name of the study drug was blinded), the course of the AE, and the list of expected AEs for the particular study drug. The investigator’s assessment was deleted from the narrative, so the experts were unaware whether or not the investigator had assessed the AE as an ADR.

For each drug-event pair, the questionnaire asked the experts to assess causal relationship based on the following criteria:

  • Time to onset: incompatible/compatible/highly suggestive

  • Dechallenge: positive/negative/not available

  • Rechallenge: positive/negative/not available

  • Alternative cause: probable/ruled out

  • Risk factors: yes/no

  • Previous information on the drug and adverse experience: yes/no

Experts assessed each criterion and also made an overall causality assessment of “related” or “not related.” Events were considered related if a relationship could not be ruled out. The experts also provided the reasons for their evaluation. At the time of evaluation, they were not aware of the investigator’s and sponsor’s assessments.

Experts were CRO medical monitors. Their medical backgrounds included internal medicine, infectious diseases, pediatrics, and intensive care. None of them were medical oncologists, but they had received training in different aspects of oncology and had served as medical monitors in various oncology clinical studies. Three experts (experts A, B, and C) had > 10 years of experience as medical monitors. Expert D had six months of experience as a medical monitor. All four had solid experience in clinical practice, and expert D had 20 years of experience as a clinical investigator.

Cohen’s kappa agreement coefficient was used to assess agreement of each expert with the original reporting investigator. Contingency tables illustrating the agreement were also created. The results (near-zero values, with confidence intervals covering zero in all cases) indicate that the investigators’ assessment is not related to that of any of the experts. This analysis was done for the overall assessment only, because only the overall causality assessment, and not the detailed criteria data, was available for the investigators. Frequencies of concordant (same for all four experts) and discordant (split into 2 vs. 2 and 3 vs. 1 categories) evaluations were summarized for the overall assessment, as well as for the individual criteria. The 3 vs. 1 assessments were further analyzed to identify the expert who disagreed with the other three. Frequencies of being the disagreeing expert were summarized for the overall causality assessment, as well as for the individual criteria. A chi-square test was then used to test the hypothesis that all the experts had equal chances of disagreeing with the three others.
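As a rough illustration of the agreement statistic described above, the following Python sketch computes Cohen’s kappa for one expert against the investigator. The function name `cohens_kappa` and the ten example ratings are hypothetical and are not taken from the study data; the confidence intervals mentioned above are not reproduced here.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same events."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of events given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal proportions.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return math.nan  # both raters used one identical label; kappa is undefined
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical ratings (R = related, N = not related), for illustration only.
investigator = list("RRRRRRNNNN")
expert = list("RRRRNNNNNR")
kappa = cohens_kappa(investigator, expert)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the two raters give the same label to 7 of 10 events, chance agreement is 0.5, and kappa is therefore 0.40. Kappa discounts the agreement that would be expected from the raters’ marginal frequencies alone.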



Adverse events were coded in accordance with […]. The most common system organ classes (SOCs) were respiratory, thoracic, and mediastinal disorders, with 11 events (22%), and general disorders and administration site conditions, with 10 events (20%). The proportion of all SOCs is presented in Table 1.

The disease under study, NSCLC, could explain this distribution of AEs, since the most frequent and expected adverse events in this population are related to the respiratory and cardiac systems. Another contributing factor is that we assessed serious AEs, which are most likely more clinically significant and, thus, more likely to involve vital organs and functions.

The rate of agreement within the group of experts varied according to the criterion under evaluation (see Table 2 below).

All four experts agreed on the causality assessment in 32% of cases and, as expected, the pattern of agreement differed across the causality criteria. Maximal agreement was seen for expectedness (reaction previously reported) and risk factors. Interestingly, assessment of the expectedness criterion was based on data from the Investigator’s Brochure, where all previously registered and expected AEs were listed; nevertheless, in 12% of cases the assessment was not unanimous. Assessment of dechallenge and rechallenge was not informative and in the vast majority of cases was recorded as “not applicable.” Only one AE occurred during infusion of the medication; in this case it was possible to assess a dechallenge effect, which was positive according to all four experts.



There were several AEs where three experts were in agreement, while the fourth had a different opinion. We analyzed the pattern of such disagreement of one expert with three others on a specific AE (see Table 3 below).
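The classification into concordant, 3 vs. 1, and 2 vs. 2 evaluations, and the identification of the lone dissenting expert, can be sketched as follows. This is a minimal illustration: the function name `agreement_pattern` and the example ratings are hypothetical, and the "other" branch is an assumption added to cover three-level criteria such as time to onset, which can split in ways the article does not discuss.

```python
from collections import Counter


def agreement_pattern(ratings):
    """Classify four experts' ratings of one event.

    ratings maps an expert label to that expert's rating.
    Returns (pattern, dissenter); dissenter is None unless the split is 3 vs. 1.
    """
    if len(ratings) != 4:
        raise ValueError("exactly four expert ratings are expected")
    counts = Counter(ratings.values())
    sizes = sorted(counts.values(), reverse=True)
    if sizes == [4]:
        return "concordant", None
    if sizes == [3, 1]:
        # The minority rating belongs to exactly one expert.
        minority = min(counts, key=counts.get)
        dissenter = next(e for e, r in ratings.items() if r == minority)
        return "3 vs. 1", dissenter
    if sizes == [2, 2]:
        return "2 vs. 2", None
    return "other", None  # e.g., a 2-1-1 split on a three-level criterion


# Hypothetical overall causality ratings for a single event.
example = {"A": "related", "B": "related", "C": "related", "D": "not related"}
print(agreement_pattern(example))
```

Counting how often each expert is returned as the dissenter, per criterion, gives the kind of frequencies that a table of lone disagreements summarizes.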

Statistically significant differences were seen in evaluation of time to onset, alternative cause, and overall causality assessment.
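A chi-square goodness-of-fit test of the hypothesis that every expert is equally likely to be the lone dissenter can be sketched with the Python standard library alone. The function name and the counts below are hypothetical, since the actual frequencies are in Table 3 and are not reproduced here. The p-value uses the closed-form survival function of the chi-square distribution with 3 degrees of freedom, which applies only to the four-expert case; with small expected counts the chi-square approximation may be unreliable.

```python
import math


def chi_square_equal_dissent(dissent_counts):
    """Goodness-of-fit test of equal dissent frequency among four experts.

    Returns (statistic, p_value) with 3 degrees of freedom.
    """
    if len(dissent_counts) != 4:
        raise ValueError("the closed-form p-value below assumes four experts")
    total = sum(dissent_counts)
    if total == 0:
        raise ValueError("at least one 3 vs. 1 split is required")
    expected = total / 4  # equal chance of being the lone dissenter
    statistic = sum((obs - expected) ** 2 / expected for obs in dissent_counts)
    # Survival function of the chi-square distribution with 3 degrees of freedom.
    p_value = (math.erfc(math.sqrt(statistic / 2))
               + math.sqrt(2 * statistic / math.pi) * math.exp(-statistic / 2))
    return statistic, p_value


# Hypothetical counts of lone dissents for experts A, B, C, and D.
stat, p = chi_square_equal_dissent([2, 3, 4, 15])
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

A small p-value in such a test would suggest that lone dissents are not spread evenly across the experts, which is the pattern the article reports for time to onset, alternative cause, and overall causality.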

We did not analyze agreement between the experts and investigators in assessment of separate causality criteria, as only the overall causality assessment was available from the clinical investigators. Overall agreement between all four experts and the investigators in assessment of AEs was 32% (16 cases). However, agreement was much higher for experts A, B, and C, who had >10 years of experience as medical monitors, at 70%-74%, while for expert D it was only 44% (see breakdown in Table 4 below).

Based on the above, expert D seems to be more liberal in attributing causality as likely related to study drug. This may partially be explained by a difference in perspective, since this expert has been a clinical investigator.

Causality assessments made by the sponsors concurred with those of the investigators in 48 cases out of 50 (96%). In one of the two cases where they differed, the experts’ assessment was the same as the sponsor’s; in the other, two experts concurred with the sponsor and two with the investigator.

After unblinding of treatment assignments from placebo-controlled trials, it turned out that in four cases out of 20 (20%) where placebo had been given, the investigators assessed causality as “related.” Interexpert agreement in evaluating placebo group also was not unanimous and AEs were assessed as “related” in 1 (expert A), 4 (expert B), 5 (expert C), and 14 (expert D) cases.
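The counts above are easier to compare as proportions of the 20 placebo cases. The short Python sketch below does that arithmetic; the counts come from the paragraph above, while the variable and function names are illustrative only.

```python
PLACEBO_CASES = 20  # SAEs in patients later found to have received placebo

# Number of placebo-arm events rated "related", as reported in the text.
related_counts = {
    "investigators": 4,
    "expert A": 1,
    "expert B": 4,
    "expert C": 5,
    "expert D": 14,
}


def percent_related(count, total=PLACEBO_CASES):
    """Share of placebo-arm events rated as related, in percent."""
    return 100 * count / total


for rater, count in related_counts.items():
    print(f"{rater}: {count}/{PLACEBO_CASES} = {percent_related(count):.0f}%")
```

Expressed this way, the share of placebo-arm events rated as related ranges from 5% (expert A) to 70% (expert D), with the investigators at 20%.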




More than 30 drugs have been recalled from the US market since the 1970s, some after being in use for several decades. According to the FDA, “A drug is removed from the market when its risks outweigh its benefits. A drug is usually taken off the market, because of safety issues that cannot be corrected, such as when it is discovered that the drug can cause serious side effects that were not known at the time of approval.”6 Darvocet (propoxyphene), an opioid pain reliever, was recalled in 2010 after being on the market for 55 years, due to serious cardiotoxicity; over 2,000 deaths were reported between 1981 and 1999.7 Duract (bromfenac), a non-steroidal anti-inflammatory drug, was withdrawn in 1998 after being on the market for a few months, due to significant hepatotoxicity; four deaths resulted and eight patients required liver transplants.8 It seems evident that the more comprehensive the information about possible ADRs obtained during clinical development, the more likely the drug is to remain on the market. Thus, careful assessment of AEs, and particularly of causality, is of great importance.

Methods used for assessment of causality can be grouped into three categories: expert judgement, probabilistic approaches, and algorithms or scales.9 Despite numerous methods, there is still no “gold standard” and, to a large extent, assessment is based on physicians’ evaluation. In clinical trials, a probabilistic approach is used, most frequently the WHO-UMC system and its modifications.2 Early studies showed poor agreement between experts in assessment of causality of ADRs. Several assessment criteria are addressed in a “yes-no” fashion: temporal relationship, alternative explanation, dechallenge, and rechallenge.

There are six possible outcomes for causality assessments in the WHO-UMC system: certain, probable, possible, unlikely, unclassified, and unassessable. “Certain” has a unique requirement: the “event must be definitive pharmacologically or phenomenologically, i.e., an objective and specific medical disorder or a recognized pharmacological phenomenon.” An example is “grey baby syndrome,” which is a rare but serious side effect that occurs in newborn infants (especially premature babies) following the intravenous administration of the antimicrobial chloramphenicol due to the lack of a hepatic enzyme to metabolize the drug. This criterion is applicable only in cases of already known, previously observed, specific reactions. Such knowledge is not common in clinical trials, especially at early stages of clinical development. However, certain types of reactions specific for a class of medications could meet this criterion during a clinical trial.

For regulatory purposes, the six WHO-UMC causality categories are collapsed into just two categories: related and not related. Certain, probable, and possible map to “related,” and unlikely, unclassified, and unassessable make up “not related.” It is difficult to distinguish between probable and possible, and investigators frequently do not consider these distinctions. Accordingly, we decided to simplify the analysis and classify AEs as related or not related. We also modified the assessment criteria, adding risk factors and previous information on the drug to the traditional criteria of time to onset, dechallenge/rechallenge, and alternative cause. Evaluation of temporal relationship is based on a general assumption that it takes between five and six half-lives for a medication to be eliminated from the body; consequently, it is generally accepted that one can exclude a temporal relationship with a medication if the time since the last administration is ≥ 5 × T1/2. However, this approach is frequently not observed in clinical practice and is a matter of approximation. Another limitation is possible variation in pharmacokinetic parameters, including half-life, due to pathologic conditions and drug-drug interactions. Khan et al. showed that patients with severe or acute respiratory disorders generally use multiple drugs and have increased susceptibility to ADRs.10
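Two of the rules described above, the collapse of the six WHO-UMC categories into two and the five-half-life cut-off for excluding a temporal relationship, can be sketched in Python. This is an illustration under simplifying assumptions, not a validated tool: the names are hypothetical, and the elimination model assumes simple first-order kinetics with a fixed half-life, which the paragraph itself notes is only an approximation.

```python
# Collapse of the six WHO-UMC categories into the two regulatory categories.
WHO_UMC_TO_BINARY = {
    "certain": "related",
    "probable": "related",
    "possible": "related",
    "unlikely": "not related",
    "unclassified": "not related",
    "unassessable": "not related",
}


def fraction_remaining(n_half_lives):
    """Fraction of drug left after n half-lives, assuming first-order elimination."""
    return 0.5 ** n_half_lives


def temporal_relationship_excluded(hours_since_last_dose, half_life_hours, n_half_lives=5):
    """True if at least n half-lives have elapsed since the last administration."""
    return hours_since_last_dose >= n_half_lives * half_life_hours


print(WHO_UMC_TO_BINARY["possible"])           # related
print(fraction_remaining(5))                   # 0.03125
print(temporal_relationship_excluded(60, 12))  # True: 60 h >= 5 x 12 h
```

Under this model about 3% of the drug remains after five half-lives and about 1.6% after six, which is the reasoning behind the rule of thumb.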

Positive dechallenge and rechallenge could theoretically be considered “a gold standard” of causality, but in practice their utility is very limited. Dechallenge is mainly useful for infusion reactions, when stopping the drug may, or may not, lead to a quick resolution of clinical signs and symptoms. It is not useful when an ADR occurs some time after drug administration, sometimes days or weeks later. Dechallenge also may not be applicable in single-dose studies or for long-lasting reactions, such as hepatotoxicity or congenital anomalies. Rechallenge rarely occurs, due to ethical concerns. An analysis of the performance of the CIOMS scale in the Spanish DILI (drug-induced liver injury) Registry showed that rechallenge data were absent in >95% of all cases.11 For drugs that are administered with a certain periodicity, recurrence of the ADR after one of the subsequent administrations may reflect a positive rechallenge. At the same time, there are two main limitations. First, the incidence of even drug-related reactions is not 100%, so an ADR seen after one administration will not necessarily recur after the next one. Second, some drugs are given once weekly or even less often (e.g., chemotherapeutic medications), so this criterion can be assessed only retrospectively. In our study, only one ADR out of 50 was assessed for dechallenge; it was a psychiatric disorder that occurred during drug infusion and resolved after the drug was discontinued. As the relationship was obvious, the drug was not resumed and thus rechallenge was not applicable.

By “previous information on the drug and adverse experience,” we meant information about the drug that was available at the time of the ADR. For investigational drugs, investigators receive this information from a periodically updated Investigator’s Brochure. It is usually updated annually, so for many shorter studies updates occur only after study completion. Another source of ADR-related information was individual case safety reports (ICSRs), delivered to investigators as MedWatch 3500A or CIOMS-I forms during study conduct. However, these forms are distributed only for events that have already been considered SUSARs (suspected unexpected serious adverse reactions). Events that were considered not related are not reported to investigators.

Interestingly, in 20% of the cases in which the patient was later found to have received placebo, causality was assessed as related by the investigators, and discordance between the experts was also high. A detailed analysis of this phenomenon is beyond the scope of this article, but we can theorize that in placebo-controlled trials, evaluation of AEs should be made with more caution, as there is a chance that a patient was receiving a placebo instead of an active drug, and unblinded safety data received during the study cannot adequately characterize the safety profile of a drug.

Divergences between experts in ADR causality assessment have been reported previously. Arimone et al. compared the judgments of five experts using a visual analogue scale (VAS) and confirmed marked interexpert disagreement (kappa=0.20).12 In the study of Karch et al., agreement between three clinical pharmacologists in assessing ADRs was 50%, and complete agreement between them and the treating physicians was slightly lower, at 47% of cases.13 Arimone et al. also reported that kappa indices for agreement between experts on the individual causality criteria ranged from 0.12 to 0.38.14 In the study of Louik et al., four experts who rated 50 case reports, at first using only general guidelines, showed poor agreement.15 The influence of subjective judgement was shown by Miremont et al., who compared physicians’ opinions with the scores obtained by a causality assessment method.16 They showed that physicians more frequently assessed causality as “likely” and “very likely” related, and complete agreement between physicians and the causality assessment method was achieved in only 6% of cases. Previously, we showed discordance in causality assessment between investigators and a retrospective evaluation with the Naranjo algorithm.17



In our present study, an expert with extensive experience as an investigator and less experience as a medical monitor showed significantly lower agreement with investigators than the three other experts, whose rate of agreement with investigators was 70%-74%. Actual agreement is better reflected by the kappa scores, which were all low, suggesting that investigators and experts are probably using different approaches. The high percentage of concordant records does not indicate true agreement; rather, it indicates that both the expert and the investigator considered the majority of events “related.” The criteria of time to onset and alternative cause of the event were the most frequent reasons for disagreement between the experts. On the other hand, the criterion of previously reported ADR, which was based on factual information (the Investigator’s Brochure), caused very few discordant evaluations. The overall agreement on causality assessment was low, at 32%.
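The point that a high percentage of concordant records can coexist with a near-zero kappa is easiest to see with numbers. The Python sketch below uses a hypothetical 2x2 table for 50 events (not the study’s actual counts, which are in Table 4) in which both raters call most events “related.” The function name and arguments are illustrative.

```python
def agreement_and_kappa(both_related, only_investigator, only_expert, both_not_related):
    """Percent agreement and Cohen's kappa from a 2x2 contingency table."""
    n = both_related + only_investigator + only_expert + both_not_related
    if n == 0:
        raise ValueError("the table is empty")
    p_observed = (both_related + both_not_related) / n
    # Marginal proportion of "related" ratings for each rater.
    inv_related = (both_related + only_investigator) / n
    exp_related = (both_related + only_expert) / n
    # Agreement expected by chance from the marginals alone.
    p_chance = inv_related * exp_related + (1 - inv_related) * (1 - exp_related)
    if p_chance == 1:
        raise ValueError("kappa is undefined when both raters use a single label")
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


# Hypothetical table for 50 events; not the study's actual counts.
p_obs, kappa = agreement_and_kappa(34, 6, 8, 2)
print(f"agreement = {p_obs:.0%}, kappa = {kappa:.2f}")
```

In this made-up table the raters concur on 72% of events, yet because each rates roughly 80% of events as related, chance alone would produce about 70% concordance, and kappa is only about 0.05.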

Consensus between experts is difficult to obtain when evaluations are based on personal judgements of unspecified criteria. Developing methods of standardized assessment could solve this problem. In an attempt to minimize the subjective component, several algorithms for evaluation of ADR causality have been proposed: the Naranjo criteria, the Kramer algorithm, the Jones algorithm, the Yale algorithm, and several others.18 However, existing algorithms also show significant disagreement in assessing causality of the same ADRs, with the most frequent discordance seen in evaluating timing of the event, dechallenge, and alternative cause.19 The most frequently used causality assessment algorithm is the Naranjo tool, which is simple and convenient, but its validity has been variably assessed; some studies demonstrated good agreement,20 while others questioned its reliability.9,21

Considering the low validity and reproducibility of general methods for assessing ADRs, several groups have attempted to develop algorithms for evaluating specific types of ADRs. Several algorithms have been suggested for causality assessment of drug-induced liver injury (DILI): Maria and Victorino, DDW-J, and CIOMS-RUCAM (Roussel Uclaf Causality Assessment Method).22,23 At the same time, causality in different pathological conditions (e.g., lung cancer and Crohn’s disease) is assessed using the same general approaches. Both patient characteristics (e.g., age, comorbidities, and concomitant treatment) and disease characteristics (e.g., tumor burden, specific laboratory parameters) may influence the incidence and severity of ADRs. At present, none of the existing causality assessment tools takes any of them into account.



Our study confirmed that the overall agreement between clinical investigators and drug safety experts is low. Disagreements between experts may be due to differences in clinical background, perspective, and expertise. At the same time, we observed disagreement even in assessment of such an objective criterion as previously reported reactions. We conclude that assessment of AE causality is subjective and influenced by the individual judgement of the investigator or expert and, thus, does not by itself reflect the safety profile of the drug. It is therefore reasonable for safety assessments to be made on aggregate data.


Maxim Kosov, MD, PhD, is Director of Medical Monitoring and Consulting, PSI CRO, United States; Alexey Maximovich is Group Leader, Biostatistics, PSI CRO, Russia; John Riefler, MD, is Director of Medical Monitoring, PSI Pharma Support America; Eric Batson, MD, PhD, is Director of Medical Monitoring and Consulting, PSI Pharma Support America; Maria Cecilia Dignani, MD, is Medical Officer, PSI CRO, Argentina; Maxim Belotserkovskiy, MD, is Senior Director of Medical Affairs, PSI CRO Deutschland, a subsidiary of PSI CRO AG.



1. Glossary of terms used in Pharmacovigilance. Accessed from:

2. The use of the WHO-UMC system for standardized case causality assessment. Accessed from:

3. Belhekar Makesh, Taur Santosh, Munshi Renuka. A study of agreement between the Naranjo algorithm and WHO-UMC criteria for causality assessment of adverse drug reactions. Indian J Pharmacol. 2014; 46 (1): 117-120, doi: 10.4103/0253-7613.125192

4. Macedo Ana Filipa, Marques Francisco Batel, Ribeiro Carlos Fontes, Teixeira Frederico. Causality assessment of adverse drug reactions: comparison of the results obtained from published decisional algorithms and from the evaluations of an expert panel. Pharmacoepidemiol Drug Saf. 2005; 14: 885-890, doi: 10.1002/pds.1138

5. Son M, Lee Y, Jung H, et al. Comparison of the Naranjo and WHO-Uppsala Monitoring Center criteria for causality assessment of adverse drug reactions. Korean J Med. 2008; 74: 181-187

6. How does FDA decide when a drug is not safe enough to stay on the market?

7. FDA Drug Safety Communication: FDA recommends against the continued use of propoxyphene.

8. Hunter Ellen, Johnson Philip, Tanner Gordon et al. Bromfenac (Duract)-associated hepatic failure requiring liver transplantation. Am J Gastroenterol. 1999; 94: 2299-2301, doi: 10.1111/j.1572-0241.1999.01321.x

9. Agbabiaka TB, Savovic J, Ernst E. Methods for causality assessment of adverse drug reactions: a systematic review. Drug Saf. 2008; 31:21-37

10. Khan Amer, Adil Mir, Nematullah K. et al. Causality assessment of adverse drug reaction in pulmonology department of a tertiary care hospital. J Basic Clin Pharmacy. 2015; 6: 84-88, doi: 10.4103/0976-0105.160744

11. Andrade Raul, Robles Mercedes, Lucena M Isabel. Rechallenge in drug-induced liver injury: the attractive hazard. Expert Opin Drug Saf. 2009; 8: 709-714

12. Arimone Yannick, Begaud B, Miremont-Salame G. et al. Agreement of expert judgment in causality assessment of adverse drug reactions. Eur J Clin Pharmacol. 2005; 61: 169-173

13. Karch F, Smith C, Kerzner B. et al. Adverse drug reactions – a matter of opinion. Clin Pharmacol Ther. 1976; 19: 489-492

14. Arimone Yannick, Miremont-Salame Ghada, Haramburu Francoise et al. Inter-expert agreement of seven criteria in causality assessment of adverse drug reactions. Br J Clin Pharmacol. 2007; 64: 482-488, doi: 10.1111/j.1365-2125.2007.02937.x

15. Louik Carol, Lacouture Peter, Mitchell Allen et al. A study of adverse reaction algorithms in a drug surveillance program. Clin Pharmacol Ther. 1985; 38: 183-187

16. Miremont G, Haramburu F, Begaud B. et al. Adverse drug reactions: physicians’ opinions versus a causality assessment method. Eur J Clin Pharmacol. 1994; 46: 285-289.

17. Kosov Maxim, Riefler John, Belotserkovskiy Maxim. Facing adversity. Eur Pharm Contractor. 2015; Jul: 32-35

18. Srinivasan R, Ramya G. Adverse drug reaction – causality assessment. Int J Res Pharm Chem. 2011; 1: 606-612

19. Pere J, Begaud B, Haramburu F, Albin H. Computerized comparison of six adverse drug reaction assessment procedures. Clin Pharmacol Ther. 1986; 40: 451-461

20. Kane-Gill S, Forsberg E, Verrico Margaret, Handler S. Comparison of three pharmacovigilance algorithms in the ICU settings: a retrospective and prospective evaluation of ADRs. Drug Saf. 2012; 35: 645-653, doi: 10.2165/11599730-000000000-00000

21. Kane-Gill Sandra, Kirisci Levent, Pathak Dev. Are the Naranjo criteria reliable and valid for determination of adverse drug reactions in the intensive care unit? Ann Pharmacother. 2005; 1823-1827

22. Garcia-Cortes Miren, Stephens Camilla, Lucena M. et al. Causality assessment methods in drug induced liver injury: strengths and weaknesses. J Hepatol. 2011; 55: 683-691

23. Regev Arie, Seeff Leonard, Merz Michael et al. Causality assessment for suspected DILI during clinical phases of drug development. Drug Saf. 2014; 37: 47-56, doi: 10.1007/s40264-014-0185-4