Examining early learnings from approaches used to comply with EMA’s requirement to publish anonymized versions of clinical study reports and other submission documents, including how privacy protection was balanced against data utility.
On Oct. 2, 2014, the European Medicines Agency (EMA) published Policy 0070,1 which required pharmaceutical companies to provide the agency with anonymized clinical trial information subsequent to a decision through the centralized marketing authorization procedure. A two-phase, stepwise approach was adopted, where phase one consists of marketing authorization holders (MAHs) submitting anonymized clinical reports, and phase two consists of MAHs submitting anonymized structured patient level listings. Phase one is already in place and applies to procedures submitted from Jan. 1, 2015. Phase two is expected to commence at some point in the near future. Our focus in this article is only on phase one.
The EMA will then make these clinical reports available for public sharing through its clinical data portal.2 These Policy 0070 submissions must include an anonymization report as well, which describes the methods used to anonymize the clinical reports.
When a manufacturer applies for a centralized marketing authorization, the Committee for Medicinal Products for Human Use (CHMP) provides the (positive or negative) recommendation to the European Commission (EC). The EC grants or refuses the marketing authorization in a centralized procedure. The anonymized clinical reports will be published after the EC decision, or the CHMP decision if there is no EC decision.
In March 2016, the EMA published a detailed set of guidelines for the anonymization of these clinical reports,3 and provided a template for the accompanying anonymization report. In December of the same year, the agency published a set of changes to the guidelines as well as an updated guidelines document.4,5 Starting in October 2016, the agency made these clinical reports, with the accompanying anonymization reports, available on their clinical data portal.2 Data after approximately the first month from launching the portal indicate that 234 academic/non-commercial research users registered on the portal, and 1,017 general users registered.6
The purpose of this article is to provide a descriptive analysis of what we have learned from the data releases over the first three months in terms of approaches to anonymization, how the EMA anonymization guidelines are being implemented by manufacturers, and how patient privacy is balanced against data utility.
The objective of this analysis is to answer three questions:
Background for each is covered in detail ahead.
Data source: Submissions posted on the EMA portal
We examined procedures that had been posted on the portal by the end of January 2017. A summary of the eight procedures we examined is provided in Table 1. Information was extracted from the anonymization reports and the clinical reports to answer the three questions above.
Out of these eight procedures on the portal, three of them did not include any protected patient data (PPD): Caspofungin, Armisarte, and Palonosetron Hospira. These were not novel molecular entities. Therefore, we will not consider these three procedures much further in our analysis. Aripiprazole is a withdrawn application for a marketing authorization, in contrast to the others that received a positive opinion.
A posted procedure (henceforth referred to as “submission”) includes the clinical reports as well as an anonymization report. In our analysis, we reviewed the anonymization reports and performed random checks through the clinical reports to confirm that the actual implementation matched the descriptions in the anonymization report, and looked for counter examples for confirmation.
Currently, once a CHMP opinion is received by the manufacturer, the EMA indicates the deadline for sending their anonymized documents. The agency has said that it is sending notification letters, processing Policy 0070 submissions, and posting the anonymized submissions on the clinical data portal in the chronological order of the opinion dates.7 As can be seen from Table 1, the current tempo of data releases seems to be two submissions a month.1
It would be reasonable to assume that posting submissions on the portal will continue at this rate or accelerate over time. However, there are approximately half a dozen or more CHMP opinions every month. Therefore, unless the EMA is able to publish submissions at a higher rate than two a month, the backlog is likely to grow in size, resulting in longer delays between decision and posting on the portal.
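The backlog arithmetic above can be sketched in a few lines. This is only an illustration, assuming constant rates of six CHMP opinions and two portal postings per month (the rough figures cited above) and a starting backlog of zero; the function name and its parameters are hypothetical and do not describe any EMA process.

```python
def backlog_after(months, opinions_per_month=6, postings_per_month=2, start=0):
    """Return the number of unposted submissions after `months` months.

    Assumes constant monthly rates (an illustrative simplification);
    the backlog is never allowed to drop below zero.
    """
    backlog = start
    for _ in range(months):
        backlog = max(0, backlog + opinions_per_month - postings_per_month)
    return backlog


if __name__ == "__main__":
    for months in (6, 12, 24):
        print(months, backlog_after(months))
```

Under these assumed rates the backlog grows by four submissions a month. Raising `postings_per_month` to match `opinions_per_month` is the only setting in this sketch that stops the growth.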
To accelerate the publication, the agency could increase the staff working on the implementation of Policy 0070. But there have been concerns that the EMA is challenged in hiring new staff.8 In a recent webinar, EMA representatives did indicate that they are staffing the Policy 0070 efforts through internal transitions of new staff, which can be slow because this affects other functions in the Agency.7 Therefore, failing a significant resource ramp-up, process efficiencies will be necessary to avoid a growing backlog and consequent delays.
Approaches to anonymization
The EMA defines anonymization as “The process of rendering data into a form which does not identify individuals and where identification is not likely to take place.”3 At a conceptual level, we can characterize the anonymization approaches described in the EMA guidance as falling into two dimensions: (a) method for meeting EU regulatory requirements, and (b) analytical method. This characterization is descriptive in that it reflects the approaches that are currently in use in the submissions, and prescriptive in that it covers approaches that have been recommended by the EMA (but not yet represented in the submissions on the clinical data portal). Therefore, the two dimensions are a pragmatic coverage of the universe of known approaches to anonymization. These are further clarified below.
Method for meeting EU regulatory requirements
The Article 29 Working Party, which is composed of representatives from EU country data protection authorities and the European Data Protection Supervisor, published an opinion on acceptable anonymization approaches in 2014.9 In general, the Article 29 Working Party opinions provide interpretations of EU data protection statutes.
This opinion describes two general approaches to anonymization. When applied to Policy 0070, the first approach entails the manufacturer meeting the following three criteria:
These criteria have been criticized and an argument has been made that it would be very challenging to produce information that has much utility if they are utilized for anonymization.10 Specifically, the removal of longitudinal data and disabling the capacity for statistical inference from the data severely limits what can be done with the data.
In its guidance. the EMA has alluded to the limitations of these three criteria as well. For example, the Agency noted:
The second approach recommended by the Article 29 Working Party is to perform a risk assessment. This approach receives considerable coverage in the EMA guidance documents.
The second dimension, the analytical method used for determining the appropriate anonymization, can be one of three possibilities:
The qualitative and subjective approaches are also sometimes referred to collectively as “non-analytical approaches” to anonymization in the documentation included in the submissions on the EMA portal.
Analysis of approaches to anonymization
When we cross-tabulate the two dimensions of meeting the regulatory requirements and analytical method, we get a set of six possible approaches that can be used by a manufacturer to demonstrate that the likelihood of re-identification of patients in the anonymized clinical reports is acceptably small, as illustrated in Table 2. The five submissions were classified according to one of the six possible approaches in Table 2.
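The cross-tabulation can be enumerated directly. In this minimal sketch the labels are inferred from the surrounding text (two ways of meeting the regulatory requirements, three analytical methods) and are assumptions for illustration; they may not match the exact wording used in Table 2.

```python
from itertools import product

# Labels inferred from the article text, not copied from Table 2.
REGULATORY_METHODS = ("three criteria", "risk assessment")
ANALYTICAL_METHODS = ("quantitative", "qualitative", "subjective")


def possible_approaches():
    """Return every (regulatory method, analytical method) pairing."""
    return list(product(REGULATORY_METHODS, ANALYTICAL_METHODS))


if __name__ == "__main__":
    for regulatory, analytical in possible_approaches():
        print(f"{regulatory} / {analytical}")
```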
Three types of information have to be identified in order to implement an anonymization scheme and to meet the requirements in the EMA guidance. We will discuss these identifiers in this section.
Patient Information vs Staff Information
The EMA anonymization guidance pertains to patients / trial participants, and the risk assessment that is used would pertain to patient information. However, clinical reports contain information about investigators, sponsors, document authors, and study staff. The EMA guidance requires that the names and sites of the sponsor, coordinating investigator, and the investigators who conducted the study should be kept in the document. Their contact details and signatures should be redacted. All other information on site, sponsor, and vendor staff should be redacted.
Direct identifiers are personal information that can uniquely identify an individual patient. These types of identifiers would need to be either redacted or pseudonymized as part of the anonymization process.
Quasi-identifiers are also considered personal information. Combinations of quasi-identifiers incrementally increase the risk of re-identification. Not all quasi-identifiers need to be transformed in order to protect patient privacy. The sponsor would need to determine which quasi-identifiers for which patients must be transformed to ensure that the risk of re-identification is acceptably small.
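One way to see which patients would need their quasi-identifiers transformed is to count how many records share each combination of values. The sketch below is illustrative only: the records, the choice of quasi-identifiers, and the minimum group size are all made up, and nothing here is taken from a posted submission or prescribed by the EMA guidance.

```python
from collections import Counter

# Hypothetical quasi-identifier values: (age band, sex, country).
RECORDS = [
    ("40-49", "F", "DE"),
    ("40-49", "F", "DE"),
    ("40-49", "M", "DE"),
    ("40-49", "M", "FR"),
    ("50-59", "F", "FR"),
    ("50-59", "F", "FR"),
]


def records_needing_transformation(records, min_group_size):
    """Return indices of records whose quasi-identifier combination is
    shared by fewer than `min_group_size` records in total."""
    keys = [tuple(record) for record in records]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < min_group_size]


if __name__ == "__main__":
    # With all three quasi-identifiers, two records stand alone.
    print(records_needing_transformation(RECORDS, 2))
    # Suppressing the country column leaves no record in a group of one.
    print(records_needing_transformation([r[:2] for r in RECORDS], 2))
```

In this toy data only two of the six records are flagged, and dropping a single quasi-identifier resolves both, which is the sense in which not every quasi-identifier needs to be transformed for every patient.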
Approaches to ensuring data utility
A key objective of Policy 0070 is to ensure that the anonymized clinical reports that are posted on the clinical data portal retain sufficient data utility for secondary analysis. For example, the EMA states:
“Taking into account the need to find the best balance between data utility and achieving an acceptably low risk of re-identification, what EMA ultimately would like to achieve is to retain a maximum of scientifically useful information on medicinal products for the benefit of the public while achieving adequate anonymization.” (Chapter 3, section 5.1)3
Furthermore, the Policy 0070 document itself emphasizes the need to make detailed clinical data available:
“A high degree of transparency will take regulatory decision-making one step closer to EU citizens, and promote better-informed use of medicines. In addition, the Agency takes the view that access to clinical data will benefit public health in future. The policy has the potential to make medicine development more efficient by establishing a level playing field that allows all medicine developers to learn from past successes and failures. Furthermore, it will enable the wider scientific community to make use of detailed clinical data to develop new knowledge in the interest of public health. Access to clinical data will allow third parties to verify the original analysis and conclusions, to conduct further analyses, and to examine the regulatory authority’s positions and challenge them where appropriate.” (Section 4.1)1
While the EMA guidance emphasizes the need to verify original analysis and conclusions, the evidence from voluntary data sharing efforts that have been running over the last few years suggests that the validation of the primary endpoint is an uncommon objective of secondary analysis of clinical trial data, and that the most common purposes for secondary analyses are additional analyses of the treatment effect and the disease state.16 One can also argue, however, that any secondary analysis would start off with replicating published results to verify that the data are correct and understood, even if that replication is not published. Therefore, the EMA does anticipate broad uses of the information that is posted on the portal.
We provide a descriptive summary of how data utility was maximized and assessed in the five submissions on the portal.
Results
Approaches to anonymization
A summary of the anonymization approaches that have been used on the five data releases with PPD is provided in Table 3. As can be seen, no quantitative approaches have been used thus far. The anonymization of the Aripiprazole clinical reports relied on the three criteria from the Article 29 Working Party to justify the anonymization that was applied. The remaining four utilized a non-analytical risk-based approach.
Given the anonymization methods that have been applied, the question is whether it is possible to ensure that the risk of re-identification is sufficiently small. Under a quantitative approach, the EMA guidance has recommended a risk threshold of 0.09, which is consistent with precedents for public data release.12 That threshold value applies to “maximum risk,” which means that the risk of re-identification is measured for each patient in the clinical reports and the maximum value across all of them is taken. This maximum value should be at or below the threshold. With qualitative and subjective risk assessments, the risk is not measured and, therefore, it is not possible to demonstrate quantitatively that the actual risk is below the threshold.
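A maximum risk check can be sketched as follows. The sketch assumes that each patient's risk is taken as one divided by the number of patients sharing the same quasi-identifier values, a common simplification in the statistical disclosure control literature; that formula and the function names are assumptions here, and the EMA guidance should be consulted for how risk is to be measured in practice.

```python
from collections import Counter


def maximum_risk(quasi_identifier_rows):
    """Highest per-patient risk, with each patient's risk taken as
    1 / (number of patients sharing the same quasi-identifier values).

    Raises ValueError if no rows are supplied.
    """
    rows = [tuple(row) for row in quasi_identifier_rows]
    group_sizes = Counter(rows)
    return max(1.0 / group_sizes[row] for row in rows)


def meets_threshold(quasi_identifier_rows, threshold=0.09):
    """True if the maximum risk is at or below the threshold."""
    return maximum_risk(quasi_identifier_rows) <= threshold


if __name__ == "__main__":
    rows = [("40-49", "F")] * 12      # twelve patients share one combination
    print(meets_threshold(rows))      # maximum risk is 1/12
    rows.append(("80-89", "M"))       # a single unique patient
    print(meets_threshold(rows))      # maximum risk is now 1.0
```

Because the measure is a maximum, one unique patient is enough to push the whole document set over the threshold in this sketch, regardless of how well the other patients are protected.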
There are two approaches that can be used to transform the clinical reports to anonymize them: redaction and replacement (also known as re-synthesis). Redaction entails covering the staff information and PPD in the reports with a blue box so that the text below is not visible. Replacement involves replacing the staff information and PPD with other values. For example, a site staff member’s name can be replaced by another name, a subject ID with a pseudonymized subject ID and a date with an offset date. In all of the five relevant submissions posted on the portal, redaction was used. No examples of replacement, as defined earlier, were observed in the clinical reports or documented in the anonymization reports.
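To make the contrast with redaction concrete, replacement of subject IDs and dates can be sketched as below. This is a hedged illustration rather than a description of any manufacturer's tooling: the ID format, the ISO date format, the secret key, and all function names are assumptions, and a real document would need far more robust detection of PPD.

```python
import hashlib
import hmac
import re
from datetime import date, timedelta

SECRET_KEY = b"hypothetical-key"                        # assumption: kept private by the sponsor
SUBJECT_ID = re.compile(r"\b\d{4}-\d{3}\b")             # assumed format, e.g. 1001-042
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")   # assumed format, e.g. 2015-03-02


def pseudonym(subject_id):
    """Map a subject ID to a stable pseudonym using a keyed hash."""
    digest = hmac.new(SECRET_KEY, subject_id.encode("utf-8"), hashlib.sha256)
    return "SUBJ-" + digest.hexdigest()[:8].upper()


def shift_date(match, offset_days):
    """Return the matched ISO date moved by `offset_days` days."""
    original = date(int(match.group(1)), int(match.group(2)), int(match.group(3)))
    return (original + timedelta(days=offset_days)).isoformat()


def replace_identifiers(text, offset_days):
    """Offset every ISO date, then pseudonymize every subject ID."""
    text = ISO_DATE.sub(lambda m: shift_date(m, offset_days), text)
    return SUBJECT_ID.sub(lambda m: pseudonym(m.group(0)), text)


if __name__ == "__main__":
    sample = "Subject 1001-042 enrolled on 2015-03-02 and withdrew on 2015-03-12."
    print(replace_identifiers(sample, offset_days=-7))
```

Because the same ID always maps to the same pseudonym and both dates move by the same offset, the interval between events and the ability to follow a subject through the text are preserved, which is what a blue box removes.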
Furthermore, redaction methods are not going to be 100% accurate in finding PPD. For example, in the Zurampic clinical reports we identified PPD (as defined in the manufacturer’s anonymization report), such as dates, that were not redacted. These kinds of “misses” are inevitable with such large documents. No assessment of the frequency or the impact of the misses on the risk of re-identification was performed.11
Direct and quasi-identifiers
Patient information vs. staff information
As noted earlier, the EMA guidance requires that the names and sites of the sponsor, coordinating investigator, and the investigators who conducted the study should be kept in the document. Their contact details and signatures should be redacted. All other information on site, sponsor, and vendor staff should be redacted.
This is the general approach followed in the documents posted on the clinical data portal, with the exception of the Praxbind and Tarceva submissions. In the former case, the manufacturer argued in the anonymization report that personal investigator information cannot be released without consent, and that no such consent was provided in the contracts and agreements with these individuals. Consequently, all of that information has been redacted as well. The exact types of staff information that are redacted in the different submissions are summarized in Table 4.
The types of direct identifiers that have been used in the current submissions are summarized in Table 5. There is considerable consistency across the four sets of clinical reports in their definitions of what is a direct identifier.
The definitions of quasi-identifiers from the submissions that have been posted on the EMA portal are summarized in Table 6. As can be seen, the definitions are somewhat consistent across the submissions.
Note that because a piece of information is defined as a quasi-identifier, that does not mean that that information was redacted in the documents. We assume that all the quasi-identifiers were considered in the non-analytical risk assessment, and some of them were subsequently redacted. Furthermore, because in some instances large blocks of text were redacted, it was not always possible for us to verify with certainty what type of quasi-identifier was redacted.
The Praxbind analysis relied on the U.S. Department of Health and Human Services criteria for identifying quasi-identifiers:13 “replicability,” “distinguishability,” and “data source availability.” No other specific formal process was used for the selection of the quasi-identifiers across the submissions. However, the resultant list of quasi-identifiers is consistent with the recommendations in the PhUSE standard14 and the Institute of Medicine report.15
Approaches to ensuring data utility
None of the five submissions performed a formal analysis of data utility. However, we can examine the perspectives and arguments that were made by the sponsors with respect to the utility of the anonymized documents.
Some of the submissions posted on the clinical data portal have expressed reservations about the utility of the clinical reports after the redactions that were performed on them, while others made the case that data utility was adequate for the anticipated purposes. The statements made in the published submissions regarding data utility are summarized in Table 7.
The redaction of two types of information in the clinical reports would have a non-trivial impact on data utility: case narratives and subject IDs. This is evidenced by the EMA specifically identifying these two types of transformations in their guidance.
The EMA has noted in their guidance that case narratives should not be redacted:
“Case narratives should not be removed nor redacted in full regardless of their location in the clinical reports (body of the report or listings). They should be, instead, anonymized. Regardless of the anonymization technique used by the applicant/MAH, EMA cannot accept the redaction of the entire case narrative by default (as a rule).” (Chapter 2, Section 2.2)3
There is evidence that narratives help researchers provide a more accurate estimate of harms. The benefits of access to narratives include:18,19,20 identifying inconsistencies in the reporting of SAEs within the CSR and between the CSR and publicly reported data (in clinical trial registries and journal publications); understanding more precisely when adverse events occurred (e.g., before randomization vs. while the patient was receiving the study drug); identifying SAEs that are apparent only from the narratives (e.g., verbatim terms that are not coded); and allowing researchers to revise the coding of verbatim terms to reflect more modern coding practices and dictionaries, or a different interpretation (for example, in terms of reasons for discontinuation in a trial).
Only the Zurampic submission did not completely redact case narratives (Aripiprazole did not have case narratives). Rather, the anonymization report stated that verbatim text was redacted; this was not performed completely in the clinical reports.
Another specific consideration with respect to data utility pertains to the redaction of subject IDs. The EMA has noted that:
“On the other hand, the value of the data is significantly reduced where the ability to follow a patient across visits and events is broken. The risk of linking the information for the same individual can be measured and net effect on risk can be determined.” (Chapter 3, Section 22.214.171.124)3
All five submissions have redacted subject IDs in the anonymized documents.
Discussion
Consistency with the EMA anonymization guidance
In this section we summarize the extent to which the different manufacturers have followed the EMA guidance (see Table 8). Clearly, the EMA has agreed to publish redacted clinical reports where the redaction was inconsistent with its guidance. The question is how we should interpret that to guide future anonymization efforts for Policy 0070.
At this point we can conclude that the methods currently in use reflect the contemporary level of expertise in anonymization within industry. As manufacturers gain experience with the anonymization of clinical reports, and as the re-identification risks specifically of clinical trial information become better understood, the expectation is that more robust methods will be utilized to anonymize.
It was evident from statements made by the EMA that the agency is quite concerned about meeting its transparency objectives and the public’s perception of the agency meeting these objectives. For example, the most recent changes to the EMA guidance require the replacement of pages with a sheet indicating that certain pages were removed, rather than having multiple pages included that are completely redacted.4 The argument made was that the public perception of a large number of redacted or “blacked out” pages would be negative.7
Therefore, it seems that in an effort to maintain the balance between these two objectives, the EMA will not prematurely enforce practices that would limit the ability to share clinical reports, but at the same time encourage companies to improve adherence to the guidelines.
At a recent presentation, the EMA noted that it will publish an annual report summarizing the implementation of Policy 0070. In the report, the agency said it will list the names of the companies that are deemed to be non-compliant.25 Having its name on such a list could cause reputational harm to a company, and this would be the incentive not to be on that list. Arguably, it would be safe to assume that non-compliance means at least: (a) not submitting an anonymization report or an anonymized version of the clinical reports, and (b) not cooperating with the EMA in discussions pertaining to anonymization. It is not clear at this point whether non-compliance with some aspects of the guidance would be another reason to be placed on this list.
Balancing data utility and patient privacy
It is evident from the redaction approaches that have been applied thus far that the manufacturers have erred toward being more conservative and tilting the privacy/utility balance toward protecting patient privacy. A strong focus on privacy is not surprising given that the protection of patient privacy is the responsibility of the manufacturers, and the EMA guidance makes that point:
“Furthermore, the processing of personal data and its publication on the website by EMA is subject to the provisions of Regulation (EC) No 45/2001 and in particular is limited only to information that is adequate, relevant and not excessive for the purpose of transparency. It is important to recall that no personal data of trial participants should be published. […] This guidance document is without prejudice to the obligations of pharmaceutical companies as controllers of personal data under applicable national legislation on the protection of personal data.” (Chapter 3, Section 1)3
While the EMA recognized non-analytical approaches to re-identification risk assessment in its guidelines, no quantitative re-identification risk measurements were made. There is, therefore, no strong assurance that re-identification risk was indeed below a generally accepted threshold (the EMA recommended a quantitative threshold of 0.09, for example3), and there may still be residual risk in the information that was not redacted. Even if there remains theoretical uncertainty as to whether the balancing that has been performed thus far has achieved demonstrable patient privacy protection, given the extensive redaction of PPD that has been applied, it is likely that the risk of re-identification in the currently posted clinical reports is low.
However, the extensive application of redaction would also result in reduced data utility for the shared documents. The EMA recognized this in its guidance, and has made clear that the expectation is that manufacturers will shift away from that approach over time:
“EMA understands that in an initial phase redaction techniques are likely to be used by applicants/MAHs, taking into account that for a certain period, pharmaceutical companies will have to anonymize their data retrospectively […]. Importantly, redaction alone is more likely to decrease the clinical utility of the data compared to other techniques. Therefore, EMA is of the view that applicants/MAHs, after experience has been accumulated in the de-identification of clinical reports, should transition to other anonymization techniques that are more favored in order to optimize the clinical usefulness of the data published […]. Pharmaceutical companies are encouraged to use these anonymization techniques as soon as possible, whilst ensuring data anonymization is achieved.” (Chapter 3, Section 5.1)3
The EMA, therefore, did anticipate that improvements in data utility will be incremental and would happen over time.
Justifying reduced data utility
There were three important points that were made in the posted anonymization reports to justify the conservatism negatively impacting data utility: (a) the newness of anonymization, (b) the clinical reports were already produced, and (c) technological advances require conservatism. We discuss each of these ahead.
A commonly heard argument, and mentioned in some of the posted anonymization reports, is that anonymization methods are new and, therefore, there is a learning curve to applying them to clinical trial information. However, it should be noted that the discipline of statistical disclosure control has been around for a number of decades,26-32 and additional citations to that body of work specific to health data can be found in the Institute of Medicine report on sharing clinical trial data.15 The overall discipline is not new. The application of disclosure control methods to clinical trial data is recent, but there is a very large body of work to draw from which should accelerate the transition of knowledge to solving this problem.
Another argument is that the sharing of anonymized documents publicly is a recent practice. However, in the context of access to information (ATI) or freedom of information (FOI) requests, where citizens can request documents from government departments, the sharing of anonymized documents has been going on for decades. This type of disclosure is effectively a public release of information. Historically, though, government departments have often used redaction, as opposed to re-synthesis, to anonymize the documents that they release pursuant to ATI or FOI requests.
In the anonymization reports for three of the submissions (Zurampic, Kyprolis, and Tarceva) it was noted that the clinical reports were produced before the EMA policy came into effect, and, therefore, the argument was made that the only mechanism available was to redact information in the pre-existing documents. The assumption here is that other anonymization methods described in the EMA guidance (such as pseudonymization and generalization) would have to be applied during the development of the scientific clinical reports and could not be applied to existing clinical reports that were already completed. Based on our experiences, this assumption is not true in that re-synthesis techniques can be applied post facto as well.
A third argument that has been made in some of the anonymization reports is that technological advances would increase re-identification risk for publicly released information. For example, the Praxbind report highlights this as a contextual factor that has influenced the approach to anonymization. The EMA guidance does make the point that technological advances should be considered during the anonymization:
“MAHs/applicants need to take into account (realistic) future developments in terms of availability of data and technologies that would allow identification.” (Chapter 3, Section 3.3)3
“[…] the data controller must continuously follow developments in re-identification techniques, and if necessary, reassess the risk of re-identification. Applicants/MAHs […] will need to take this aspect into consideration and to monitor continuously the development of technologies in this area in order to assess novel risks of re-identification for any future clinical reports published.” (Chapter 3, Section 4.4)3
The EMA recommended the use of “maximum risk” as a measure of re-identification risk for the anonymized clinical reports.3 This metric is already quite conservative in that it assigns a risk level to all of the patients based on the risk level of the highest risk patient. Furthermore, the 0.09 threshold recommended by the agency3 is consistent with the more conservative side of precedents for public data releases.12 Therefore, arguably, the current guidance has already some built-in conservatism to account for technological changes.
In the Zurampic anonymization report, the manufacturer reserved its right to update the published reports if technological advances were deemed to increase the re-identification risk. It is not clear that the EMA would facilitate such an update since that was not something that it had publicly agreed to do.
Proposed remedies for reduced data utility
The submissions have proposed some remedies or made the case that any reduction in data utility would not be detrimental to the usefulness of the clinical reports. These arguments are also summarized in Table 7. We examine these arguments below.
One proposed approach to mitigate reduced data utility mentioned in the Zurampic anonymization report was that qualified researchers can request more detailed information directly from the manufacturers. However, there is evidence that this route does not always work,21 and can be time-consuming (as opposed to just downloading documents immediately from the clinical data portal). An alternative is to request the same documents from the EMA through an ATI request.22 Responses to these requests can also be time-consuming with wide variation in, and sometimes lengthy, response times.23,24 Finally, the EMA has rejected that argument as a justification for reduced data utility as it defeats the EMA’s transparency objectives with Policy 0070.7
The second proposed mitigation for reduced utility clinical reports is that the aggregate data is not redacted and that aggregate data has sufficient utility for secondary analyses. Some limited aggregate data are also available in clinical trial registries and journal publications. Therefore, it is a question whether the incremental aggregate details available in the anonymized clinical reports would allow more innovative analyses compared to what is already possible. Also, recall that narratives have been useful in secondary analysis of clinical reports,18,19,20 which means that at least for some types of studies the aggregate data will not be sufficient.
In this article we provided a descriptive summary of the clinical report submissions that have been published by the EMA pursuant to Policy 0070, and an analysis of key learnings from these. The learnings pertain to the approaches that were used to anonymize the clinical reports, how privacy protection was balanced against data utility, and the extent to which they were consistent with the EMA anonymization guidance.
In general, the current submissions published on the EMA clinical data portal have followed a conservative anonymization approach that emphasized privacy protection over data utility. There is a real need for manufacturers to accelerate the adoption of more sophisticated risk-based and quantitative anonymization techniques that would allow for both higher data utility and strong assurances that patient privacy has been protected, and for the EMA to create the appropriate incentives for this to happen.
Khaled El Emam, PhD, is CEO, Privacy Analytics Inc., Professor, University of Ottawa, and Senior Investigator, Children’s Hospital of Eastern Ontario Research Institute; email: firstname.lastname@example.org
* Acknowledgements: The author would like to thank the many people from Privacy Analytics, QuintilesIMS, manufacturers, and regulators who have reviewed drafts of this article and provided valuable comments, as well as the critical feedback from the anonymous reviewers.