Sharing Anonymized and Functionally Effective (SAFE) Data Standard for Safely Sharing Rich Clinical Trial Data

Applied Clinical Trials, 08-01-2022
Volume 31
Issue 7/8

Presenting how data transformation is measured as part of the SAFE Data Standard, which data variables influence the rating, and how the appropriate level of data transformation is calculated.


The need for clinical trial transparency and data sharing

The clinical trial transparency landscape has been evolving, with rising expectations of openness and disclosure by all trial sponsors. Transparency brings benefits to trial participants, clinical trial sponsors, regulators, the scientific community, and, ultimately, patients.1-4

Disclosing trial information can inform funders and researchers on what trials are needed, enable better decision making by those who use evidence from trials, support a more robust system of quality, and foster trust between the public, sponsors, and regulators. Moreover, reuses of trial data by sponsors can improve the speed and effectiveness of future R&D.5 For example, combining data from trials can enable meta-analysis and the evaluation of new hypotheses, support improvements to trial design, foster innovation using artificial intelligence and machine learning, and enable other analysis and data science (eg, supporting pharmacoepidemiology by identifying participant groups who are at increased risk of an adverse event associated with a drug).

Research on the views of trial participants supports efforts to facilitate the sharing of clinical trial data. Although neither trial participants nor the public were consulted for this paper, participants generally believe that reuse of their data will speed up the research process and lead to greater scientific benefit.2 They are also generally comfortable when trial data is reused and even expect that this will happen.

Deciding whether and when to share data

In some cases, disclosing anonymized trial information is required for market authorization, such as clinical study document publication under the European Medicines Agency (EMA) Policy 0070 and Health Canada’s Public Release of Clinical Information (PRCI); however, in most cases, sharing or reusing anonymized trial data is voluntary and remains an important consideration for the sponsor.6 Sponsors should consider the benefits and ethical considerations to data sharing, recognizing reputational benefits and risks. Sharing and reusing data can earn trust with stakeholders by bringing new life to trial data, furthering trial transparency and supporting health research while easing the burden on trial subjects.7 At the same time, ethically questionable uses of data can erode trust, even when privacy is protected with any identifying information of trial participants removed.

The SAFE Data Standard (a standard for Sharing Anonymous and Functionally Effective data) focuses on protecting privacy to share non-identifiable clinical trial data for ethical secondary uses and disclosures. Anonymization provides a key privacy enhancing tool to enable responsible data sharing, and trial sponsors should ensure such data are otherwise ethically used and consistent with broader organizational objectives.8

Anonymization to enable responsible data sharing

To share or reuse trial data while protecting the privacy of trial participants, trial sponsors first anonymize the data.4,9 While the primary focus of this SAFE Data Standard is on structured individual participant data given its analysis-friendly format, the term “data” and the use of the SAFE Data Standard can apply more broadly to other information collected or produced during a clinical trial, including clinical study documents. As an example, under the EMA’s Policy 0070 and Health Canada’s Public Release of Clinical Information,10,11 trial sponsors are required to anonymize clinical study documents so that they can be published on the EMA and Health Canada data access portals, respectively.

Statistical (or quantitative risk-based) anonymization measures the probability of re-identifying individuals through indirectly-identifying pieces of information—such as demographic information, medical history, and medical event dates—and then reduces this probability through the use of various data transformations, such as shifting dates, generalizing disease classifications or demographic values, or removing (suppressing) outlier values in the data.12

The anonymization process renders data non-identifiable, such that the probability of re-identifying trial participants in the data is rendered sufficiently low.13,14 Identifiability can be viewed along a spectrum.15,16 As the data are increasingly transformed, the identifiability of the data is gradually reduced until it reaches a level that is below the applicable anonymization threshold. At this point, the data are no longer identifiable. The appropriate threshold is determined based on data disclosure precedents, industry benchmarks, and/or regulatory guidance. For example, both the EMA and Health Canada recommend a threshold of 0.09,11,17 which is equivalent to having at least 11 similarly looking individuals in every group based on the well-understood and adopted concept of cell-size rules and k-anonymity.18
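The link between a cluster size and its tolerance can be checked in a few lines of Python. This is a minimal sketch: it assumes the tolerance is the reciprocal of the cluster size and that the published 0.09 is 1/11 rounded to two decimal places. The function name is hypothetical.

```python
def tolerance_for_cluster_size(cluster_size: int) -> float:
    """Chance of picking the right person out of a group of look-alikes."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return 1 / cluster_size


# Assumption: the 0.09 threshold is 1/11 rounded to two decimals.
for k in (10, 11, 12):
    print(k, round(tolerance_for_cluster_size(k), 2))
```

Under that reading, only a cluster size of 11 rounds to 0.09; sizes of 10 and 12 round to 0.1 and 0.08.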

Many of the publicized re-identification attacks pertain to data that were minimally transformed or pseudonymized, with no other controls in place.19-21 These examples demonstrate potential vulnerabilities and, as with any scientific discipline, serve as evidence to inform and evolve the field.22 Statistical anonymization in consideration of all data variables and applicable technical and organizational controls is consistent with best practices and regulatory guidelines. Novartis commissioned a motivated intruder test to evaluate the strength of anonymization privacy protection for EMA Policy 0070, publishing results with zero high confidence matches despite high effort per record expended.23

The level of identifiability in the trial data is determined by the similarity of participants in the data compared to the population. But contextual factors also matter. The more data a researcher has available to link or combine with the trial data, and the less restricted the use and environment of the trial data, the more likely re-identification becomes.

This generalized concept of breaking identifiability down into sub-dimensions (eg, data and context) along a spectrum has been used in disclosure control to guide data access decision making for years.24-27 For example, the Five Safes28 framework has been widely adopted and can be used in a process of anonymization to balance multiple considerations in making data safely accessible.29,30 Extending the concept to clinical trial data disclosure, particularly given the global context in which clinical trials are conducted across privacy jurisdictions, can empower organizations to make decisions more efficiently and consistently. Despite a wide range of global stakeholders in the clinical trial data sharing landscape, the multiple contexts in which clinical trial sponsors share these data are relatively common and consistent across the industry, with shared data-sharing platforms and portals commonly used by many sponsors to enable researcher access to these data. These shared interests create opportunity for greater standardization. With the public benefits of making clinical trial data available for reuse,1-5 alongside inherent challenges with anonymizing these data types (for example, a typical study can have thousands of intricately correlated identifiable variables about each trial participant in the structured participant data), there is a practical need for a common framework or shared standard for sponsors to use in coordination with research platform hosts.4,31

As part of the anonymization process, one must consider the likelihood of an opportunity (including an attack) to identify data subjects.32 This involves assessing potential threats, or all the means reasonably likely to be used to identify data subjects, including deliberate re-identification attacks, inadvertent recognition of a close acquaintance by someone using the data, and data breaches (all referred to as “re-identification opportunities”).33 The re-identification opportunities over time should also be contemplated, with data retention, disclosure, and periodic assessments all being important considerations.

To support standardization, a process and framework for modelling data identifiability is needed to address a range of contextual re-identification opportunities. There are different ways in which identifiability can be modelled, and we opt to provide a conceptual representation of previously published and adopted statistical anonymization methodologies for measuring and managing re-identification risk. Industry consortia, such as PHUSE,34 TransCelerate35 and the Clinical Research Data Sharing Alliance (CRDSA),36 play a role in promoting standardization in the exchanges of clinical trial data and may advance this conceptual representation to help meet practical implementation needs.

The conceptual framework and basis for standardization we introduce for the SAFE Data Standard can be extended to other forms of data, such as the outputs from remote query systems or synthetic data, assuming that privacy metrics can be established and enforced under varying contexts. As an example, Stadler, Oprisanu and Troncoso recently evaluated the use of differential privacy and found that the principles for assessing synthetic data are similar to those followed when assessing transformation methods for anonymization. The authors thus demonstrate empirically that synthetic data does not provide a better trade-off between privacy and utility than transformation techniques to anonymize data.37 The practical goal in all cases is to identify the disclosure contexts shared frequently across clinical trial sponsors and align privacy metrics to the contextual risks associated with each, for consistency and greater standardization in how data are shared across these contexts.

The process we adopt for measuring and managing identifiability is described by the equation Data x Context = Identifiability, where Data is the probability given a re-identification opportunity, and Context is the probability of a re-identification opportunity. This conditional probability establishes the level of identifiability of a data set in a particular context. Moreover, we can define an inequality based on an identifiability threshold that a data set must not exceed to be deemed anonymized, using Data x Context ≤ Threshold.
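As a rough illustration, the equation and the inequality can be written as two small Python functions. The function names and the numeric inputs are illustrative assumptions, not values taken from the standard.

```python
def identifiability(data_prob: float, context_prob: float) -> float:
    """Data x Context: both inputs are probabilities in [0, 1]."""
    for name, p in (("data_prob", data_prob), ("context_prob", context_prob)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    return data_prob * context_prob


def is_anonymized(data_prob: float, context_prob: float, threshold: float) -> bool:
    """True when Data x Context does not exceed the threshold."""
    return identifiability(data_prob, context_prob) <= threshold


# Hypothetical inputs: the same context with two levels of data transformation.
print(is_anonymized(data_prob=0.5, context_prob=0.4, threshold=0.09))
print(is_anonymized(data_prob=0.2, context_prob=0.4, threshold=0.09))
```

With these made-up numbers, the less transformed data set (0.5) exceeds the threshold, while the more transformed one (0.2) falls below it in the same context.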

As previously highlighted, this process we use to measure and manage identifiability as a basis for the SAFE Data Standard can be extended in practice to accommodate additional measures of privacy. Within this context, the tolerance and “threshold” concept can be interpreted more broadly as a similarity metric and further extended to other privacy metrics,38 such as those applicable to other techniques for disclosure control (eg, differential privacy for remote query systems or synthetic data).39 The process can also be augmented with inspiration from other industry implementations of data disclosure management,40,41 once there is an established baseline and framework within which standardization in the sharing and reuse of clinical trial data can be adopted globally. We prioritize pragmatism in the SAFE Data Standard given the inherently subjective nature of disclosure control and diminishing returns in practice from advanced usage of objective measures.42 In balancing pragmatism with the need for an objective basis for a rating system, we hope to further develop the proposed standard should it be effective in enabling data sharing consistency and efficiency.

The specific factors considered for the inequality on identifiability described above are illustrated in Figure 1.

Figure 1. Factors evaluated to anonymize data within the applicable and appropriate threshold, where certain factors (eg, motives and capacity) are influenced by contractual controls and training obligations.

Standardizing anonymization levels to make transparency more efficient and reliable

The need for a standard: consistency in the degree to which data are transformed

As trial data are increasingly shared and reused, a variety of data sharing platforms have been implemented to facilitate the process. Some were implemented to enable open access to trial documents, such as the EMA and Health Canada portals.11,17 Some were implemented to enable data sharing by sponsors with independent researchers, such as the Vivli platform,43 Yale’s Open Data Access (YODA) Project,44 and the Clinical Study Data Request consortium (CSDR).45 Others were designed to foster collaboration among pharmaceutical sponsors, such as TransCelerate.

Because contextual factors—such as platform security and enforceable terms of use—influence the likelihood of re-identification, the degree to which data are transformed in the anonymization process should be commensurate with these controls. However, without a common standard to define the degree of transformation, sponsors and platforms may adopt inconsistent methods, potentially resulting in unnecessary erosion of data utility or weaker privacy protection than needed.

Inconsistencies in the availability of individual participant data for meta-analyses have been documented, with anonymization cited as one of the barriers.46,47 The SAFE Data Standard may ease the data sharing burden for sponsors and bring greater consistency to data shared with researchers for quality meta-analyses. It can also minimize erosion of data utility through the process of anonymization, an important concern for secondary research,48,49 ultimately promoting greater outcomes in the reuses of these data.

Proposed data transformation rating

To promote standardization and efficiency in the sharing of data, this paper proposes a SAFE Data Standard rating corresponding to a certain level of data transformation that can be used to quickly align stakeholders and effectively protect privacy. Because the design of a data-sharing portal (eg, security controls) and terms of use remain relatively constant over time for a single data platform, and certain characteristics of clinical trial data are constant, the level of data transformation needed to protect privacy can be standardized along a common scale from 0 to 5, where 0 is the raw trial data (often referred to as “coded” data due to clinical trials being blinded) and 5 is data transformed to the full extent required for access under an open data license or similar terms of use (eg, publication on EMA50 or Health Canada transparency portals).

This concept is illustrated in Figure 2, with each rating having a defined context and degree of data transformation described further in the following sections. While data utility remains higher with statistical anonymization than with traditional methods such as redaction, the relative decrease in data utility reflects the degree to which data are transformed to compensate for an absence of other mitigations (such as security controls).

Figure 2. Description of the SAFE Data Standard rating system, where data rated from 1 to 5 has been anonymized to reflect the context of data disclosure, with an increasing degree of data transformation and associated impact on data utility.

To maintain an adequate level of privacy, each level of data transformation on the 5-point scale also requires appropriate data protection measures, such as security and privacy controls and user contracts. The less the data are transformed, the greater the protections they will require. For each level on the scale, the standard specifies the appropriate measures for protecting the privacy of participants in the data. (Each of these levels is further defined in Figure 6 in the final section.)

If the data were made public without terms of use (eg, posted publicly on Google with no published terms of use), the data would have even less contextual protection than what is specified by a level 5 rating. The complete public release scenario is not addressed by the SAFE Data Standard, though it may warrant transformations greater than those recommended for level 5. If those accessing data do not agree to any terms of use, the data become even more susceptible to demonstration attacks. Demonstration attacks are typically launched by the media or academics striving to prove that re-identification is possible.51,52 Given that an equivalent level of transparency can be attained through approaches adopted by the EMA and Health Canada, publishing clinical trial data with no terms of use is not typically required or recommended.

The 5-point rating is valuable because it communicates to all viewers not only the level of transformation to the data set itself, but also the protection measures that one would expect to find on a platform with a given rating. Moreover, the standard specifies the protection measures that platforms must implement to accommodate data at a particular utility level while maintaining adequate privacy. The result is a simplified concept of a numeric rating that can quickly be used to classify a data sharing platform.

Defining the appropriate level of data transformation

How data transformation is measured

Each rating level prescribes an appropriate level of data transformation along two dimensions. The first dimension is the strict data tolerance, or equivalent minimum cluster size of similarly looking individuals across all trial participants. This is also known as a group size, which is related to the concept of equivalence classes in k-anonymity while accommodating different implementations of the same concept in complex data types, such as longitudinal clinical data.33 The second dimension is the average data tolerance, or equivalent average cluster-size value, across all trial participants.

Cluster size is determined by the number of individuals who share the same indirectly identifying information. Figure 3 provides an illustrative example. In Figure 3, the highlighted data subjects form a cluster size of three since they all share identical values for the indirect identifiers of gender and year of birth.

Figure 3. Illustration of the cluster size concept, where the cluster size shown across two indirectly-identifying fields (gender and year of birth) is three due to three individual data subjects falling into this group.

If the minimum cluster-size value in this data set is two (strict tolerance level of 0.5), then every subject in the data set must have the same indirect identifier values as at least one other subject. In contrast, if the average cluster-size value is five (average tolerance level of 0.2), then the individuals in the data set must on average have the exact same indirect identifier values as four other subjects in the data. If a data set does not meet the desired minimum and average tolerance levels, then the indirect identifiers in the data set must be further transformed.
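The two checks can be sketched in Python as follows. This is an illustration under stated assumptions: the records and field names are invented, cluster sizes are counted within the data set itself rather than estimated for a wider population, and "average" is read as the mean of each participant's cluster size.

```python
from collections import Counter


def cluster_sizes(records, indirect_identifiers):
    """For each record, count how many records share its identifier values."""
    keys = [tuple(r[field] for field in indirect_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[key] for key in keys]


def meets_tolerances(sizes, min_cluster_size, avg_cluster_size):
    """Check the strict (minimum) and average cluster-size conditions."""
    strict_ok = min(sizes) >= min_cluster_size
    average_ok = sum(sizes) / len(sizes) >= avg_cluster_size
    return strict_ok and average_ok


# Hypothetical records with two indirect identifiers.
records = [
    {"gender": "F", "year_of_birth": 1970},
    {"gender": "F", "year_of_birth": 1970},
    {"gender": "F", "year_of_birth": 1970},
    {"gender": "M", "year_of_birth": 1982},
    {"gender": "M", "year_of_birth": 1982},
]
sizes = cluster_sizes(records, ["gender", "year_of_birth"])
print(sizes, meets_tolerances(sizes, min_cluster_size=2, avg_cluster_size=5))
```

In this toy data set the minimum cluster size is two, so the strict condition holds, but the average is well under five, so the identifiers would need further transformation to meet an average cluster size of five.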

Average tolerances are relevant for private data-sharing releases in which the target of an adversary attempting re-identification could be any data subject (for example, an acquaintance such as an ex-spouse). The reason a strict condition is still applied to private releases is to ensure that no individual in the data is unique in the defined population.33 The strict condition helps prevent “singling out” and is applied in private releases to indirect identifiers that may be used to single individuals out (eg, demographics). Recital 26 of the GDPR explicitly mentions singling out as a means of identification, and that “all the means reasonably likely to be used” need to be considered. Singling out would therefore be one such method that would always seem reasonably likely (if not a prerequisite) for the purposes of identification.53-55 Eliminating the ability to single out individuals is therefore a minimum condition, and, depending on the context and risks, average tolerance can then be evaluated for larger cluster sizes.

Tolerances also need to reflect real-world risks, which means evaluating cluster sizes for the population of individuals that gave rise to the data itself. The population we are concerned with is the one that contributes to the ability of someone to identify an individual in the shared or released data set. This may include the trial population, the population of similar trials, and the population in the same geographic area.9 Cluster sizes to determine identifiability are therefore evaluated using statistical estimators for the defined population.

When data are being made public, the minimum cluster size is more applicable in the statistical modeling because demonstration attacks are a risk. In a demonstration attack, an individual’s motive is to simply demonstrate that re-identification is possible, so the most identifiable record in the data is at greatest risk. Accordingly, for public releases, assume that an attack will occur and ensure a large minimum cluster size to protect against all types of attacks.

Basis for the data transformation rating levels

The degree to which data need to be transformed is determined by both the appropriate threshold and the context of the intended data disclosure (eg, security controls). This relationship is illustrated by the inequality on identifiability, shown in Figure 1, which we can rewrite as Data ≤ Threshold / Context. In other words, the probability given a re-identification opportunity is less than or equal to the threshold divided by the probability of a re-identification opportunity.
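To make the rearranged inequality concrete, the sketch below converts a threshold and a context probability into the largest permissible Data probability and the equivalent cluster size. The context values are purely illustrative assumptions, the function names are hypothetical, and exact fractions are used to avoid floating-point rounding at the boundaries.

```python
import math
from fractions import Fraction


def max_data_probability(threshold, context):
    """Largest Data value satisfying Data <= Threshold / Context (capped at 1)."""
    threshold, context = Fraction(threshold), Fraction(context)
    if not (0 < threshold <= 1 and 0 < context <= 1):
        raise ValueError("threshold and context must be in (0, 1]")
    return min(Fraction(1), threshold / context)


def required_cluster_size(threshold, context):
    """Smallest cluster size whose reciprocal stays within the Data limit."""
    return math.ceil(1 / max_data_probability(threshold, context))


threshold = Fraction(1, 11)  # assumption: 0.09 read as 1/11
for context in (Fraction(1), Fraction(1, 2), Fraction(1, 5)):
    print(context, required_cluster_size(threshold, context))
```

With a context of 1 (a re-identification opportunity is taken as certain), the required cluster size is 11. As the assumed context probability falls to 1/2 and 1/5, the required cluster size falls to 6 and 3, which is the sense in which stronger controls allow more granularity to be preserved.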

The data transformation rating can then use the relationship expressed in this inequality to prescribe the statistical point at which anonymization is achieved for an applicable context along a spectrum of identifiability. The less controlled the environment, the farther to the right the anonymization line moves and the more the data must be transformed. The more controlled the environment, the farther the anonymization line moves to the left and the more granularity and utility can be preserved in the data. Determining the optimal balance is key to ensuring that the most useful data are shared while sustaining consistent, proven privacy protection.

The minimum and average cluster-size equivalents provide the necessary guidance for transforming the data to the degree necessary for the context and applicable threshold.

Handling dates

Dates can be invaluable for secondary analysis, but they can also be individually identifying. For secondary analysis and research, the sequencing and spacing in time between events for trial participants matter more than specific calendar dates. Accordingly, in conjunction with addressing the SAFE Data Standard tolerances, sponsors should zero the dates in a study through a proven method such as PHUSE date shifting.56 Offsetting dates reduces the level of date precision for participants while retaining utility. This approach can be part of a data transformation strategy to achieve the data tolerance and is also incorporated by default for all SAFE Data Standard levels.
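The property that matters here, that offsetting preserves sequencing and spacing, can be shown with a short Python sketch. It is a generic illustration and does not reproduce the PHUSE method; the event names, dates, and the fixed offset are hypothetical.

```python
from datetime import date, timedelta


def shift_dates(events, offset_days):
    """Shift every date for one participant by the same number of days."""
    delta = timedelta(days=offset_days)
    return {name: day + delta for name, day in events.items()}


def days_from_anchor(events, anchor):
    """Express each event as a day count relative to an anchor event."""
    return {name: (day - events[anchor]).days for name, day in events.items()}


# Hypothetical participant timeline and offset.
events = {
    "enrolment": date(2020, 3, 1),
    "first_dose": date(2020, 3, 8),
    "adverse_event": date(2020, 4, 2),
}
shifted = shift_dates(events, offset_days=-45)

# Calendar dates change, but the spacing between events does not.
print(days_from_anchor(events, "enrolment"))
print(days_from_anchor(shifted, "enrolment"))
```

Both printed dictionaries are identical, because a single offset applied to all of a participant's dates leaves the intervals between events untouched.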

Additional considerations that may be warranted

The data transformation rating defined by the cluster-size thresholds is designed to address the risks of a deliberate re-identification attempt and of a breach, both of which are mitigated by the controls in place, which is why the approach taken is proportionate to the level of control. However, in some cases, there is an additional risk to be managed: inadvertent or spontaneous recognition of an acquaintance in the data by the researcher or analyst working with the data.

At the level 5 rating for open data (under a license or terms of use), the transformations recommended would inherently address this risk. But for other data rating levels, additional protections may be warranted. If the data are going to be analyzed by researchers who reside in the same region as a high concentration of trial participants, the sponsor may want to transform the data further to mitigate this risk. In most cases, due to the small populations for clinical trials and the infrequency of well-recognized celebrities enrolling in clinical trials, the likelihood of spontaneous recognition of an acquaintance is lower than the likelihood of a breach or deliberate attempt to re-identify. Accordingly, the recommended tolerances introduced will typically transform the data enough to protect against this threat.

Data utility

While privacy protection and relative utility are consistent at a given level in the SAFE Data Standard, the usefulness of the data for a given analytic objective can still vary. The data thresholds prescribed for a given SAFE Data Standard level can be achieved in multiple ways. Take, for example, a scenario in which the demographic indirect identifiers in a study have been transformed more than those associated with medical events, vital statistics, and substance use. That study could achieve the same rating level if the demographic identifiers were transformed less and the other identifiers were transformed more. In either scenario, the privacy level and the overall utility would be the same, but the usefulness of the data for a specific objective could vary significantly. For certain analytics objectives, the usefulness of the data resulting from one set of transformations could be much greater than from the other. For example, if exact age is critically important to retain for the desired analysis, more transformation to medical histories or substance use may be acceptable.

Measures of utility will depend on the intended use of the data. Because such uses are not always known upfront when data are shared, it is not always easy to optimize the distribution of transformations over the entirety of a given study. To the extent possible, stakeholders expecting to use the anonymized data (eg, researchers, internal data scientists) should provide input on what is most important to them upfront. For more open, generalized data disclosures serving many different audiences (eg, publication by Health Canada or EMA), guiding principles can promote sound decision making on utility trade-offs. For example, each disease area or indication can have certain prognosis factors that are prioritized for retention in the anonymization process.

While qualitative aspects of data utility are important, the degree to which utility metrics can be quantified can further augment and standardize the anonymization process. Global utility metrics can play a role when combined with other measures. Tailored utility metrics based on qualitative measures of importance (eg, weights assigned to prognosis factors by indication) can be even more effective in practice, particularly when defined in consultation with stakeholders who may use the data. As a practical example, data-sharing platforms or even industry consortia may identify priorities by indication that can be translated to retention metrics measured across clinical trial sponsors contributing anonymized data.

Open communication and a commitment to producing useful data that are nonidentifiable go a long way to meeting the needs of everyone involved. Ideally, framing that conversation around the inherent trade-offs can promote stakeholder alignment and support collaboration to make the most useful data available.

Proposed assessment framework

What the assessment framework provides

A simplified rating system can standardize data utility and privacy for more effective, efficient, and trustworthy transparency. As presented, the data transformation rating specifies how much the data at each rating level have been transformed and gives an indication of how much utility the data have retained.

To ensure that a data transformation rating is appropriate for a given situation, the proposed assessment framework not only provides a rating scale for data transformations but also prescribes appropriate uses and contextual measures to accompany each data transformation level.

Use, controls, and recipient trust: Variables that influence the data transformation rating

From our last inequality, Data ≤ Threshold / Context, one can see that the level of data transformation needed for a data set is proportionate to the desired threshold and its relation to the data release context, or Data ∝ Threshold / Context. We mean proportionate in the general sense of forming a relationship. Once the threshold is selected, the relationship between data and context could be linear, or the relationship could be monotonic.

While some trials may entail more sensitive personal information (eg, an HIV trial versus a rheumatoid arthritis trial), the potential privacy harm should be balanced with the potential non-disclosure harm (or looked at the other way, the harm should be weighed against the benefits). While the EMA and Health Canada consider that sponsors may adopt different thresholds with evidence-based rationale, the recommended 0.09 generally applies independently of trial sensitivity. Doing so prevents disclosure bias toward less sensitive diseases (ie, favoring more transparency and disclosure for research into some conditions over others) and ensures that privacy protection is the dominant influencer on the granularity of data shared (which will naturally result in less granular information shared on rare disease trials, as an example, but for reasons of privacy protection). Given this standard, several sponsors have adopted 0.09 as a guiding principle for their external disclosure practices. However, thresholds may be adjusted based on circumstances, such as those involving highly sensitive and stigmatizing data, while taking into account participant expectations.57

Given the established industry standard for external disclosures, we narrow our focus for determining the threshold based on the nature of data use, meaning benefits and reasonably expected individual approval related to such uses. In other words, for Data ∝ Threshold / Context, the level of data transformation needed is proportionate to the threshold, which is informed by the nature of use, divided by the context, which is influenced by controls and trust. Threshold and context are in turn dependent on the type of data use, the controls in place to prevent re-identification, and the level of trust in data recipients, or data transformation ∝ use / (controls and trust).

Thus, there are three distinct variables in the assessment framework that must be considered to determine whether a data transformation is situationally appropriate.

  1. Use: The desired threshold is determined by the intended data use. Whether the data are being disclosed to external party(ies) or reused by the sponsor of the trial for new or secondary research can influence which threshold would be considered appropriate. There is established consensus on a threshold of 0.09 (cluster size of eleven) for external disclosures of trial data subject to basic terms of use. For internal reuses of data by a sponsor for further R&D beyond the original trial for which data was collected and consented to, a higher threshold can be argued for based on the extended participant and societal benefits of R&D and trial participants’ general expectations of how sponsors use collected data.1-3,7,8,58 Therefore, where the use (or reuse) of trial data by a sponsor is generally aligned to the purposes of collection from participants (eg, R&D), a higher threshold can preserve more utility. In all cases, appropriate justification should be documented by the sponsor along with the details of the anonymization strategy implemented.
  2. Controls: The data release context is defined in part by the extent of security and privacy controls in place as part of the data release. These controls prevent deliberate re-identification attempts and reduce the risk of a breach.
  3. Recipient Trust: An assessment of the data release context also requires that one consider the degree to which recipients of data are known, trusted, and subject to enforceable terms of use that deter or prevent actions that would increase the likelihood of re-identification. Enforceability refers to the sponsor’s ability to impose financial and/or reputational consequences, through legal action, loss of funding, or otherwise, for non-compliance (eg, through a legally enforceable data sharing contract).
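The 0.09 threshold and the cluster size of eleven cited under Use are two views of the same quantity, since 1/11 ≈ 0.0909. The Python sketch below is a minimal illustration of that relationship, assuming risk is measured as the reciprocal of the smallest group of participants who share the same indirectly identifying values. This is one common way to operationalize such a threshold and is not necessarily the exact calculation used by the authors. The function names, field names, and sample records are hypothetical.

```python
from collections import Counter

def smallest_cluster(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same
    values on the chosen indirectly identifying fields."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def max_risk(records, quasi_identifiers):
    """Assumed risk measure: reciprocal of the smallest cluster size."""
    return 1 / smallest_cluster(records, quasi_identifiers)

def meets_cluster_size(records, quasi_identifiers, k=11):
    """True when every cluster has at least k members (k=11 ~ threshold 0.09)."""
    return smallest_cluster(records, quasi_identifiers) >= k

# Hypothetical data: eleven participants who look alike on two fields.
qi = ["age_band", "sex"]
cohort = [{"age_band": "40-49", "sex": "F"} for _ in range(11)]
print(round(max_risk(cohort, qi), 2))   # 0.09
print(meets_cluster_size(cohort, qi))   # True

# Adding one participant who is unique on these fields breaks the threshold.
outlier = cohort + [{"age_band": "80-89", "sex": "M"}]
print(meets_cluster_size(outlier, qi))  # False
```

The check compares cluster sizes rather than the rounded 0.09 value, because 0.09 is a rounding of 1/11 and comparing against it directly would reject a cluster of exactly eleven.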

To standardize and achieve a common, defensible rating, these contextual factors need to be evaluated consistently. The following sections provide an assessment framework for controls and recipient trust.

Evaluating use

As introduced earlier, sponsors should consider the ethics and bioethics of data sharing. The SAFE Data Standard pertains to data reuses and disclosures in which there are potential public or societal benefits, whether through new drug discovery or R&D, trial transparency, advancements in clinical research, or otherwise. Data reuses and disclosures that do not pertain to human health are excluded from the envisioned application of the SAFE Data Standard. For the purposes of the rating scale, two generalized types of trial data reuse can be distinguished as follows. Both carry an expectation of broader benefits, and they may differ in the degree of anticipated approval by trial participants:

  1. Internal reuses of the data by the trial sponsor
  2. External reuses and third-party disclosures

Evaluating controls

If no controls are in place (for example, if data are being made available to the public for download), only the data need to be evaluated. However, for platforms or internal environments that do enforce privacy and data security, the following scale can be used to characterize the level of control. If the minimal level is not achieved (ie, if the basic controls are not in place), then a “zero control” context is assumed in determining the SAFE Data Standard rating.

Figure 4. Scale for evaluating levels of privacy and data security controls.

Evaluating recipient trust

If there are no enforceable terms of use established with data recipients, nothing needs assessing. However, if data access is restricted to known entities who agree to terms of use, the following scale can be used to characterize the level of recipient trust. If the enforceable criteria are not demonstrated, then no recipient trust is assumed in the SAFE Data Standard rating.

Figure 5. Scale for evaluating levels of recipient trust.
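The fallback rules for controls and recipient trust share the same shape: an assessed level counts toward the rating only when the baseline criteria are met, and otherwise the context is treated as zero. A minimal Python sketch of that gating logic follows. The integer levels, function names, and dictionary keys are hypothetical placeholders, since the actual scales are defined in Figures 4 and 5 and are not reproduced here.

```python
def effective_level(assessed_level, baseline_met):
    """Return the level to carry into the rating.

    If the baseline criteria are not met, the level falls back to 0
    ("zero control" or "no recipient trust"), whatever was assessed.
    """
    if assessed_level < 0:
        raise ValueError("assessed_level must be non-negative")
    return assessed_level if baseline_met else 0

def release_context(controls_level, basic_controls_met,
                    trust_level, enforceable_terms_met):
    """Combine the two contextual factors after applying the fallback."""
    return {
        "controls": effective_level(controls_level, basic_controls_met),
        "recipient_trust": effective_level(trust_level, enforceable_terms_met),
    }

# Public download: no controls and no terms of use, so only the data are evaluated.
print(release_context(0, False, 0, False))

# Hypothetical platform with controls in place but no enforceable terms of use.
print(release_context(2, True, 3, False))
```

Keeping the two factors separate until the end mirrors the text, which assesses controls and recipient trust independently before they jointly inform the rating.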

A complete framework for rating data and data-sharing platforms

In summary, the proposed data transformation rating is from 0 to 5, where 0 is the raw data and the scale from 1 to 5 reflects varying degrees of data transformation proportional to the type of use and contextual controls in place to protect data from re-identification opportunities. Consistently evaluating the uses of anonymized data and the anonymization context can speed and standardize anonymization processes applied across sponsors and enforce a common baseline for privacy while maximizing data utility and analytic benefits. Figure 6 summarizes the data transformation ratings from 0 to 5.

Figure 6. SAFE Data Standard scale reflecting the degree of data transformation to achieve privacy protection.

Once data are transformed as part of the anonymization process, sponsors should retain reports detailing the anonymization approach taken and associated justifications for auditability.

To illustrate the concept of the SAFE Data Standard, we applied the data tolerances from level 1 to 5 to simulate the transformation impacts on indirectly identifying data from a clinical study. (See Figure 7 below.) For clarity, directly identifying information or unique identifiers are masked or removed (for instance, subject IDs are replaced with pseudonyms and site IDs are removed) and non-identifying information, such as a blood glucose reading (which can change frequently), is preserved during the anonymization process.

While the individual variable-level transformations will depend on the study characteristics in practice, including how distinguishable the participants are in the defined population and what preferences for data utility are incorporated (eg, country may be generalized to continent), the general trend of greater transformation and lower utility as you progress from level 1 to 5 is consistent. Figure 7 summarizes the results for the simulation performed, providing an example (not a ruleset) of how the SAFE Data Standard can be applied in practice across a range of disclosure contexts.
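As a rough illustration of the trade-off described above, the Python sketch below generalizes two indirectly identifying fields (country to continent, as in the example given, and age to ten-year bands) and shows how the smallest group of look-alike participants grows as detail is removed. The records, the country-to-continent mapping, the age banding, and all function names are hypothetical. This is not the simulation the authors ran, and real transformations would depend on the study characteristics.

```python
from collections import Counter

# Hypothetical lookup used only for this illustration.
CONTINENT = {
    "France": "Europe",
    "Germany": "Europe",
    "Canada": "North America",
    "Mexico": "North America",
}

def age_band(age, width=10):
    """Generalize an exact age to a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize(record):
    """Replace country with continent and exact age with an age band."""
    return {"region": CONTINENT[record["country"]],
            "age": age_band(record["age"])}

def smallest_cluster(records):
    """Size of the smallest group of records with identical values."""
    counts = Counter(tuple(sorted(r.items())) for r in records)
    return min(counts.values())

raw = [
    {"country": "France", "age": 41},
    {"country": "Germany", "age": 47},
    {"country": "Canada", "age": 52},
    {"country": "Mexico", "age": 58},
]
generalized = [generalize(r) for r in raw]

print(smallest_cluster(raw))          # 1: every participant is unique
print(smallest_cluster(generalized))  # 2: participants now share values
```

Each participant is unique in the raw records, while after generalization every participant shares values with one other, at the cost of losing the exact country and age.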

Figure 7. Example simulation to illustrate how the SAFE Data Standard can be applied in practice to clinical study data, recognizing that this is an example and not a rule (ie, actual transformations in practice will depend on study characteristics, such as how distinguishable the participants are statistically).

The simulation summarized in Figure 7 was based on a randomized, double-blind diabetes study sponsored by Janssen, and Janssen has since made the anonymized data available for secondary research through The YODA Project.59


Stephen Bamford, Sarah Lyons, Luk Arbuckle, and Pierre Chetelat conceived the article and were involved in writing and revising it. Stephen had the initial idea for the standard and the article, and he provided critical feedback throughout. Sarah developed the framework for the standard, produced a first draft and sought feedback on the draft from others in the field. Luk reviewed multiple iterations and added important intellectual content. Pierre reviewed the draft, enhancing content and simplifying criteria from prior standards. All authors were responsible for the content and for approving the final submitted manuscript. Stephen Korte and Ahmad Hachem (non-author contributors) ran statistical simulations of the SAFE Data Standard.

Stephen Bamford, Head of Clinical Data Standards & Transparency, The Janssen Pharmaceutical Companies of Johnson & Johnson, Sarah Lyons, General Manager, Privacy Analytics, Luk Arbuckle, Chief Methodologist, Privacy Analytics, and Pierre Chetelat, Research Associate, Privacy Analytics


  1. Institute of Medicine. Sharing clinical trial data: maximizing benefits, minimizing risk. Washington, D.C.; 2015. PMID:25590113
  2. Mello MM, Lieou V, Goodman SN. Clinical trial participants’ views of the risks and benefits of data sharing. N Engl J Med. 2018 Jun 7;378(23):2202–11. PMID:29874542
  3. Howe N, Giles EL, Newbury-Birch D, McColl E. Systematic review of participants’ attitudes towards data sharing: a thematic synthesis. J Health Serv Res Policy. 2018;23(2):123–33. PMID:29653503
  4. El Emam K, Abdallah K. De-identifying data in clinical trials. Applied Clinical Trials. 2015; 24(8):40–48. Available from: [accessed Mar 18, 2022]
  5. Meystre SM, Lovis C, Bürkle T, Tognola G, Budrionis A, Lehmann CU. Clinical data reuse or secondary use: current status and potential future progress. Yearb Med Inform. 2017 Aug;26(1):38–52. PMID:28480475
  6. Lyons S, Fagan V. Pharmaceutical clinical trials transparency and privacy. Medical Writing. 2020;29(4):52–7.
  7. The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. The Belmont report. Washington, D.C.: Department of Health, Education, and Welfare.; 1979 Apr. Available from: [accessed Nov 26, 2021].
  8. European Commission. Ethics and data protection. 2018. Available from: [accessed Nov 26, 2021].
  9. PHUSE Data Transparency Working Group. A global view of the clinical transparency landscape—best practices guide. PHUSE; 2020 p. 60. Report No.: WP-35. Available from: [accessed Nov 26, 2021].
  10. European Medicines Agency policy on publication of clinical data for medicinal products for human use. Available from: [accessed Nov 26, 2021].
  11. Health Canada. Public release of clinical information: guidance document. 2019. Available from: [accessed Nov 26, 2021].
  12. PHUSE Data Transparency Working Group. Data anonymisation and risk assessment automation. PHUSE; 2020 p. 28. Report No.: WP-045. Available from: [accessed Nov 26, 2021].
  13. Article 29 Data Protection Working Party. Opinion 05/2014 on anonymisation techniques. Brussels, Belgium; 2014 Apr. Report No.: WP216. Available from: [accessed November 26, 2021]
  14. AEPD, EDPS. 10 misunderstandings related to anonymisation. Agencia Española de Protección de Datos & European Data Protection Supervisor; 2021 p. 7 Available from: [accessed Nov 26, 2021].
  15. Polonetsky J, Tene O, Finch K. Shades of gray: seeing the full spectrum of practical data de-identification. Santa Clara Law Review. 2016;56:593–629.
  16. Hintze M, El Emam K. Comparing the benefits of pseudonymisation and anonymisation under the GDPR. Journal of Data Protection & Privacy. 2018 Dec 1;2(1):145–58.
  17. European Medicines Agency. External guidance on the implementation of the European Medicines Agency policy on the publication of clinical data for medicinal products for human use. 2018 Oct. Report No.: EMA/90915/2016 Version 1.4. Available from: [accessed Nov 26, 2021].
  18. Benschop T, Welch M. Statistical disclosure control for microdata: theory. SDC Theory Guide documentation. 2019; Available from: [accessed Nov 26, 2021].
  19. Barbaro M, Zeller Jr. T. A face is exposed for AOL searcher no. 4417749. New York Times. 2006; Available from: [accessed Nov 26, 2021].
  20. Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets. In: Proceedings of the 2008 IEEE Symposium on Security and Privacy. 2008. p. 111–25. doi:10.1109/SP.2008.33
  21. Sweeney L. Only you, your doctor, and many others may know. Technology Science. 2015 Sep 29; Available from: [accessed Nov 26, 2021].
  22. Rubinstein I, Hartzog W. Anonymization and risk. Washington Law Review. 2016;91:703–60.
  23. Branson J, Good N, Chen J-W, Monge W, Probst C, El Emam K. Evaluating the re-identification risk of a clinical study report anonymized under EMA Policy 0070 and Health Canada regulations. Trials. 2020 Feb 18;21(1):200. PMID:32070405
  24. Gray SV, Hill E. The academic data librarian profession in Canada: history and future directions. Western Libraries Publications Paper 49. 2016;15. Available from: [accessed Nov 26, 2021].
  25. Ritchie F. UK release practices for official microdata. Statistical Journal of the IAOS. 2009;26(3,4):103–11.
  26. The data spectrum – the ODI. Available from: [accessed Nov 26, 2021].
  27. Tam S-M, Farley-Larmour K, Gare M. Supporting research and protecting confidentiality. ABS microdata access: current strategies and future directions. Statistical Journal of the IAOS. 2009;26(3,4):65–74.
  28. Desai T, Ritchie F, Welpton R. Five safes: designing data access for research. Bristol, UK: University of the West of England; 2016 p. 27. (Economics Working Paper Series). Report No.: 1601. Available from: [accessed Nov 26, 2021].
  29. Arbuckle L, Ritchie F. The five safes of risk-based anonymization. IEEE Security & Privacy. 2019 Oct;17(5):84–9. doi:10.1109/MSEC.2019.2929282
  30. Arbuckle L, Muhammad Oneeb Rehman Mian. Engineering risk-based anonymisation solutions for complex data environments. Journal of Data Protection & Privacy. 2020;3(3):334–43.
  31. Rydzewska LHM, Stewart LA, Tierney JF. Sharing individual participant data: through a systematic reviewer lens. Trials. 2022;23:167.
  32. Marsh C, Skinner C, Arber S, Penhale B, Openshaw S, Hobcraft J, et al. The case for samples of anonymized records from the 1991 census. Journal of the Royal Statistical Society, Series A (Statistics in Society). 1991;154(2):305–40. doi:10.2307/2983043
  33. El Emam K, Arbuckle L. Anonymizing health data: case studies and methods to get you started. Sebastopol, CA: O’Reilly Media; 2013. ISBN:1449363075
  34. PHUSE— the global healthcare data science community. Available from: [accessed Nov 26, 2021].
  35. TransCelerate - pharmaceutical research and development. Available from: [accessed Nov 26, 2021]
  36. CRDSA - Clinical Research Data Sharing Alliance. Available from: [accessed Nov 26, 2021].
  37. Stadler T, Oprisanu B, Troncoso C. Synthetic data -- anonymisation groundhog day. arXiv:2011.07018 [cs]. 2021 Sep 22 [cited 2021 Oct 4]; Available from: [accessed Nov 26, 2021].
  38. Wagner I, Eckhoff D. Technical privacy metrics: a systematic survey. ACM Comput Surv. 2018 Jun 12;51(3):57:1-57:38. doi:10.1145/3168389
  39. Biswal D, Arbuckle L, Kulik R. Disclosure metrics born from statistical evaluations of data utility. In: Proceedings of UNECE Work Session on Statistical Confidentiality. Poznań, Poland; 2021. Available from: [accessed Nov 26, 2021].
  40. Ian Oppermann. Privacy in data sharing: a guide for business and government. Australian Computer Society; 2018 p. 100. Available from: [accessed Nov 26, 2021].
  41. Ian Oppermann. Privacy preserving data sharing frameworks: report on July 2019 directed ideation #2 series. Australia: Australian Computer Society; 2019 p. 43. Available from: [accessed Nov 26, 2021].
  42. Ritchie F. Microdata access and privacy: what have we learned over twenty years? Journal of Privacy and Confidentiality [Internet]. 2021;11(1):8. doi:10.29012/jpc.766
  43. Vivli - Center for Global Clinical Research Data. Available from: [accessed Nov 26, 2021].
  44. Center for outcomes research & evaluation, Yale school of medicine ‘YODA project’. Available from: [accessed Nov 26, 2021].
  45. Project Data Sphere. Available from: [accessed Nov 26, 2021].
  46. Nevitt SJ, Marson AG, Davie B, Reynolds S, Williams L, Smith CT. Exploring changes over time and characteristics associated with data retrieval across individual participant data meta-analyses: systematic review. BMJ. 2017;357. PMID:28381561
  47. Nevitt S. Data sharing and transparency: the impact on evidence synthesis. Dissertation. The University of Liverpool (United Kingdom); 2017. Available from: [accessed Nov 26, 2021].
  48. Ferran J-M, Nevitt SJ. European medicines Agency policy 0070: an exploratory review of data utility in clinical study reports for academic research. BMC medical research methodology. 2019;19(1):1–10. PMID:31690260
  49. Nevitt SJ, Ferran J-M. Data utility in anonymised clinical study reports (CSRs). PhUSE 2017. Available from: [accessed Nov 26, 2021].
  50. El Emam K, Boardman R. Are clinical trial data shared by the EMA under policy 0070 really public data? Applied Clinical Trials. 2018. Available from: [accessed Nov 26, 2021].
  51. Arbuckle L, El Emam K. Building an anonymization pipeline: creating safe data. Sebastopol, CA: O’Reilly Media; 2020. ISBN:1492053430
  52. El Emam K, Jonker E, Arbuckle L, Malin B. A systematic review of re-identification attacks on health data. PLoS ONE. 2011;6(12). PMID:22164229
  53. Jay R, Malcolm W, Parry E, Townsend L, Bapat A. Guide to the General Data Protection Regulation. Sweet & Maxwell; 2017. ISBN:0414061012
  54. What is personal data? ICO; 2021. Available from: [accessed Nov 26, 2021].
  55. Borgesius FJZ. Singling out people without knowing their names–behavioural targeting, pseudonymous data, and the new Data Protection Regulation. Computer Law & Security Review. 2016;32(2):256–71.
  56. PhUSE De-Identification Working Group. De-identification standards for CDISC SDTM 3.2. 2015.
  57. Frederic Gerdon, Helen Nissenbaum, Ruben L. Bach, Frauke Kreuter, Stefan Zins. Individual acceptance of using health data for private and public benefit: changes during the COVID-19 pandemic. Harvard Data Science Review. 2021;28. Available from: [accessed Nov 26, 2021].
  58. Kass NE, Natowicz MR, Hull SC, Faden RR, Plantinga L, Gostin LO, et al. The use of medical records in research: what do patients want? J Law Med Ethics. 2003;31(3):429–33. PMID:14626550
  59. The YODA Project (NCT01809327). A randomized, double-blind, 5-arm, parallel-group, 26-week, multicenter study to evaluate the efficacy, safety, tolerability of canagliflozin in combination with metformin as initial combination therapy in the treatment of subjects with type 2 diabetes mellitus with inadequate glycemic control with diet and exercise. Available from: [accessed Nov 26, 2021].