De-identifying Clinical Trials Data


Applied Clinical Trials

Whether we are referring to CSRs or to IPD, the personal information of trial participants needs to be de-identified prior to release. This article will describe the available methods for de-identifying clinical trial data, and the relative strengths and weaknesses of each.


There is a recognition within the research community that the re-analysis of clinical trials data can provide new insights compared to the original publications.1 In addition, the increasing complexity of science requires multiple teams to collaborate and analyze the same data set from different perspectives, pool data sets, and link multiple data sets.

Regulators at the European Medicines Agency (EMA) have also issued a policy on the release of clinical study reports (CSRs) for drugs receiving marketing authorization.2 A manufacturer is expected to provide the agency with the full CSR as normal, and a de-identified copy, which the agency will then release. There is an expectation that in the coming years another policy from the EMA on the release of individual participant data (IPD) from these clinical trials will be issued as well. To achieve the benefits of data sharing, and in anticipation of a policy on IPD from the EMA, manufacturers have already started implementing the infrastructure and mechanisms to release IPD.

Whether we are referring to CSRs or to IPD, the personal information of the trial participants needs to be de-identified. De-identification means that the probability of determining the identity of an individual in the data set is very small. In this article, we will describe the methods that can be used to de-identify clinical trial data, and their relative strengths and weaknesses.

Data sharing models

Manufacturers may select trials that they wish to share data for, and then de-identify the data sets in advance of any data request for all of these selected trials. When a request for a data set is received, the prepared data set is then provided. Another approach is to only de-identify a data set after a request has been made and approved.

The former approach ensures an expeditious response to data requests, but has a number of disadvantages. The first is that it assumes that all selected trials data sets will at some point be requested. This is not necessarily going to be the case. Therefore, data sets may be de-identified that are never shared. The second disadvantage is that the de-identification is not calibrated to the needs of the data requestor. For example, a de-identification exercise may generalize the geography from country to continent, but the data analyst may wish to do country-by-country comparisons or include country as a covariate in their analysis. This would not be possible if the data had already been de-identified in advance without geography at the country level.

Therefore, the choice of approach is a trade-off between responsiveness, efficiency, and data quality. There are examples of each approach being employed in practice. This trade-off applies to both CSR and IPD de-identification.

For sharing IPD, there are two commonly used mechanisms: (a) microdata, and (b) a portal. Under the former mechanism, individual participant records (microdata) are released to the data recipient, such as through the Project Data Sphere. The data could be in the format of SAS files, a set of simple comma-separated values files, or some other format. Alternatively, the manufacturer may provide access to the data through a portal. This second option does not allow the data recipient to download the raw data, and all analysis is performed through the portal. The portal would have common statistical and graphing software pre-installed, allowing the data recipient to perform the necessary analysis wholly within the portal environment.


Re-identification risks

It is important to keep in mind that de-identification is a legal requirement in many jurisdictions. That means that it is not permissible to share personal information without consent where the probability of identifying an individual is high. If a manufacturer does so then there may be negative legal, financial, and reputational consequences. These negative consequences may accrue directly from regulators, litigation, and bad media coverage.



There are a number of plausible risks and attack vectors on a clinical trial data set. One risk to consider is that of a study participant filing a complaint with their regulator. This may be a data protection authority in the European Union (EU), or a privacy commissioner in Canada, for example. In such a case, the regulator must make a determination of whether the data set was truly de-identified. If the determination was that the data was not properly de-identified, then the data release would be considered a data breach. Some manufacturers can attempt to pre-empt such a risk by obtaining support from their regulator in advance of data sharing. However, such support would be needed from every jurisdiction from where data is being collected.

If the mechanism of data release is a microdata set, then there are three plausible re-identification attacks on the data by an adversary that need to be protected against, as summarized in Table 1.3,4,5 To mitigate these attacks, it is necessary to: (a) have the data recipients sign a contract or agree to terms of use prohibiting re-identification, (b) have that contract impose strong security and privacy practices on the data recipient, and (c) modify the data itself to reduce the risk of re-identification.

In the case where the data release mechanism is a portal, the portal operator can implement many of the security controls, but the data recipient still needs to implement some practices at their end as well (e.g., from basic things like not sharing their passwords for the portal, to security measures on the machines they use to access the portal). In addition, portal operators almost always require some kind of contract or terms of use. Therefore, the main risk when using a portal to share data is an inadvertent re-identification by a data user, which still requires the data to be modified.

Therefore, defensible de-identification needs to be deployed in a manner that will satisfy multiple regulators, and irrespective of whether the data is being shared as microdata files or through a portal.


De-identification principles

In general, no consent for data sharing was sought from historical trials, and, therefore, de-identification would be a legal requirement for sharing those data. Prospectively, it would be possible to ask new participants for consent. However, consent to share is typically accompanied by assurances that the data sharing will be of de-identified data and that responsible practices will be employed.

There are two general approaches to de-identification exemplified by the two methods stipulated in the Privacy Rule of the US Health Insurance Portability and Accountability Act (HIPAA) of 1996:6 (a) Safe Harbor, and (b) Expert Determination.

A Safe Harbor approach requires that a specific pre-defined and fixed set of variables or information types are removed or generalized in all clinical trial data sets. It is a simpler approach, and has been adopted by some manufacturers. This approach has a number of disadvantages:7

  • The Safe Harbor method is only legally accepted in the US and not in any other jurisdiction. Furthermore, its empirical foundation was based on an analysis of US census data. This method has no empirical basis in other jurisdictions.

  • Assuming a fixed set of variables upfront will not account for all cases where there may be more variables that need to be de-identified nor does it account for changes in technology where additional variables become risky over time from a re-identification perspective.

  • It is known that data sets that meet the Safe Harbor standard can still have a high risk of re-identification.

In order to have defensible de-identification practices, they need to meet some basic principles (which are derived from the Expert Determination method under the HIPAA Privacy Rule):

  • Use generally accepted statistical and computational practices in the disclosure control community to de-identify data.

  • Ensure that the risk of re-identification is very small.

  • The de-identification should be performed by an analyst with appropriate knowledge of these de-identification practices.

  • Document the methods used and their results.




De-identification results in the loss of information in the data, or in a reduction in data utility. This means that the data is perturbed in some manner (various techniques will be described ahead). The amount of perturbation should not be severe; otherwise the conclusions drawn from the analysis of the de-identified data may be different from the conclusions drawn from the analysis of the original data. It would not be acceptable, and indeed could be quite problematic, if the conclusions from the same analysis on de-identified data are materially different from those in published studies. It has been argued, however, that a re-analysis that does nothing but confirm the original conclusions probably has little scientific value and would not be publishable anyway.8

On the other hand, when preparing a de-identified data set, the amount of perturbation cannot be so minimal that the probability of identifying an individual participant remains high. Therefore, a balance must be reached between the two objectives: high utility data and an acceptably low probability of re-identification.

In practice, simply stating that the probability of re-identification is small will not be sufficient. It is necessary to have objective evidence demonstrating that the probability of re-identification is small. The primary reasons are that a simple statement may not be convincing to a regulator investigating a complaint from a trial participant, a regulator investigating a breach, or in litigation about whether personal information was disclosed.

Let us explore a de-identification example in which the date of birth of participants in a clinical trial data set is changed to year of birth, and no other variables in the dataset are altered. The analyst performing the de-identification states and believes that the risk of re-identification is acceptably small. Imagine then there is a data breach, a complaint to a privacy regulator about data sharing by one of the trial participants, or litigation arguing that the shared data is not de-identified properly. The only evidence that the manufacturer has to support its claim that the data is indeed de-identified is the analyst’s statements. These may be countered by another analyst’s claims on the plaintiff’s side. Without objective data demonstrating that the risk is small, the manufacturer’s claims can be easily discounted or cast into doubt. This is not an issue to be taken lightly by manufacturers or those developing de-identification standards.

Consequently, it is important to consider methods to measure the probability of re-identification and to demonstrate that the risk of re-identification is indeed acceptably small.
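As a minimal sketch of what such a measurement can look like, the following Python snippet groups records by their quasi-identifier values and takes the probability of re-identifying a record to be one over the size of its group. This equivalence-class metric is a common one in the disclosure control literature, but it is an assumption here that it suits any particular release; the function name, field names, and sample records are hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Return (maximum risk, average risk) across all records.

    Records sharing the same quasi-identifier values form one group;
    each record's risk is taken to be 1 / (size of its group).
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    per_record = [1.0 / group_sizes[k] for k in keys]
    return max(per_record), sum(per_record) / len(per_record)


# Hypothetical records after date of birth was generalized to year of birth.
sample = [
    {"birth_year": 1970, "sex": "F", "country": "CA"},
    {"birth_year": 1970, "sex": "F", "country": "CA"},
    {"birth_year": 1948, "sex": "M", "country": "CA"},
]
max_risk, avg_risk = reidentification_risk(sample, ["birth_year", "sex", "country"])
print(f"maximum risk: {max_risk:.2f}, average risk: {avg_risk:.2f}")
```

Which of the two numbers to compare against a threshold, and what that threshold should be, depends on the context of the release and is not settled by the code.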

This basic principle is applicable to the de-identification of CSRs and IPD. Since both of these types of information pertain to the same study, it would not be appropriate to have more personal information included in a shared CSR but less in an IPD, or vice versa. Doing so could leak personal information unintentionally on a systematic basis. The level of de-identification needs to be consistent for both.

Personal information in a clinical trial data set

A data set has two types of variables that can leak personally identifying information: direct identifiers and quasi-identifiers. Direct identifiers are features like names, telephone numbers, full addresses, social security numbers, clinical trial participant numbers, and medical device serial numbers. That kind of information is considered to easily identify trial participants and is either removed from a de-identified file that is shared, or in the case of unique identifiers, pseudonymized. A classification of variables into direct and quasi-identifiers for clinical trials has been completed by a PhUSE (Pharmaceutical Users Software Exchange) working group.9



Sometimes it is necessary to disclose unique identifiers, such as a clinical trials participant number or subject ID. Such unique identifiers are important to allow the linking of records that belong to the same individual. While some may argue that it is hard to re-identify individuals using this type of number, it is explicitly mentioned by regulators as a direct identifier, for example, in the HIPAA de-identification guidelines.10 These types of unique identifiers would be converted into pseudonyms to allow multiple records that belong to the same participant to be linked together. There are established standards for such pseudonymization.11

The second type of variable, the quasi-identifiers, consists mostly of dates, location information, demographics, and socio-economic information. Rare diagnoses, concomitant illnesses and medications, and serious adverse events such as death, hospitalization, and birth defects would also be considered quasi-identifiers.

This type of information can identify individuals in a data set: All known re-identification attacks (on non-genetic data) have been performed using only these quasi-identifiers.12 Therefore, it is important to deal with quasi-identifiers. However, we cannot just remove them as these variables are very useful for the analysis. More sophisticated techniques need to be applied to retain the value of these variables but also reduce the probability that these variables can re-identify participants.5

Modifying direct identifiers

Manipulations of direct identifiers effectively eliminate the risk of re-identification from that information. Specific issues do arise, however.

Direct identifiers in text

Most direct identifiers, such as participant names, are rarely transmitted to the sponsor. However, such information may be included in free-form text fields.

Many manufacturers simply remove free-form text fields. This may result in some key information being removed (such as a narrative about why a participant withdrew from a study). There is a general belief that text fields cannot be easily de-identified. That is not necessarily the case. There are good automated techniques that will detect direct and quasi-identifiers in free form text.5,13 To the extent that text information is useful for the analysis of trial data, this data can be de-identified as well.

One challenge of free-form text information is when it is inserted in the incorrect fields. For example, a field that may expect dosage information may have participant identifying information in it. With modern electronic data entry systems, validation is typically performed at the point of data entry for many fields, but some are still left open. Furthermore, such validation may not have been in place for older trials with less sophisticated data entry systems. This would not be anticipated a priori by the de-identification team and may result in identifying information being exposed.

It is, therefore, necessary to run a PHI discovery tool on the data set to identify the fields and records that may potentially contain PHI. A discovery tool processes all of the text and flags records or documents that may contain PHI by using information extraction techniques.14 Once the discovery phase is completed, then these rogue text comments can be redacted.
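The discovery step can be illustrated with a deliberately simple Python sketch that scans every text field of every record and flags those matching a few patterns. Real discovery tools rely on much richer information extraction techniques; the regular expressions, field names, and function name below are illustrative assumptions only and would miss many kinds of identifying text, such as names.

```python
import re

# Hypothetical, intentionally narrow patterns for the sketch.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def flag_possible_phi(rows, text_fields):
    """Return (row index, field name, pattern label) for every suspicious hit."""
    flagged = []
    for idx, row in enumerate(rows):
        for field in text_fields:
            text = str(row.get(field, ""))
            for label, pattern in PATTERNS.items():
                if pattern.search(text):
                    flagged.append((idx, field, label))
    return flagged


rows = [
    {"dose": "10 mg", "comment": "tolerated well"},
    {"dose": "pt asked to be called at 555-123-4567", "comment": ""},
]
for hit in flag_possible_phi(rows, ["dose", "comment"]):
    print(hit)
```

The flagged rows would then go to review or redaction; the scan only narrows down where to look.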

Unique participant identifiers

All unique participant identifiers, such as participant IDs, need to be pseudonymized.11 We will explain the reasoning behind this below.



Unique identifiers are sometimes made up of a site number, patient initials, and patient date of birth. All of these elements can increase the probability of re-identification. Even if that is not the case, at least in the US, there is a regulatory issue to consider as well, which clarifies the status of a record identifier and the ability of a covered entity (e.g., the hospital providing the data) to assign an identity to it. The HIPAA Privacy Rule (45 CFR 164.514(c)) states:

A covered entity may assign a code or other means of record identification to allow information deidentified under this section to be reidentified by the covered entity, provided that:

(1) Derivation. The code or other means of record identification is not derived from or related to information about the individual and is not otherwise capable of being translated so as to identify the individual; and

(2) Security. The covered entity does not use or disclose the code or other means of record identification for any other purpose, and does not disclose the mechanism for re-identification.

A subject ID would not meet this requirement because it is clearly a code or other means of record identification that would allow re-identification by a covered entity that is “derived,” and disclosure of a subject ID by itself reveals the method of re-identification (i.e., match against the medical records system).

If that code was encrypted, then it would still be considered derived. However, the HHS guidance10 clarifies item (c)(1) by stating that a code can be a cryptographic pseudonym provided that the keys associated with such functions are not disclosed, including to the recipients of the de-identified information. But unless a subject ID is pseudonymized using a cryptographic function then this clarification would not be applicable.

Furthermore, a subject ID is used quite extensively in the context of site operations. The subject ID will show up on case report forms (CRFs) and all other trial documentation, including adverse event reports and billing records. The covered entity will clearly use and disclose the subject ID for other purposes quite extensively.

Therefore, it should always be the case that a subject ID is treated as a direct identifier and pseudonymized.
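One way to produce a keyed cryptographic pseudonym is sketched below in Python using HMAC-SHA256 from the standard library. The choice of HMAC, the truncation length, and the function name are assumptions made for illustration, not a method prescribed by the guidance or the standards cited above. The key must stay with the data custodian and must never be shared with data recipients.

```python
import hashlib
import hmac


def pseudonymize(subject_id: str, secret_key: bytes, length: int = 16) -> str:
    """Map a subject ID to a fixed-length hexadecimal pseudonym.

    The same ID and key always give the same pseudonym, so records that
    belong to one participant can still be linked after release.
    """
    digest = hmac.new(secret_key, subject_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Hypothetical key; in practice it would come from a secured key store.
key = b"replace-with-a-long-random-secret"
print(pseudonymize("SITE012-0045", key))
```

Using the same key for a sub-study yields the same pseudonym for the same participant, while a different key yields unlinkable pseudonyms.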

Another type of unique identifier would be an ID automatically generated by a database that is used only to track records within the database. While this is still a unique identifier it is not used outside the database. The fact that this ID is only used within the database means that the only way to re-identify individuals is if an adversary already has access to the database. Therefore, unlike subject IDs, these types of machine IDs would not necessarily be considered direct identifiers since their access and use are quite different from subject IDs in a trial.

Unique identifiers that are pseudonymized should have a uniform distribution (e.g., one value per participant). Otherwise they are susceptible to a frequency analysis. This is elaborated upon further ahead.

Where the same patient is included in a sub-study, then the same pseudonym can be used in the sub-study as well.

Provider details

Is it necessary to remove all of the personal information (the direct identifiers) about the clinicians and site staff who took part in the trial? This information will typically be the name and contact information of these clinicians and trial staff. That information may be in the structured data or in the CSRs. The answer will depend on the jurisdiction.

In the U.S., under the HIPAA Privacy Rule, it is information about the participants, or about their relatives, employers, or household members, that needs to be considered when de-identifying a data set. In this case, it is not necessary to remove the provider information. However, HIPAA also has the minimum necessary requirement, which states that protected health information should not be used or disclosed when it is not necessary to satisfy a particular purpose or carry out a function. If the provider information is not going to serve any purpose when disclosed, then it should not be disclosed.



One should also be cognizant that, even if not required by regulation, if identifiable data about a provider is shared, there is the risk that the provider may initiate litigation on an invasion-of-privacy claim. Consequently, it is advisable not to share provider information, especially if it does not serve a purpose.

In Canada, the issue of disclosing personal information about providers has come up in the context of the disclosure of prescription data. It has been determined that in this case the disclosure of prescriber identity is not considered a disclosure of personal information and is permitted.15

Unlike in Canada, in the EU prescriber information is considered personal information about the provider, even if the patient is de-identified.16 The Article 29 Working Party states that “providing information about prescriptions written by identified or identifiable doctors […] constitutes a communication of personal data to third-party recipients in the meaning of the Directive.”16 Arguably then, the same logic applies to providers and site staff in clinical trials.

Therefore, given that clinical trials data will be coming from multiple jurisdictions, it will be more prudent to actively remove provider and site staff personal information from the data.

Removing provider information has another advantage: Provider information can reveal where a participant lives. Since it is easy to determine the practice location for providers if their identity is known, this can lead to predicting the residence location of patients.5

Modifying quasi-identifiers

One way to manipulate the probability of re-identification is to change the granularity of the quasi-identifiers. For example, a date of birth can be changed to a year of birth. Or an exact income, for example, may be changed to an income range. This is a form of generalization.
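The two generalizations mentioned above can be sketched in Python as follows. The function names and the width of the income band are hypothetical choices for illustration; the appropriate granularity would come out of the risk assessment.

```python
from datetime import date


def generalize_birth_date(dob: date) -> int:
    """Replace a full date of birth with the year of birth."""
    return dob.year


def generalize_income(income: float, band_width: int = 25000) -> str:
    """Replace an exact income with the band it falls into."""
    lower = int(income // band_width) * band_width
    return f"{lower}-{lower + band_width - 1}"


print(generalize_birth_date(date(1970, 5, 17)))
print(generalize_income(61000))
```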

It may also be necessary to suppress specific values in the data. For example, there may be something about a trial participant’s demographics or medical history that makes them an outlier. That particular piece of information may be suppressed in the shared data set so that they are no longer such an outlier. Alternatively, whole patients may be removed, although, because the number of participants in many trials is not large, the removal of participants would have a nontrivial impact on data utility. Suppression may also be vertical, where a whole field is suppressed and not shared at all.
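The following Python sketch shows both forms of suppression on a list of records: blanking out values that occur too rarely, and dropping a whole field. The rarity rule (a simple count threshold on one field), the function names, and the sample data are assumptions for illustration; in practice the choice of what to suppress would be driven by the measured re-identification risk.

```python
from collections import Counter


def suppress_rare_values(records, field, min_count):
    """Blank out values of `field` that occur fewer than `min_count` times."""
    counts = Counter(r[field] for r in records)
    return [
        {**r, field: None} if counts[r[field]] < min_count else dict(r)
        for r in records
    ]


def drop_field(records, field):
    """Vertical suppression: remove `field` from every record."""
    return [{k: v for k, v in r.items() if k != field} for r in records]


records = [
    {"pid": "a1", "diagnosis": "asthma", "occupation": "teacher"},
    {"pid": "b2", "diagnosis": "asthma", "occupation": "nurse"},
    {"pid": "c3", "diagnosis": "rare condition", "occupation": "pilot"},
]
shared = drop_field(suppress_rare_values(records, "diagnosis", 2), "occupation")
for row in shared:
    print(row)
```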

Other disclosure control techniques, such as sub-sampling, would generally not be applicable for clinical trials data because the data sets are already small. A sub-sample would affect statistical power and possibly affect the conclusions drawn from the study.

The amount of modification that needs to be applied to the quasi-identifiers depends on the context of the data release (see Figure 1). This context-based approach is consistent with guidelines from regulators.10,17,18 Context consists of characterizing:

  • The data recipient (the security and privacy controls that they have in place, as well as their motives and financial resources to re-identify the data).

  • The data (the sensitivity of the data, the potential harm to the participants if there is a re-identification, and the extent to which the participants consented or have been informed about the data sharing when the data was collected or since then).

  • The contract (the existence and strength of the contractual clauses limiting re-identification attempts).

Figure 1: Elements of the context of a data release.

Once the amount of de-identification that needs to be applied has been determined, specific generalization and suppression techniques can be applied on the data. Beyond that, there are some factors that are somewhat unique to clinical trials data sets that add complexity but also introduce opportunities to create high utility data sets.




Dates are among the pieces of information that increase the probability of re-identification significantly. In many instances the exact dates are not critical for analyses; rather, the time since enrollment or randomization in a study is sufficient. For example, it is important to know how many days after visit 2 a participant experienced an adverse reaction. This can be achieved by using the enrollment date as an anchor and computing all dates as intervals since then. This is illustrated in Table 2.

This approach works well if the recruitment period is spread out over a long interval. If recruitment is happening over a short interval, then it is easy to determine the anchor date for many participants. For example, this is typically the case for vaccine trials.19 Also, if the recruitment period for a trial site or a country can be determined from public sources (e.g., trial registration websites), then it would also be possible to determine the anchor date or approximate it. If an adversary can determine the anchor date, then the anchoring scheme is not going to protect privacy. For example, if it is known that all recruitment occurred in the second half of January and that the first visit takes place 21 days after enrollment, then the first visit date can only be between Feb. 5 and Feb. 21. Whether this roughly two-week range for the first visit date is acceptable or not needs to be assessed more thoroughly. But the point is that trying to protect the first visit date is constrained by the known recruitment interval.
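The anchoring scheme amounts to a date subtraction per event, as in this short Python sketch. The function name, event labels, and dates are hypothetical.

```python
from datetime import date


def to_study_days(enrollment: date, event_dates: dict) -> dict:
    """Express each event as the number of days since the enrollment date."""
    return {name: (d - enrollment).days for name, d in event_dates.items()}


events = {
    "visit_1": date(2014, 2, 10),
    "visit_2": date(2014, 3, 3),
    "adverse_reaction": date(2014, 3, 8),
}
days = to_study_days(date(2014, 1, 20), events)
print(days)
# Interval between visit 2 and the adverse reaction is preserved.
print(days["adverse_reaction"] - days["visit_2"])
```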

Another point to note with dates is that the visits tend to follow a well-defined pattern. This means that for the majority of participants, their interval between visits tends to be the same (or varies a little around the value in the protocol) and predictable from the protocol. Therefore, manipulating or perturbing inter-visit intervals is futile because they can be easily reverse-engineered.

If it is desirable to have dates instead of intervals, then an actual anchor date can be generated randomly during the recruitment period. This is illustrated in Table 3 for the second half of January recruitment period. The first visit is then 21 days from the randomized anchor date, and so on, ensuring that intervals are still accurate.

Another option is to have an anchor that is set as the start of the recruitment period for the whole trial. All participants get exactly the same anchor (instead of a random anchor within the recruitment period). This is illustrated in Table 4.
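Both ways of turning intervals back into dates can be sketched in Python as a uniform shift of a participant's dates. The function names, the uniform random draw, and the sample dates are assumptions for illustration; the recruitment period follows the second half of January example used above.

```python
import random
from datetime import date, timedelta


def shift_to_random_anchor(enrollment, event_dates, period_start, period_end, rng):
    """Move all of a participant's dates so that enrollment falls on a
    randomly drawn day within the recruitment period."""
    span = (period_end - period_start).days
    new_anchor = period_start + timedelta(days=rng.randint(0, span))
    offset = new_anchor - enrollment
    return new_anchor, {n: d + offset for n, d in event_dates.items()}


def shift_to_common_anchor(enrollment, event_dates, period_start):
    """Move all dates so that every participant's enrollment falls on the
    first day of the recruitment period."""
    offset = period_start - enrollment
    return period_start, {n: d + offset for n, d in event_dates.items()}


rng = random.Random()  # seed it only if reproducible output is required
start, end = date(2014, 1, 15), date(2014, 1, 31)
events = {"visit_1": date(2014, 2, 10)}

anchor, shifted = shift_to_random_anchor(date(2014, 1, 20), events, start, end, rng)
print(anchor, shifted, (shifted["visit_1"] - anchor).days)

anchor, shifted = shift_to_common_anchor(date(2014, 1, 20), events, start)
print(anchor, shifted)
```

Because every date for a participant moves by the same offset, the intervals between events are unchanged in both variants.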



One of the decisions that needs to be made is the level of granularity of the geographic information that should be released. Geography can be at the site level, or generalized to city, country, or region/continent. The appropriate level of geography should be reduced if the risk of re-identification of participants is too high.

An important consideration of geography is to ensure that an adversary cannot infer more detail than intended from other information. Ahead we discuss a number of inference scenarios using role information and site frequency information.

Correlations & inferences

It should be noted that fields in a clinical trials data set and their values may be correlated. For example, consider a situation in which it was decided to remove the country of the sites from the data set because that information was considered to increase the probability of re-identification. Some countries do not allow the collection of race information, however. Therefore, by examining the pattern of missingness of the race data it may be possible to determine the country and recover the information that was to be hidden. Other examples of correlations are between drugs and diagnoses in concomitant medications and illnesses: if one of these elements is suppressed, it may be easy to recover the suppressed value from the other.

Some studies have information about the role of a provider or site staffer. For example, a role may be a specialty or someone who can operate a specific machine or device. If that role is unique or concentrated in a specific site within the geography, then it may be possible to infer the exact location of the facility. This may leak more information about geography than intended.

A careful analysis of the data to understand possible correlation and inference channels needs to be performed and the correlations need to be addressed. In the race and country example, one solution would be to remove race from all participants.
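One simple check for the race and country example is to compute, for each site pseudonym, the fraction of records in which race is missing. The Python sketch below does this; the field names and function name are hypothetical, and a missingness table is only one of several ways to look for inference channels.

```python
def missingness_by_group(records, group_field, target_field):
    """Fraction of records per group in which `target_field` is missing (None)."""
    totals, missing = {}, {}
    for r in records:
        g = r[group_field]
        totals[g] = totals.get(g, 0) + 1
        if r.get(target_field) is None:
            missing[g] = missing.get(g, 0) + 1
    return {g: missing.get(g, 0) / totals[g] for g in totals}


records = [
    {"site": "S1", "race": None},
    {"site": "S1", "race": None},
    {"site": "S2", "race": "Asian"},
    {"site": "S2", "race": None},
]
print(missingness_by_group(records, "site", "race"))
```

A site where race is missing for every participant stands out from sites with occasional gaps, and that is the pattern that could point to a country where race is not collected.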



Trial site information

One of the questions that needs to be answered is whether to include trial site names and, if not, whether to replace them with a pseudonym for each trial site or just indicate the region of the site. In the latter case, multiple trial sites may be grouped into the same region.

Trial registration websites, trial recruitment websites, and published articles reveal a considerable amount of detail about a clinical trial, often including the clinical sites that are recruiting in each country. Also, the highest recruiting sites and the lowest recruiting sites will generally be known, or can be discovered with basic effort. Therefore, if an attempt is made to hide the site information through pseudonymization, a frequency analysis of the number of participants at each site could reveal the identity of the most and least recruiting sites: they will be the pseudonyms with the highest and lowest participant counts, respectively. To the extent that knowing the actual site can increase the probability of re-identification of individual participants through geoproxy attacks on a data set,5 the elevated re-identification risk from a frequency analysis needs to be accounted for.

Therefore, in general, if any site pseudonym will be used, this should be accompanied by a frequency analysis. If the distribution is uniform or close to being so, then there should be no problem in releasing a site pseudonym. However, if the distribution is heavily skewed, then site information needs to be considered more carefully in the re-identification risk assessment. In practice, these distributions are likely to be heavily skewed.
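A frequency analysis of this kind can be sketched in Python as a count of participants per site pseudonym, together with a check for whether a single pseudonym has the highest or the lowest count. The function name, the returned fields, and the use of the maximum-to-minimum ratio as a rough indicator of skew are assumptions for illustration.

```python
from collections import Counter


def site_frequency_report(site_pseudonyms):
    """Summarize how participants are distributed across site pseudonyms."""
    counts = Counter(site_pseudonyms)
    highest, lowest = max(counts.values()), min(counts.values())
    top = sorted(s for s, c in counts.items() if c == highest)
    bottom = sorted(s for s, c in counts.items() if c == lowest)
    several_sites = len(counts) > 1
    return {
        "counts": dict(counts),
        "max_to_min_ratio": highest / lowest,
        # A single pseudonym with the top (or bottom) count could be matched
        # to the publicly known highest (or lowest) recruiting site.
        "exposed_top_site": top[0] if several_sites and len(top) == 1 else None,
        "exposed_bottom_site": bottom[0] if several_sites and len(bottom) == 1 else None,
    }


pseudonyms = ["S1"] * 40 + ["S2"] * 12 + ["S3"] * 11 + ["S4"] * 2
print(site_frequency_report(pseudonyms))
```

The same function can be applied to the number of sites per country pseudonym by passing one country pseudonym per site.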

A similar situation arises with country and number of trial sites. If a country name is replaced by a country indicator or pseudonym, a frequency analysis of the number of sites per country could reveal the country with the most sites.

Adverse events

Serious adverse events, for example, death, hospitalization, stroke, congenital anomalies, and disability, need to be addressed. Such events are knowable by an adversary and, therefore, would normally be treated as quasi-identifiers.

Other types of non-serious adverse events would be ignored from a de-identification perspective. For example, an adversary would not know if a participant had nausea or a headache during a trial. But a serious event such as a death would be knowable. Other notable events such as pregnancy or a birth would also be considered as potentially identifying.

Clinician adversaries

One of the threats that would be considered when de-identifying a data set is whether a data analyst could inadvertently recognize someone they know. This type of “spontaneous” re-identification may reveal additional clinical information about the participant that the analyst did not know. This risk is particularly high if the data analyst is a clinician and the disease is a rare one. As noted earlier, this type of risk exists for both mechanisms of data sharing (microdata and portal).

In such a case, there is a higher likelihood that the analyst (clinician) would recognize one of the trial participants. The participant may have been one of the clinician’s patients or a case that the clinician had been consulted about, or even a known case within the community of practice.

This type of risk can be directly and quantitatively modeled.4 If it is high, then additional precautions may need to be put in place, ranging from data perturbation to stronger contractual controls and training of data recipients.

It can be argued that clinicians are obliged to keep patient information confidential. Therefore, if they see identifiable participant information, then they are still bound by the same confidentiality requirements as for their regular patients. However, the analyst may be someone on the clinician’s research staff and not the actual clinician, and the participant may not want the clinician or their analysts to know all of the information that was collected in the trial, or that they even participated in a trial. For example, it is well known that patients hide sensitive information from their physicians at somewhat high rates.20



Rare diseases

There is a concern that data sets from clinical trials on rare diseases would be impossible or difficult to de-identify. This is not necessarily the case. There are a number of factors that need to be considered. First, the risk can be managed by imposing additional controls on the data release, as noted earlier, and these reduce the amount of data perturbation that is needed. Second, if the trial is performed on a sample of patients with that disease, then the sampling itself reduces the risk because it introduces uncertainty about who is in the data set. Finally, if the rare disease is not visible and would not generally be knowable by an adversary, then that can reduce the probability of re-identification considerably as well (because an adversary would not likely have background information that persons they know have that disease).

Therefore, it should not be taken at face value that a defensible de-identification cannot be performed simply because a trial is on a rare disease. The facts related to a specific data set will matter.

Deceased participants

In some clinical trials participants are not likely to survive for many years after the end of the study. For example, this would be the case for some oncology trials. Does that mean it is possible to share their personal health information after they are deceased?

This will depend on the jurisdiction. For example, in the U.S., information on the deceased is still considered personal health information for 50 years after death under the HIPAA Privacy Rule, and in general clinical sites will be covered by HIPAA.21

In the EU, the Data Protection Directive 95/46/EC applies to “natural persons,” and data on the deceased is no longer about natural persons. However, the Article 29 Working Party has noted that data on the deceased may still be personal information because it may indicate familial diseases relevant to living children or siblings, or may have other country-specific restrictions on it.16

The manufacturer needs to be cognizant that for global data from multiple jurisdictions that may fall under different regulations, the lowest common denominator would probably be the safest assumption to make (i.e., the most restrictive). Therefore, it would be prudent to treat all participants the same, whether they are deceased or not.

Sensitive data

Sensitive data would be information such as indications that patients have HIV or have substance abuse problems. Sensitivity of data is different from de-identification, and sometimes they are confused.

If a data set is properly de-identified, then participants with sensitive data should have a low probability of re-identification. In deciding how much to de-identify a data set, more de-identification is typically applied when there is sensitive information.4

Sensitive information may be removed completely if it is orthogonal to the main analysis of the trial data. Sometimes sponsors will also remove all sensitive information out of an abundance of caution.

The American Health Lawyers Association has summarized state-specific health laws.22 A union of the types of data covered by these laws is an empirical indication of which types of data are considered sufficiently sensitive that state specific laws were needed. That can serve as a starting point for defining sensitive information.

Risk measurement

To decide how much generalization to apply to the quasi-identifiers, it is necessary to measure the risk of re-identification. Specific metrics for clinical trials data have been described elsewhere.3 Within the scope of this article it is not possible to provide details about metrics, but it is important to note that risk measurement is a well-developed field. Furthermore, thresholds for what would be considered acceptable risk have been documented based on precedents.3

It should be noted that risk is measured on the population rather than the specific clinical trial. For example, if a trial has 500 diabetic patients and two of them are 50-year-old females, the risk is based on the number of 50-year-old female diabetics in the population and not on those two individuals. The population will depend on the geography that is released (for example, country versus continent).
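
The difference between trial-based and population-based measurement can be made concrete with a small Python sketch. It assumes one simple way of expressing risk, namely the reciprocal of the number of people who share the same quasi-identifier values. That metric, the function name `reidentification_risk`, and the population count of 4,000 are assumptions for illustration and are not taken from the metrics referenced above.

```python
def reidentification_risk(matching_people):
    """Chance of correctly matching one record, assuming an adversary
    picks uniformly among everyone who shares the record's
    quasi-identifier values (an illustrative assumption)."""
    if matching_people < 1:
        raise ValueError("at least one matching person must exist")
    return 1.0 / matching_people


# The two 50-year-old female diabetics in the 500-patient trial.
trial_matches = 2

# Hypothetical count of 50-year-old female diabetics in the population
# covered by the released geography (for example, one country).
population_matches = 4000

naive_risk = reidentification_risk(trial_matches)
population_risk = reidentification_risk(population_matches)
```

Under these assumptions, counting only the trial would put the risk for each of the two participants at one in two, whereas counting the population puts it at one in 4,000. Releasing a coarser geography enlarges the population and lowers the figure further.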



Moving forward

Now that specific considerations for dealing with direct and quasi-identifiers have been covered, in this section we will address more general and implementation issues with de-identification.

Temporal validity of de-identification

There is often a concern about sharing data that may become more easily identifiable in the future. This is a future with more powerful computers and more publicly available data that can be used to attack databases more easily. Notwithstanding that this future has been predicted for many years now and has not quite arrived yet, it would be prudent to limit the validity of a de-identification to, say, 18 to 24 months. After that period, a new risk assessment would be performed to ensure that the original de-identification assumptions have not changed. If they have changed, then a new data set would be provided to the investigators who received the original data.

Scaling de-identification

Scaling de-identification will be necessary as demand for clinical trials data increases. Using manual de-identification schemes that are customized for each clinical trials data set will be challenging in the long run. This means it is important for scalability to develop de-identification standards and automation. However, the automation needs to support the risk analysis that must be performed upfront to understand the data set and determine the appropriate risk thresholds and how to perturb the values.

A recent economic analysis concluded that if a manufacturer will be releasing four or more data sets a year, then it is more economical to automate the de-identification process.23 Automation would minimize the steps that are needed to prepare and de-identify each data set that will be shared.

As of the time of writing, the best data we have suggest that there are at least 1,200 clinical studies available on one large data sharing site, and there have been 58 requests for clinical trials data sets from some of the largest manufacturers.24 The expectation is that this number will grow over time, and the number of requests per manufacturer will also increase.

It is also going to be the case that demand for data will rise as data availability rises, and as the process for getting access to that data is further simplified. It is still in the early days, but it should be expected that manufacturer investments in sharing data and scaling that capability will be moderated by the demand for that data from the academic and other analyst communities. We need the same energy and advocacy resources to be redirected towards mobilizing data use and data users now, otherwise the gains in clinical trials data sharing may get diluted over time.

De-identification standards for clinical trials data

There are no generally accepted de-identification standards for clinical trials data yet, although the recent publication of the Institute of Medicine report on trial data sharing establishes a strong reference point.3 The manufacturers who have been leaders in clinical trials transparency have developed their own methods, which they have shared openly.25 Some of these early methods, however, are not based on risk measurement and do not explicitly account for the data release context. They can best be characterized as a set of heuristics. Such heuristics will not necessarily be defensible if they are challenged, and may be perturbing the data more than is necessary. More sophisticated measurement-based methods have been applied for specific clinical trial data sets,26 and these need to be expanded into general processes that can be applied to any clinical trial.

This creates a need for more sophisticated measurement-based standards that are also more defensible. At this point it is not likely that regulators will develop specific standards themselves. Regulators may promote basic principles of de-identification, and point to existing standards. The standards themselves will have to be developed by academics and industry.



In addition to promoting best practices for de-identifying data, facilitating the development of a community of practice around de-identification, and helping scale de-identification through automation, standards will allow the pooling of de-identified data. If each manufacturer develops their own standard for de-identification, it will be challenging to pool data from multiple trials in the same therapeutic area. In fact, pooling of differentially de-identified data may bias analysis results.

The impact of standards, even de facto ones, is important because once they take hold and investments are made to implement them, they become difficult to change. The clinical trials data community needs to bring together the best available expertise in disclosure control and trials data analysis to develop de-identification methods that leverage best practices.

Policies and procedures

For sponsors to implement scalable de-identification in the enterprise, they need to develop policies and procedures to be consistent across trials and to facilitate training and automation. Below is a starting list of the types of topics that need to be covered:

  • Definitions of direct and quasi-identifiers for the types of trials that are being shared.

  • Definition of sensitive data and how these are dealt with.

  • Procedures for removing and pseudonymizing direct identifiers.

  • Risk metrics and thresholds.

  • Templates for the documentation that will be released with the parameters describing the de-identification that was performed.

  • Contract and/or terms-of-use templates.

  • Temporal validity of de-identification and procedures for re-analysis.

In addition to these, there will be a need for operational guidance (i.e., matters like where to get the data from, and which tools to use).


There are important reasons to share clinical trials data:27

  • Data sharing is almost a moral imperative to research subjects.

  • Collaboration is crucial to overcome the increasing complexity of science.

Furthermore, technology is evolving fast and providing new tools (both hardware and software) to generate scientific advances with “Big Data.” But Big Data needs access to data. This is an opportunity that should not be missed.

In this article we have provided an overview of de-identification practices and critical issues for sharing trials data. There are significant opportunities to accelerate trials data sharing and use, and to do so in a responsible way. As clinical trials data sharing becomes the norm, advancing and streamlining the processes for de-identification is mandatory.



Khaled El Emam, PhD, is Associate Professor & Canada Research Chair, University of Ottawa, Senior Investigator, Children’s Hospital of Eastern Ontario Research Institute, and CEO, Privacy Analytics Inc.; Kald Abdallah, MD, PhD, is Chief Project Data Sphere Officer, Project Data Sphere LLC.

* The authors wish to thank Luk Arbuckle and Frank Petavy for providing feedback on an earlier version of this paper.



  1. Ebrahim S, Sohani ZN, Montoya L, et al., “Reanalyses of randomized clinical trial data,” JAMA, vol. 312, no. 10, pp. 1024–1032, Sep. 2014.
  2. European Medicines Agency, “European Medicines Agency policy on publication of data for medicinal products for human use,” Oct. 2014.
  3. Institute of Medicine, “Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk,” 2015.
  4. Khaled El Emam, Guide to the De-Identification of Personal Health Information. CRC Press (Auerbach), 2013.
  5. K. El Emam and L. Arbuckle, Anonymizing Health Data: Case Studies and Methods to Get You Started. O’Reilly, 2013.
  6. Khaled El Emam, “Towards standards for anonymizing clinical trials data,” BMJ Blogs, 06-Dec-2014.
  7. K. El Emam, Risky Business: Sharing Health Data while Protecting Privacy. Trafford, 2013.
  8. Christakis DA and Zimmerman FJ, “Rethinking reanalysis,” JAMA, vol. 310, no. 23, pp. 2499–2500, Dec. 2013.
  9. PhUSE De-Identification Working Group, “De-Identification Standards for CDISC SDTM 3.2,” 2015.
  10. Department of Health and Human Services, “Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule,” Department of Health and Human Services, 2012.
  11. “Health informatics. Pseudonymization,” ISO, International Standard ISO/TS 25237:2008, 2008.
  12. K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, “A Systematic Review of Re-identification Attacks on Health Data,” PLoS ONE, vol. 6, no. 12, 2011.
  13. S. M. Meystre, F. J. Friedlin, B. R. South, S. Shen, and M. H. Samore, “Automatic de-identification of textual documents in the electronic health record: a review of recent research,” BMC Medical Research Methodology, vol. 10, no. 1, p. 70, Aug. 2010.
  14. C. J. van Rijsbergen, Information Retrieval. Butterworths, 1979.
  15. P. Kosseim and K. El Emam, “Privacy Interests in Prescription Data, Part I: Prescriber Privacy,” IEEE Security & Privacy, vol. 7, no. 1, pp. 72–76, Jan. 2009.
  16. Article 29 Data Protection Working Party, “Opinion 4/2007 on the concept of personal data,” WP136, Jun. 2007.
  17. Health System Use Technical Advisory Committee and the Data De-Identification Working Group, “‘Best Practice’ Guidelines for Managing the Disclosure of De-Identified Health Information,” Canadian Institute for Health Information, 2010.
  18. Information Commissioner’s Office, “Anonymisation: managing data protection risk code of practice,” Information Commissioner’s Office, 2012.
  19. A. Sarpatwari, A. S. Kesselheim, B. A. Malin, J. J. Gagne, and S. Schneeweiss, “Ensuring Patient Privacy in Data Sharing for Postapproval Research,” New England Journal of Medicine, vol. 371, no. 17, pp. 1644–1649, Oct. 2014.
  20. B. A. Malin, K. El Emam, and C. M. O’Keefe, “Biomedical data privacy: problems, perspectives, and recent advances,” J Am Med Inform Assoc, vol. 20, no. 1, pp. 2–6, Jan. 2013.
  21. J. Kulynych, “HIPAA Compliance in Clinical Trials,” JOP, vol. 4, no. 1, pp. 9–10, Jan. 2008.
  22. American Health Lawyers Association, “State Healthcare Privacy Law Survey,” 2013.
  23. Khaled El Emam, “Economic comparison of options to anonymize clinical trials data,” Risky Business Magazine, Oct-2014.
  24. B. L. Strom, M. Buyse, J. Hughes, and B. M. Knoppers, “Data Sharing, Year 1 - Access to Data from Industry-Sponsored Clinical Trials,” New England Journal of Medicine, vol. 371, pp. 2052–2054, Oct. 2014.
  25. S. Hughes, K. Wells, P. McSorley, and A. Freeman, “Preparing individual patient data from clinical trials for sharing: the GlaxoSmithKline approach,” Pharmaceut. Statist., vol. 13, no. 3, pp. 179–183, May 2014.
  26. Bradley Malin, “A De-identification Strategy Used for Sharing One Data Provider’s Oncology Trials Data through the Project Data Sphere Repository,” Project Data Sphere, Jun. 2013.
  27. Hudson KL and Collins FS, “Sharing and reporting the results of clinical trials,” JAMA, Nov. 2014.