Data Equity: Representing Underrepresented Populations


Amidst difficulty in attracting underrepresented populations, industry is searching for ways to include more diverse datasets.

The clinical trials world, despite strategy after plan after project, is barely making headway in attracting diverse, underrepresented patient populations.1 Overall, African Americans, Asians, Latinx, and the elderly are still saying, ‘not me.’

Trial sponsors and investigators know well why they hear that ‘not me’: Long-embedded distrust of the medical system; refusal to be randomized to a control arm; no convenient way to get to the trial site; and for some groups, even fears of technology and needles.2 On top of these, growing trial size and complexity are aggravating clinical trial costs and trial enrollment times.

Some sponsors are trying a new method to ensure they have underrepresented populations in their trials. Instead of warm bodies, they are pursuing digitized bodies, those patients whose health information lives in prior clinical trials, EHRs, claims data, prescriptions, urgent care locations, and so on. They are turning to data companies to create their control arms by finding digitized patients to match real ones in the treatment arm.
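To make the matching idea concrete, here is a minimal sketch of one way it could work: greedy one-to-one matching on sex, diagnosis, and age. The Patient fields, the five-year age window, and the sample records are all hypothetical, chosen only for illustration; the companies in this article did not describe their algorithms, and real external-control work typically relies on more elaborate statistical methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    # Hypothetical, de-identified fields; real records carry far more detail.
    patient_id: str
    age: int
    sex: str
    diagnosis: str


def match_controls(treated, pool, max_age_gap=5):
    """Greedy 1:1 matching: same sex and diagnosis, closest age within max_age_gap."""
    available = list(pool)
    matches = {}
    for t in treated:
        candidates = [
            c for c in available
            if c.sex == t.sex
            and c.diagnosis == t.diagnosis
            and abs(c.age - t.age) <= max_age_gap
        ]
        if not candidates:
            matches[t.patient_id] = None  # no acceptable real-world match found
            continue
        best = min(candidates, key=lambda c: abs(c.age - t.age))
        matches[t.patient_id] = best.patient_id
        available.remove(best)  # each real-world record is used at most once
    return matches


# Made-up treatment-arm patients and a made-up pool of real-world records.
treated = [
    Patient("T1", 64, "F", "CKD"),
    Patient("T2", 58, "M", "CKD"),
    Patient("T3", 71, "F", "T2D"),
]
pool = [
    Patient("R1", 66, "F", "CKD"),
    Patient("R2", 63, "F", "CKD"),
    Patient("R3", 50, "M", "CKD"),
    Patient("R4", 59, "M", "CKD"),
]

matches = match_controls(treated, pool)
print(matches)
```

In this toy example the third treated patient finds no match at all, which is the problem in miniature: if a population is thin in the source data, no matching rule can conjure it.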

Henriette Coetzer, MD

There is another method for creating a control arm, dubbed the digital twin concept: Statisticians take one patient’s age and another’s zip code and somebody else’s diagnosis, all because someone in the treatment arm has that age, zip, and diagnosis. Henriette Coetzer, MD, chief medical officer, recruitment and real-world evidence, CVS Health Clinical Trial Services, said this concept needs to be validated before it can be used in a regulatory-standard trial. Others interviewed for this article said they do not use twins either.

Real-world data, on the other hand, come from records of real patients and are not adjusted or created, said C.K. Wang, MD, chief medical officer, COTA.

But how to refer to these control arms? In this relatively new field, terminology is not universal—to the point that for this article, different experts used different terminology to describe the same thing. The word that seems to be gumming up the lexical works is synthetic.

CVS, in correspondence, defined a control arm using synthetic information as real data collected outside of the clinical trial system to match participants in the treatment arm. Medidata describes its synthetic control arms as formed by carefully selecting patients from historical clinical trials to match the demographic and disease characteristics of the patients treated with the new investigational product.3

Thorlund et al4 describe synthetic data like this: Synthetic controls are defined as cohorts of patients from external data and adjusted using any of a variety of statistical methodologies.

C.K. Wang, MD

“The term synthetic data now refers to an entirely new group of [data] that is made up based on algorithms and do[es] not correspond to an actual patient,” said Wang, in correspondence.

For the purposes of this article, an external control arm contains the data of real, de-identified patients.

While the gains to sponsors are obvious, time and money likely saved, so are the possible, maybe probable, drawbacks. Apple-to-apple comparisons (McIntosh to McIntosh) might be more like McIntosh to Cortland.

Furthermore, historical medical data are not representative of all populations, and can be incomplete. The human expertise required to adjust data that have been rooted in systemic, structural, and cultural bias might not be around, or exist, at the moment of adjustment.

Alex John London, PhD

The main concern, said Alex John London, PhD, Clara L. West Professor of Ethics and Philosophy and director of the Center for Ethics and Policy at Carnegie Mellon University, is that “the problems that stakeholders are often trying to overcome through the use of AI or other computational models are the result of problems that operate at a larger social level.”

Alexa Berk King, PhD, chief scientific officer, real-world evidence, CVS Health Clinical Trial Services, said FDA’s opinion, overall, has been clear. While synthetic control arms “are appealing and compelling, there is a lot of methodological hesitancy [regarding] what happens on the front and back end” of real-world data collection and analysis.

Users and creators of synthetic data have different business models. Some must buy the data; others, like CVS Health and Optum Insights, do not. CVS de-identifies its own digital diamond mine: retail pharmacy, insurance information, lab results, MinuteClinic visits, “and all of the data elements that go alongside that,” said Coetzer. Optum Insights issues licenses to its users. CVS provides analytic services for its customers.

Medidata AI creates its arms from a pool of 30,000 clinical trials representing nine million people. COTA accesses healthcare systems and providers for clients’ strictly oncology-focused external control arms.

What got, and keeps us, here

Wang said a trial’s diversity bar has always been low. Historically, clinical trial sponsors have listed few denominators: sex, age, ethnicity, and race.

Consider research from the African American Study of Kidney Disease and Hypertension, conducted 27 years ago, that showed African Americans with diagnosed hypertensive renal disease, a common condition among African Americans,5 were poorer and more often unemployed than their peers in the general population.6 Later work showed that genetic variants, such as those of the apolipoprotein L1 (APOL1) gene, can differ among populations.7

Recently, IQVIA detailed how Black representation in trials has decreased: in 2013, Black participation was 12.3%. In 2021, it was 6.5%. However, Hispanic representation rose 2.5%, to 9.9% in 2017.8

CDER detailed who was missing in its Drug Trial Snapshots Summary Report 2021.9 Of the 50 approved novel therapies, 11 were for heart, blood, kidney, or endocrine diseases; of those, four listed N/A under headings for various populations. Among trial participants for a medication for risk reduction of kidney and heart complications in chronic kidney disease associated with type 2 diabetes, 5% were African American.

FDA, required by the 21st Century Cures Act to come up with a diversification plan, recently issued draft guidance on the use of external control arms in a clinical trial.10 The guidance discussed the McIntosh vs Cortland issues, like differences in data collection times and possible changes in standards of care between the control arm’s original trial and the experimental treatment arm.

“(The) FDA is monitoring how the research community is exploring the use of synthetic data and will stand ready to provide regulatory clarity as needed,” said an agency spokesperson.

The agency already has approved at least one medication based on trial results that included an external control arm; so has EMA. The National Institute for Health and Care Excellence (NICE), in a review, found that of 489 applications, 22 used external data. Of these, 13 used published RCT data, and six used observational data. More than half of the applications came in the last two years.4

Pressure is also coming from elsewhere: Bioethics International plans to score pharma companies on how diverse their trials are.11

Data pools, data oceans

Overall, the number of synthetic data creators is growing. In 2021, 67 synthetic data vendors, of all types, were in business. As of October 2022, there were 100 of them, according to Elise Devaux.12

It seems like new companies are being formed every day, said Wang, adding there is an immense need for data in the healthcare ecosystem.

Examples of the pool sizes include the following:

  • MDClone, based in Israel, has 30 years of data in its data lake, everything from physician notes to genetic markers to social determinants of health, according to its website. The data come from 20 health care systems and HMOs from the US, Canada, and Israel. All told: 50 million synthetic datasets.
  • Optum, owned by UnitedHealth Group, has a real-world data set with more than 70 million linked clinical and claims lives, 150 payers, and up to 15 years of patient history. Optum data, says its website, is statistically certified as de-identified by a third party. Some data are available at the zip code level. An IRB approval allows access to patient and physician surveys and integration of the users’ data.13
  • CVS has 9,900 retail locations and 1,100 walk-in medical clinics; its PBM has 110 million plan members, along with an additional 35 million people in other programs, including Medicare Advantage and Medicare Part D prescription drug plans.

Whittling down the data

The creation of synthetic data, said London in email correspondence, can be a complicated and delicate process, so some may do a better job than others. But these efforts will only be as good as the knowledge that stakeholders bring to the table.

Alexa Berk King, PhD

Berk said the CVS data are scrutinized for missingness and outliers, like out-of-range lab values. If those lab data are outside existing parameters, she said, “It is not our place to go in and make interpretations.” If a missing piece cannot be statistically fixed, it is tossed. Berk estimated that at least 10% of standard data is not analyzable. She noted that her group does not sell, loan, or license data itself. “We use our internal teams to generate insights and evidence from our data, and those insights are what we deliver. We are not selling it out the back door.”
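A screening pass of the kind Berk describes might look something like the sketch below, which sets aside records with missing required fields or implausible lab values instead of reinterpreting them. The field names, the plausibility bounds, and the sample records are assumptions made for illustration, not CVS's actual rules, and the sketch skips the step of trying to repair gaps statistically before discarding a record.

```python
# Hypothetical required fields and plausibility bounds, for illustration only.
REQUIRED_FIELDS = ("age", "diagnosis", "creatinine_mg_dl")
PLAUSIBLE_RANGES = {"age": (0, 120), "creatinine_mg_dl": (0.1, 20.0)}


def screen(records):
    """Split records into usable ones and dropped ones, noting why each was dropped."""
    usable, dropped = [], []
    for rec in records:
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) is None]
        out_of_range = [
            f for f, (low, high) in PLAUSIBLE_RANGES.items()
            if rec.get(f) is not None and not low <= rec[f] <= high
        ]
        if missing or out_of_range:
            # Set the record aside rather than guessing what the value "should" be.
            dropped.append((rec["id"], missing, out_of_range))
        else:
            usable.append(rec)
    return usable, dropped


# Made-up records: B lacks a diagnosis, C has an implausible lab value.
records = [
    {"id": "A", "age": 61, "diagnosis": "CKD", "creatinine_mg_dl": 1.4},
    {"id": "B", "age": 47, "diagnosis": None, "creatinine_mg_dl": 0.9},
    {"id": "C", "age": 55, "diagnosis": "CKD", "creatinine_mg_dl": 250.0},
    {"id": "D", "age": 70, "diagnosis": "CKD", "creatinine_mg_dl": 2.1},
]

usable, dropped = screen(records)
share_dropped = len(dropped) / len(records)
print([r["id"] for r in usable], dropped, share_dropped)
```

Logging the reason each record was dropped, rather than silently deleting it, leaves a trail that analysts can audit later.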

Reaching the diversity goals for a synthetic control arm takes some doing, considering how few underrepresented patients appear in so many health care data streams. Wang said that depending on a clinical trial’s inclusion and exclusion criteria, it is not uncommon to start with thousands of records and end up with only hundreds that are usable. Synthesizing is tricky, he said: The cohort of real-world or synthetic patients must match very closely the relevant and critical inclusion and exclusion criteria of a trial.
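The shrinkage Wang describes can be pictured as a funnel. The sketch below applies a handful of inclusion and exclusion rules in sequence and counts the records that survive each one. The criteria, field names, and records are hypothetical oncology-style examples, not taken from any actual trial or from COTA's process.

```python
def apply_criteria(records, criteria):
    """Apply named criteria in order; return survivors and the count left after each step."""
    remaining = list(records)
    funnel = [("start", len(remaining))]
    for name, keep in criteria:
        remaining = [r for r in remaining if keep(r)]
        funnel.append((name, len(remaining)))
    return remaining, funnel


# Made-up records; a real pool would start in the thousands.
records = [
    {"id": 1, "age": 67, "stage": "IV", "prior_lines": 1, "ecog": 1},
    {"id": 2, "age": 45, "stage": "II", "prior_lines": 0, "ecog": 0},
    {"id": 3, "age": 72, "stage": "IV", "prior_lines": 3, "ecog": 1},
    {"id": 4, "age": 59, "stage": "IV", "prior_lines": 1, "ecog": 3},
    {"id": 5, "age": 17, "stage": "IV", "prior_lines": 0, "ecog": 0},
]

# Hypothetical criteria, each a label plus a rule that returns True to keep a record.
criteria = [
    ("age >= 18", lambda r: r["age"] >= 18),
    ("stage IV", lambda r: r["stage"] == "IV"),
    ("at most 2 prior lines", lambda r: r["prior_lines"] <= 2),
    ("ECOG 0-1", lambda r: r["ecog"] <= 1),
]

eligible, funnel = apply_criteria(records, criteria)
for step, count in funnel:
    print(f"{step}: {count}")
```

Counting survivors at every step shows which criterion costs the most records, and tallying the same funnel by race or ethnicity would show where underrepresented patients fall out.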

COTA uses real-world data and oversees the entire process from abstraction and processing of raw data to final delivery of a curated dataset. Clients, which include pharma and payers, can trace the data to its origins and observe any data transformations, Wang said.

Coetzer argues that overlapping the myriad digital layers of patient information will unearth those patients who can be included in a synthetic arm. Knowing the physician, the diagnosis, the codes, the procedures, medications, tests and the results, “these clusters give us a lot of information around the general specifics of the individual.” The data analysis results, she continued, will tell the team what is missing, and how that missing information should be filled. Coetzer said CVS data has been included in at least 100 publications.

Furthermore, Coetzer said CVS retail stores are so widely distributed that 85% of the US population lives within a 10-minute drive of one. That distribution, she said, gives CVS’s data universe a provider-agnostic view of its patient population and allows it to include real-world observations for validation purposes. CVS can match its synthetic control patients to the CDC’s social vulnerability index, finding patients who have lacked access to health care.

Challenges here, concerns there

As Wang said, synthesis can get tricky.

At the front end, said Berk, besides applying the exclusion and inclusion criteria, investigators must make sure that head-to-head comparisons control for any time bias. Once the trial is going, are the real-world observations from the synthetic control arm being compared to those in the treatment arm? “On the back end, we have to account for [these differences] when possible,” she said, and that is done by drawing inference from the control’s real-world data. In the real world, she said, tumor response isn’t assessed every couple of weeks and blood isn’t drawn weekly.

As for the ethics of creating these arms, those interviewed disagreed. The issue, said Wang, is not whether the use of real-world or synthetic data is ethical; it is the cost to patients and society of treatment approvals delayed by slow clinical trial accrual. In his experience, at least 50% of his cancer patients approached to participate in a clinical trial have refused if they could not be guaranteed a place in the treatment arm. “Delaying or withholding potentially effective therapy, especially in a life-threatening situation, is a huge ethical dilemma.”

Christine Bahls is a freelance writer for medical, clinical trials, and pharma information.


  1. Clark LT, Watkins L, Piña IL, et al. Increasing Diversity in Clinical Trials: Overcoming Critical Barriers. Current Problems in Cardiology. 2019;44(5).
  2. Milani SA, Swain M, Otufowora A, Cottler LB, Striley CW. Willingness to Participate in Health Research Among Community-Dwelling Middle-Aged and Older Adults: Does Race/Ethnicity Matter? J Racial Ethn Health Disparities. 2021 Jun;8(3):773-782. doi: 10.1007/s40615-020-00839-y. Epub 2020 Aug 17. PMID: 32808194; PMCID: PMC7431111.
  3. Medidata Synthetic Control Arm® Supported by the US Food and Drug Administration (FDA) for Use in Medicenna Therapeutics, Corp. Phase 3 Registrational Trial in Recurrent Glioblastoma. Oct. 28, 2020.
  4. Thorlund K, Dron L, Park JJH, Mills EJ. Synthetic and External Controls in Clinical Trials - A Primer for Researchers. Clin Epidemiol. 2020 May 8;12:457-467. doi: 10.2147/CLEP.S242097. PMID: 32440224; PMCID: PMC7218288.
  5. National Kidney Foundation. Race, ethnicity and kidney disease.
  6. Wright JT Jr, Kusek JW, Toto RD, et al. Design and baseline characteristics of participants in the African American Study of Kidney Disease and Hypertension (AASK) Pilot Study. Control Clin Trials. 1996 Aug;17(4 Suppl):3S-16S. doi: 10.1016/s0197-2456(96)00081-5. PMID: 8889350.
  7. Friedman DJ, Pollak MR. Apolipoprotein L1 and Kidney Disease in African Americans. Trends Endocrinol Metab. 2016 Apr;27(4):204-215. doi: 10.1016/j.tem.2016.02.002. Epub 2016 Mar 3. PMID: 26947522; PMCID: PMC4811340.
  8. IQVIA. Advancing Diversity in Clinical Development through Cross-Stakeholder Commitment and Action.
  9. FDA. Drug Trial Snapshots Summary Report 2021
  10. FDA. Considerations for the Design and Conduct of Externally Controlled Trials for Drug and Biological Products. Guidance for Industry. February 2023. Real-World Data/Real-World Evidence (RWD/RWE).
  11. Bioethics International Publishes New Index to Score Pharma Companies on Clinical Trial Diversity in BMJ Medicine
  12. Devaux E. Everything that happened in the synthetic data space in 2022. Oct. 6, 2022.
  13. Optum. Maximize your investments with a more coordinated and connected real-world data strategy.
© 2024 MJH Life Sciences

All rights reserved.