Progress Report: Pharma Adoption of Data Analytics

Published in: Applied Clinical Trials, 02-01-2023, Volume 32, Issue 1/2

Assessing industry’s use and integration of clinical data science across business.

In 2021, a Tufts Center for the Study of Drug Development study showed that a typical Phase III clinical trial generates 3.56 million data points.1

Consider that number from another angle: just 20% of clinical trial data are analyzed, as opposed to the 80% that are stored. That split could illustrate the level of application of data science in the pharmaceutical industry. “We have been slow to adopt [artificial intelligence]-driven solutions,” said Wendy Morahan, senior director, clinical data analytics, for IQVIA. She made her remarks, and cited the 20-80 breakdown, in a webcast broadcast last year.2

Morahan’s view is shared inside and outside the industry.

“A lack of evolution within the pharmaceutical industry’s analytic capabilities has given rise to inefficiency and a failure to leverage modern technologies within the clinical development space,” says TransCelerate BioPharma Inc. on its website.3

Adoption of data analytics in the pharma industry has been slower than in other industries, says Andrew Lo, PhD, Charles E. and Susan T. Harris professor, a professor of finance, and the director of the Laboratory for Financial Engineering at the MIT Sloan School of Management. In research over the last decade, the Lo team has shown how data science, using scores of factors, can more accurately predict a drug’s approval success.

But back to the roughly 3.6 million (more precisely, 3.56 million) data points. Under the 20-80 breakdown, a little math shows that if about 712,000 data points are analyzed and about 2.85 million remain warehoused, a lot of data is left on the cutting room floor. (A 2018 Harvard Business Review article also discussed that percentage breakdown.4) The Tufts study also showed that Phases II and III now include 263 procedures per patient, with the generated data used to produce 20 endpoints. These data illustrate the yin and yang of decentralized clinical trials (DCTs): easier on patients, yet often requiring more data integration and analysis.
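For readers who want to check the 20-80 arithmetic, here is a minimal sketch. It assumes the split applies uniformly to the Tufts figure of 3.56 million data points; the function name and constants are illustrative, not drawn from any of the cited sources.

```python
# Tufts figure cited earlier: data points in a typical Phase III trial.
TOTAL_DATA_POINTS = 3_560_000
# The 20-80 breakdown: share of data analyzed, as a whole percentage.
ANALYZED_PERCENT = 20


def split_data_points(total: int, analyzed_percent: int) -> tuple[int, int]:
    """Return (analyzed, stored) counts for a given total and analyzed share."""
    analyzed = total * analyzed_percent // 100  # integer math avoids float drift
    stored = total - analyzed
    return analyzed, stored


analyzed, stored = split_data_points(TOTAL_DATA_POINTS, ANALYZED_PERCENT)
print(f"analyzed: {analyzed:,}")
print(f"stored:   {stored:,}")
```

Starting from the rounded 3.6 million instead, the same split gives 720,000 analyzed and 2.88 million stored, so the conclusion is the same either way: most of the data sits unused.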

That data-braiding effort is exactly the sweet spot. Integrating different types of data, including socioeconomic, electronic health record (EHR)-generated information, claims data, and financial information, can provide clearer pictures of health at population levels and holistic pictures at the personal level.5,6

According to interviews and published studies, including work from the Lo lab, data analytics can de-risk financial investments by quantifying a product’s chances for approval; cut costs in clinical trials, including by replacing a human control arm with a synthetic one; and find evidence, early on, of a drug’s possible harmful effects.

Not everyone buys the argument that pharma in toto is behind the times. Longtime pharma consultant Craig Lipset says in his experience, the opposite is true. “Every major pharma company has [invested in] tremendous executive-level hiring and built massive organizations” around the theme of data science across their business, he tells Applied Clinical Trials.

Pharma, he continues, is entering into deals with AI companies and hiring intensively. He notes a Janssen data science symposium at which he spoke last year had “massive” attendance. Lo also had a similar story regarding a Novartis internal conference at which he spoke; hundreds of data scientists were present, he says.

Adds Lipset: “I’d say the question now is if these investments are paying [off. Is] data science really the answer, or is basic science just really hard?”

Evidence of blanket, C-suite adoption of data analysis might be debatable. But pharma professionals who deal with digits day-to-day are joining groups that use data analysis, as much as a disparate health system like ours can, to standardize data, create universal languages, and streamline other processes. These individuals, along with academics and the federal government, are banding together to figure out how to base the clinical trial world, and healthcare improvement, on data science.

For the most part, these people, thousands of them, are volunteering their time—with their employers’ blessing, and in some cases, with their employers’ money, via funding donations or membership fees.

From afar, pharma’s strategy here, if there is one, resembles its approach to R&D: why do it internally if a biotech or asset exists to fill the need?

That said, the volunteers are not competitively shackled, whether volunteering in the preclinical or post-market surveillance space.

“We are committed to better healthcare, we each contribute what we can do,” says George Hripcsak, MD, the Vivian Beaumont Allen professor of biomedical informatics, Columbia University, and director of the OHDSI Coordinating Center. The goal here isn’t for “someone else to make money or because they have a grant,” he adds.

The many acronyms of data analysis

The uninitiated might find the array of acronyms dizzying: OHDSI (Observational Health Data Sciences and Informatics), DiMe (Digital Medicine Society), DTRA (Decentralized Trials & Research Alliance), Vulcan HL7 FHIR (Fast Healthcare Interoperability Resources).

These organizations and consortiums, regardless of their stated mission, be it language standardization, more rapid advancement and use of DCTs, or, in the case of TransCelerate, improving how clinical trials operate, all see data integration as a requirement for improving efficiencies and, in the end, healthcare. “Clinical research, healthcare, and market data are captured across multiple sources in multiple structures preventing system interoperability that, when connected, will enable and drive insights, as well as allow for the integration of various data sources,” according to TransCelerate.7

The people creating these pools, or in some cases oceans, of data do so for just the members of their particular group, like the Duke University Health System,8 or, in the case of OHDSI, for anyone who wants to dive in. Other organizations provide funding: the NIH’s National Center for Advancing Translational Sciences (NCATS) does so through its Clinical and Translational Science Awards (CTSA) program to help awardees create collaborations between their networks “to implement, assess, and/or disseminate discoveries across the network.”9


Some entities’ missions are to make running clinical trials easier and more efficient. The Medical University of South Carolina has built an open-source network with dozens of members in healthcare systems and research institutions; the Universities of Iowa and Utah pitch in to help run the open-source functionality.10 Built on CTSA funding, the system has two components: the first takes researchers’ requests for specific types of data, for help with biostatistics, and so on, while the second integrates six networks of data submitted by the participating members. These data networks include EHR patient information, the spending tracker SmartStream, and Click, an institutional review board (IRB) submissions and tracking product.11

The second system, dubbed RINS (Research Integrated Network of Systems), says Leslie A. Lenert, MD, chief research information officer, Medical University of South Carolina and director, Biomedical Informatics Center, allows the center “to match controls. We can do faster enrollment, we track time to start [a trial], from the day the grant arrives to first enrollment,” he explains. The point, adds Lenert, is to reduce time and money spent on a trial, from the time used in its design, acquiring IRB approval, and so on. “All those processes take time at a given site,” he says.

OHDSI, which is coordinated by the Columbia University Department of Biomedical Informatics, has an extensive publication track record (2,000-plus authors, 500-plus articles) and 3,000 members from 80 countries.

The OHDSI collaborators page includes the Who’s Who in pharma: Amgen, Bayer AG, Johnson & Johnson (Janssen), Merck—along with universities, research hospitals, IQVIA, and Oracle. It was the predecessors of OHDSI who built the common open-source platform known as OMOP, or Observational Medical Outcomes Partnership.

Since 2009-2010, when OMOP went operational, it has been used to convert 928 million patient records to this common data model, 12% of the world’s population, says Hripcsak. Pharma, he adds, donates money or pharma employees donate their time, benefitting the company. Funding also comes through NIH grants and FDA convener grants. OHDSI works with others as well, including FDA, on devising a common vocabulary, data standards, analytic methods, and software tools that are all open source. The NIH has recently asked that research applicants familiar with OMOP use that language in their request for applications.

A discussion on pharmacovigilance

One area where data analytics is taking hold is pharmacovigilance. OHDSI has an active group that has published multiple reports on topics such as the accuracy of an open-source knowledge base and finding negative controls.12

Adverse events (AEs) are an important area of concern. Reports of death from AEs have gone up: in 2016, there were 141,851 reports of death; in 2021, 187,493. Reports of serious events during that time also rose, from 825,858 to 1,373,116.13

Precision pharmacovigilance is not just a regulatory obligation anymore, says Bruce Palsulich, vice president, product strategy, Oracle. Pharmacovigilance has become a highly curated dataset, an information asset. “Can you leverage this even preclinical? What is the risk profile of certain targets during discovery?” he asks. “That shift toward a predictive and leveraging a pharmacovigilance dataset can be used throughout the clinical trial lifecycle; we are seeing it used.”

With a description of the event, adds Palsulich, humans could read the abstract to get the information, or AI could extract it and a human could then do the quality review. That, he says, could eliminate much of the intake process, likely 50% or more. These techniques would allow the intake of information to scale broadly, he explains, and inform more about the safety process. The intent is to move “toward having adverse events in the database to do analytics, find the potential signals or risks, or increasing frequency.”

While certain AI improvements will drive cost reductions, not every use case will have a measurable return on investment, per se, according to Palsulich. For pharma, the real gain is in differentiating itself from the competition: “We will continue to see pharmacovigilance as being more of a valued asset than a regulatory” obligation, he tells Applied Clinical Trials.

Education and the realities of data integration

Ken Gersing, MD, director of information, division of clinical innovation, NCATS, knows about the importance of data harmonization and creation: He is part of the team that created N3C, the nationwide COVID-19 data tracking system. He likened the way data integration should be handled to a car assembly line.

“Think of the Ford River Rouge auto factory that Henry Ford built,” Gersing said in an email. “Steel went in the door and out popped a car at the end of the assembly line. Data is similar in that in order to use it at every stage, you need to know how each step impacts the next. If investigators don’t understand the data or if the data is transformed incorrectly the conclusions could be wrong.”

And that final result, he said, very much depends on adequate training; just having the data does not promise effective clinical decision-making or anything else.

Added Gersing: “We need an intelligent workforce that understands how algorithms work, the data that created them, their limitations, the impact of new data, and the context for which they are intended. If not, the car will just not work properly.” 

Think about the impact of wrong information going into a clinical decision support system, he continued. If the conclusions are in a peer-reviewed journal, the author will receive feedback on their work but clinical care is not impacted. But in this country’s modern health care system, decision support could be embedded in a health system EHR. If this were to happen, the misinformation would be propagated to all the health care providers.

He and MIT’s Lo said missing data is a huge issue in the field. “All insights are based on the quality, accuracy, and comprehensiveness of that data,” explained Gersing. “The less missing data I have, the more I can say it helps [the patient].”

And so back to the discussion of pharma and its direct involvement with data scientists. Lo, who is now well-known in large pharma circles, says the answer is mixed, and depends on which company is under discussion.

But in general, “if they don’t hire data scientists they will be left behind,” he asserts.

Editor’s note: For more information on clinical endpoint standardization efforts, read this article.

Christine Bahls is a freelance writer for medical, clinical trials, and pharma information.


  1. Smith Z.; Bilke R.; Pretorius S.; Getz K. Protocol Design Variables Highly Correlated with, and Predictive of, Clinical Trial Performance. Therapeutic Innovation & Regulatory Science. 2022, 56 (2), 333–345.
  2. Ensuring Data Quality in a Complex Trial Landscape. IQVIA webinar, 2022.
  3. Modernization of Statistical Analytics. TransCelerate BioPharma Inc. summary page.
  4. Bowne-Anderson H. What Data Scientists Really Do, According to 35 Data Scientists. Harvard Business Review, August 15, 2018.
  5. Clay I.; Angelopoulos C.; Bailey A.L.; et al. Sensor Data Integration: A New Cross-Industry Collaboration to Articulate Value, Define Needs, and Advance a Framework for Best Practices. J Med Internet Res. 2021, 23 (11), e34493.
  6. Ratitch B.; Rodriguez-Chavez I.R.; Dabral A.; et al. Considerations for Analyzing and Interpreting Data from Biometric Monitoring Technologies in Clinical Trials. Digit Biomark. 2022, 6 (3), 83-97.
  7. Our Strategy. TransCelerate BioPharma Inc. summary page.
  8. Hurst J.H.; Liu Y.; Maxson P.J.; et al. Development of an Electronic Health Records Datamart to Support Clinical and Population Health Research. J Clin Transl Sci. 2020, 5 (1), e13.
  9. NIH, Limited Competition: Administrative Supplements to Enhance Network Capacity: Collaborative Opportunities for the CTSA Program (Admin Supp).
  10. He W.; Sampson R.; Obeid J.; et al. Dissemination and Continuous Improvement of a CTSA-based Software Platform, SPARCRequest©, Using an Open Source Governance Model. J Clin Transl Sci. 2019, 3 (5), 227-233.
  11. Borfitz D. New ‘Research Data Mart’ to Help Academic Sites Track Trial Performance. Clinical Research News, May 10, 2021.
  12. Pharmacovigilance Evidence Investigation Workgroup. Observational Health Data Sciences and Informatics.
  13. FDA, Adverse Events Reporting System (FAERS) Public Dashboard.