Leveraging a Unified Data Model to Drive Collaboration and Clinical Trial Efficiency


Applied Clinical Trials

This article will focus on a case study example of a collaborative effort to share investigator, site, and study information across companies, and on the single data model that underpins it.

We all know that the clinical trials industry is beset with inefficiencies, meaning it takes too long and costs too much to bring a new medicine to the patients who need it. There are clearly many opportunities to gain efficiencies, but one area that has been a particular focus recently is the promise of technology and data. 

Both pharmaceutical companies and CROs have access to a wide variety of technology and data related to the conduct of clinical trials, but in many cases a lack of sufficient standards stifles their ability to integrate across different technologies and data sources, leaving them unable to realize the promised efficiencies. Instead of streamlining the decision-making process, companies spend a significant amount of time on sequential use of non-integrated technologies, on manual processes to match across data sources, and on establishing definitions of clinical trial terms to enable a like-with-like comparison and exchange of data between partners.

Recently, however, there has been a notable shift. What was once treated as an individual company’s confidential information has become more narrowly defined, such that historical competitors are open to collaborating on data models, including common definitions and lists of values. The goal here is to leverage a single data model to enable connectivity for a broad range of clinical operations technology solutions and data, thus generating efficiencies for all players without compromising a single company’s “secret sauce” (e.g., compound, protocol design, processes, data algorithms).

This article will focus on a case study example of a collaborative effort to share (upon obtaining consent) investigator, site, and study information across companies, and on the single data model that underpins it. Specifically, the article describes how a single data model is being used by 15 major CROs and pharmaceutical companies (including members of ACRO, the TransCelerate Investigator Registry, and the Investigator Databank) to match and master data from Clinical Trial Management Systems (CTMS). Once aligned to a data model, companies are able to integrate with an end-to-end suite of clinical operations solutions. In the study planning phase, integration enables a unified view of data sources, including sharing across companies with appropriate permissions, supporting trial enrollment planning, country selection, and site/investigator identification. During the study conduct phase, this same integration powers site and sponsor user management within electronic trial application modules such as site activation, learning management, site engagement, and safety notifications. Finally, companies can receive the unique identifiers and associated data back into their own internal systems to support master data management, so they can access this ‘one single source of the truth’ across all their internal systems.

Unified Data Model

“Why has it been so hard to agree on a data model?” The answer is two-fold:

1) Each company’s installation of CTMS is unique (even when purchased from the same technology provider), so there is significant variation in data fields, names, and formats.

2) There was no way to match investigators and sites across companies and systems.

To overcome these challenges, we facilitated a series of workshops with four leading pharmaceutical companies to explore similarities and differences between companies. Through this investigation, we found that the way companies structured their data varied along two dimensions:

· Syntactic: different sets of data fields; varying nomenclature for similar data field labels; alternative vocabularies (ICD-9, MedDRA, MeSH) and terminology; non-compliance with international coding (e.g. ISO)

· Semantic: different meaning for same fields (e.g. site recruitment dates), various definitions for types of data (e.g. study phase) 

Because of the significant variation in each implementation, we decided to develop a unified data model including a stand-alone file specification to which each of the companies could easily map their own internal CTMS. Ultimately, it took 6 months of collaboration to generate the initial specification, and the conversation is ongoing.  

The specification defines:

· Data fields: their purpose, definition, format, type, size, and allowable values

· Mappings of vocabularies and controlled lists of values (LOVs)

· The business rules that are applied when importing data files

o Ensures common meaning between companies

o Sets the governance of allowable sharing of data between companies

o Enables calculations for key analytics and metrics
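As a rough illustration of what one entry in such a specification could look like, the sketch below records a purpose, a maximum size, and a list of allowable values for a single field, then checks incoming values against it. The FieldSpec class, the field name, and the sample list of values are assumptions made for this example and are not taken from the actual specification.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One hypothetical entry in a file specification."""
    name: str
    purpose: str
    max_size: int
    allowed_values: Optional[FrozenSet[str]] = None  # None means free text


def validate(spec: FieldSpec, value: str) -> List[str]:
    """Return the problems found; an empty list means the value conforms."""
    problems = []
    if len(value) > spec.max_size:
        problems.append(f"{spec.name}: longer than {spec.max_size} characters")
    if spec.allowed_values is not None and value not in spec.allowed_values:
        problems.append(f"{spec.name}: '{value}' is not in the list of values")
    return problems


# Illustrative entry only; the values are assumed, not the real LOV.
STUDY_PHASE = FieldSpec(
    name="study_phase",
    purpose="Clinical development phase of the study",
    max_size=10,
    allowed_values=frozenset({"Phase 1", "Phase 2", "Phase 3", "Phase 4"}),
)
```

Under these assumptions, a conforming value such as "Phase 2" produces no problems, while a company-specific label such as "Ph 1" is flagged as outside the list of values and would need to be mapped before import.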

Data Model

At a high level, the data model describes the following:

· Persons: Name, Email, Phone, Role, Training, Degree

· Facilities: Name, Address, Phone

· Studies: Protocol Number, Title, Phase, Milestone Dates, and Enrollment Numbers
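The three entities listed above could be sketched as simple record types. This is only an illustration of the shape of the model: the class and attribute names, the use of dictionaries for milestone dates and enrollment numbers, and the sample study are all assumptions for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class Person:
    name: str
    email: str
    phone: str
    role: str
    degree: str
    training: List[str] = field(default_factory=list)


@dataclass
class Facility:
    name: str
    address: str
    phone: str


@dataclass
class Study:
    protocol_number: str
    title: str
    phase: str
    # Milestone name -> date, and enrollment measure -> count (assumed layout).
    milestone_dates: Dict[str, date] = field(default_factory=dict)
    enrollment: Dict[str, int] = field(default_factory=dict)


# Made-up example record, not real study data.
study = Study(
    protocol_number="ABC-123",
    title="Example study",
    phase="Phase 2",
    milestone_dates={"first_patient_in": date(2016, 3, 1)},
    enrollment={"planned": 200, "actual": 185},
)
```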

Where possible, the field format complies with established data standards, such as dates in YYYY-MM-DD format per ISO 8601 and countries/regions based on ISO 3166. For a large number of fields, however, a new LOV was needed to enable cross-company data interoperability. Examples of where new standardized lists needed to be created include, but are not limited to:

· Degree (typically a text field in CTMS today)

· Role (unique to each company)

· Phase (company-specific variations such as Phase I, Ph 1, etc.)

· Specialty (two-level list, primary specialty and sub-specialty, applicable across all countries)

· Department type (two-level list, department and sub-department, for large research facilities)

· Study site status (unique to each company)

Rather than try to modify each customer’s source system, the file importing process contains a customer-specific mapping of internal values to the model such that the output database conforms to the Investigator Registry terminology.
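A customer-specific mapping of this kind might be sketched as a lookup table per company, applied at import time. The company keys, the phase labels, and the source date formats below are hypothetical; only the idea of translating internal values to a shared list and to ISO 8601 dates comes from the text above.

```python
from datetime import datetime
from typing import Optional

# Hypothetical per-company mappings from internal labels to the shared list.
PHASE_MAPPINGS = {
    "company_a": {"Phase I": "Phase 1", "Phase II": "Phase 2"},
    "company_b": {"Ph 1": "Phase 1", "Ph 2": "Phase 2"},
}

# Hypothetical per-company source date formats.
DATE_FORMATS = {"company_a": "%d.%m.%Y", "company_b": "%m/%d/%Y"}


def map_phase(company: str, raw: str) -> Optional[str]:
    """Translate a company-specific phase label; None if it is unmapped."""
    return PHASE_MAPPINGS[company].get(raw.strip())


def to_iso_date(company: str, raw: str) -> str:
    """Parse a date in the company's source format and emit YYYY-MM-DD."""
    parsed = datetime.strptime(raw.strip(), DATE_FORMATS[company])
    return parsed.date().isoformat()
```

Returning None for an unmapped label, rather than passing the raw value through, is a design choice in this sketch: it lets the import step surface values that still need a mapping instead of letting them reach the output database.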

Unique Identifier for Persons and Facilities

While the file specification and underlying data model solved the issue of producing uniform data, we also needed to build a system for matching both clinical research personnel and sites across systems to achieve an interoperable dataset. Most systems that match across sources today do so based on email. Our experience in the clinical research space, however, suggests that email may be missing and does not uniquely identify a person (e.g., a site-level mailbox used by all staff, or a person who works at two different sites with a separate email for each site).

Matching the identities of clinical research staff (as opposed to their email addresses) across data sources was a complex challenge. We initially developed a canonicalization tool based on a series of algorithms that support a probabilistic match to assign a unique identifier, called the DrugDev Golden Number, independently for persons and research facilities. However, as we started to process actual data, we needed to fine-tune the model to account for missing, multiple, or shared emails, as well as to address the high variability in the source CTMS data (e.g., nicknames vs. given name, inclusion of middle name/initial in first name, spelling mistakes) and incorrect information (e.g., first name = not, last name = available or first name = @#$12!).
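To give a feel for the kind of incorrect information described above, the following minimal sketch rejects placeholder words and symbol-only values in a name field. The placeholder list and the rejection rules are assumptions for illustration; a production list would have to be curated from real source data, and this is not the actual canonicalization tool.

```python
import re
from typing import Optional

# Assumed placeholder tokens; a real list would be curated from source data.
PLACEHOLDERS = {"not", "available", "n/a", "unknown", "tbd"}


def clean_name_part(raw: str) -> Optional[str]:
    """Return a tidied name part, or None if it looks like junk."""
    text = " ".join(raw.split())  # trim and collapse internal whitespace
    if not text:
        return None
    if text.lower() in PLACEHOLDERS:
        return None
    # Values with no letters at all (e.g. "@#$12!") cannot be names.
    if not any(ch.isalpha() for ch in text):
        return None
    # Digits and these symbols are treated as data-entry noise.
    if re.search(r"[0-9@#$!%^*=+<>]", text):
        return None
    return text
```

Hyphens and apostrophes are deliberately left alone so that names such as "O'Brien" survive cleaning.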

Ultimately, we settled on a three-step process:

1) Data cleaning: 

·  Removal of incorrect data and mapping of company-specific lists to the LOVs in the data model

2) Matching:

· An automated probabilistic match based on an expanding set of business rules for linking clinical personnel and facilities

3) Manual intervention:

· Review of records with an inconclusive result with tools for linking and unlinking clinical personnel and facilities

While we are continuously looking for ways to improve the automated matching algorithms, we had to build the model to be conservative so as not to over-link clinical personnel. Doing so avoids a privacy risk to the cross-company collaboration and ensures that clinical trial records are shared in accordance with the rules of the collaboration and within appropriate privacy guidelines.
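The conservative stance can be pictured as two thresholds on a match score: only very high scores link automatically, mid-range scores go to manual review, and everything else stays unlinked. The sketch below uses a simple weighted string similarity as a stand-in for the probabilistic match; the weights, thresholds, and record layout are illustrative assumptions, not the rules used in production.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted score; the weights are illustrative assumptions."""
    score = 0.4 * similarity(rec_a["last_name"], rec_b["last_name"])
    score += 0.3 * similarity(rec_a["first_name"], rec_b["first_name"])
    email_a, email_b = rec_a.get("email"), rec_b.get("email")
    # Email only adds evidence when present on both records and equal.
    if email_a and email_b and email_a.lower() == email_b.lower():
        score += 0.3
    return score


def decide(score: float, link_at: float = 0.9, review_at: float = 0.6) -> str:
    """Conservative decision: auto-link only when the score is very high."""
    if score >= link_at:
        return "link"
    if score >= review_at:
        return "review"  # inconclusive, route to manual intervention
    return "no_link"


a = {"first_name": "Anna", "last_name": "Smith", "email": "a.smith@example.org"}
b = {"first_name": "Anna", "last_name": "Smith"}  # same name, email missing
c = {"first_name": "Robert", "last_name": "Jones", "email": "r.jones@example.org"}
```

With these assumed weights, two records that agree on name but lack a shared email cannot reach the automatic link threshold, so they are routed to review rather than linked, which mirrors the preference for under-linking over over-linking.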


Today, both pharmaceutical companies and CROs are using this unified data model and unique identifiers to facilitate data matching and integration with CTMS systems, public data, third-party providers, and identity management providers. In addition to individual company adoption, as mentioned previously, this data model and unique identifiers are also being used by industry collaborations such as the Investigator Databank and TransCelerate’s Shared Investigator Platform and Investigator Registry. Leveraging the shared conventions described here reduces the manual effort required by pharmaceutical companies to integrate multiple sources of information for the purpose of creating actionable insights. For example, this enables more seamless working between pharmaceutical companies and CRO partners, integration of information from clinical trial registries (e.g. clinicaltrials.gov) with that from company CTMS sources, and the ability to view person and facility profile information integrated with data from multiple other data sources.

We have performed a detailed return on investment assessment to quantify the potential value of cross-company collaboration, focusing on the impact of data integration and data sharing on study planning, site selection, site start-up, and internal master data management (1). Nearly 75% of the value of data integration and sharing to an average company is driven by three benefits: more informed decision making due to availability of site-level details, increasing investigator engagement, and decreasing time spent by staff and CRAs on site start-up. Companies are also experiencing decreased time and costs for data mastering, and increased investigator engagement, which can improve study start-up timelines (as well as helping to address the industry-wide issue of high investigator turnover).

Beyond the time saved by leveraging a single model to drive a full range of clinical operations activities, companies also benefit from more efficient use of clinical trial technology, with associated improvements in enrolment, patient retention, protocol compliance, data query resolution, and study conduct efficiency. There are also overarching operational benefits, including more accurate and up-to-date data, access to one single source of truth across all systems and studies, and integrated reporting across programs and therapeutic areas.

This case study demonstrates that a common technology approach can benefit the entire industry. To be successful, however, a unified data model must be developed in collaboration with different types of stakeholders in order to reflect the broad variation observed in the marketplace today. Once developed and adopted, a unified data model can be used to drive internal company efficiencies as well as cross-industry collaboration to share information, and ultimately to accelerate drug development through improvements in the clinical trial process.



1) Sears C, Cascade E, Klein T. The Data-Decision Debate: To Share or Not to Share? Applied Clinical Trials. Volume 25, Issue 8, Aug 01, 2016.



Claire Sears, PhD, is Product Communications Director and Elisa Cascade, MBA, is Chief Product Officer, both affiliates of DrugDev, an IQVIA Company.
