Key findings on unstructured data and eSource
- More than 80% of healthcare data remains unstructured, limiting the full value of current eSource implementations.
- AI and NLP can reliably structure text, imaging, and reports, but regulatory-grade use hinges on validation, traceability, and governance.
- FHIR and OMOP enable interoperability, but local data quality and mapping remain critical trust factors.
- Safety monitoring, medications, and imaging emerged as the most viable near-term use cases.
- Federated and continuous validation models are seen as essential to scaling eSource beyond large academic centers.
1. Introduction
Electronic source data (eSource) refers to the direct digital capture of clinical information at its point of origin—whether from electronic health records (EHRs), laboratory systems, electronic patient-reported outcomes (ePRO), wearables, or digital consent platforms. The goal is consistent: to reduce duplicate entry, improve accuracy, and accelerate data flow into clinical trials.1,2
Within this broader concept, EHR-to-EDC has become the most widely implemented eSource pathway. By securely transferring data from EHRs into electronic data-capture (EDC) systems, it reduces manual transcription, supports regulatory compliance, and lowers site burden. Benefits include fewer transcription errors, faster database lock, and measurable cost savings.3 These successes form the foundation for the next frontier: activating unstructured data domains that remain largely untapped.
More than 80% of healthcare information remains unstructured, locked in free-text notes, pathology and radiology reports, or imaging files.4,5 In oncology and rare-disease trials, between 45% and 70% of trial-relevant variables exist only in unstructured formats.6 Unless these sources are incorporated, eSource adoption risks plateauing at partial coverage. This fragmentation limits reusability, slows trial start-up, and sustains reliance on manual abstraction—one of the largest hidden costs in clinical research.
The challenge extends beyond technology. Data quality, interoperability, and validation requirements remain uneven across institutions, while standards such as HL7 FHIR and the OMOP Common Data Model continue to mature in operational deployment.7-10 Emerging artificial intelligence (AI) and natural-language-processing (NLP) methods now provide new ways to structure text, link multimodal data, and automate mapping into shared models.11-13 Yet these innovations raise important questions about validation, governance, and regulatory acceptance, as well as the readiness of hospitals and vendors to implement them at scale.
This study explores how key stakeholders—hospitals, sponsors, and technology providers—perceive the technical, operational, and regulatory pathways for integrating unstructured data into eSource-enabled clinical trials. Through semi-structured interviews and synthesis of recent initiatives, we identify where progress has been made, where validation gaps remain, and what collaborative models can accelerate trustworthy adoption.
2. Methodology and respondents
This study draws on nine in-depth expert interviews conducted between June and September 2025, complemented by a review of recent literature. Respondents were selected to represent diverse perspectives across the clinical research ecosystem, including healthcare providers, life sciences sponsors, and technology vendors. Interviews were semi-structured, based on a shared guide, and transcripts were thematically analyzed to identify converging insights and key divergences. Collectively, the respondents exemplify the three main stakeholder groups shaping the future of eSource adoption:
- Hospitals and care providers, bringing the perspective of how unstructured data is generated and managed within clinical practice.
- Pharmaceutical sponsors, who set requirements for regulatory-grade data in clinical trials.
- Vendors and technology innovators, developing the tools and platforms needed to operationalize eSource at scale.
Respondents included:
- Sarah Burge, Director of Clinical Integration, Cambridge University Hospitals
- Lars Fransson, Strategy Lead at AstraZeneca R&D, AstraZeneca
- Adriano Garcez, Senior Director, ZS Associates
- Chris Harrison, Senior Director, Clinical Programs, AstraZeneca
- Joeri Holtzem, Clinical Data Standards Specialist, Johnson & Johnson
- Gabriel Maetzu, Founder and Chief Medical Officer, IOMED
- Thomas Metcalfe, SP Life Sciences, Evidentli
- Felix Nensa, Professor of Radiology and Senior Consultant, University Hospital of Essen
- Steve Tolle, Chief Product & Technology Officer, IgniteData
- Richard Yeatman, Chief Operating Officer, IgniteData
The interviews provide the empirical foundation for this analysis. Affiliations are listed here for context; in the following sections, respondents are referred to by name only to streamline the narrative.
3. Strategic relevance
Across healthcare systems, roughly 80% of clinical information remains unstructured, typically found in free-text notes, diagnostic reports, and imaging files.4-6 In data-intensive domains such as oncology, much of the trial-relevant information still resides exclusively in narrative form—within pathology narratives, radiology findings, and clinician observations that rarely appear in coded EHR fields. Unless these sources are captured and linked to structured datasets, eSource adoption risks plateauing at partial coverage.
Unstructured data hold critical contextual detail that can clarify eligibility, disease stage, and treatment response. Leveraging this information through natural-language processing (NLP), optical character recognition (OCR), and multimodal AI enables sponsors and hospitals to recover insights previously invisible to standard data-capture workflows. When combined with interoperability frameworks such as FHIR and the OMOP Common Data Model, these techniques enable harmonization across institutions while preserving provenance and regulatory traceability.7-10
Yet technology alone cannot unlock the full value of unstructured data. Differences in documentation styles, variable EHR maturity, and inconsistent governance policies still constrain reuse at scale.11-13 Progress therefore depends on coordinated validation frameworks and collaborative models between hospitals, sponsors, and technology providers—an approach explored in later sections of this paper.
Figure 1 illustrates the relative proportions of structured and unstructured data across healthcare and clinical-trial domains, emphasizing how narrative information complements coded variables in representing patient context.
4. Technical pathways to structuring data
Turning unstructured health data into trial-ready formats requires both advanced technologies and robust interoperability frameworks. Respondents agreed that while the technical potential is now proven, scalable implementation depends on coordinated progress across institutions and therapeutic areas. As shown in Figure 2, unstructured data from clinical notes, reports, images, and medication records are processed through AI-driven extraction, standardized using FHIR and OMOP models, and validated through rules, human oversight, and audit trails. The circular design emphasizes continuous validation and reuse at the institutional level rather than one-off trial preparation.
4.1 AI and Multimodal Data Processing. AI—particularly natural language processing (NLP) and large language models—is central to transforming unstructured text into structured variables. Traditional rule-based methods still serve narrow purposes, but modern NLP captures complex relationships across notes, reports, and even images. Richard Yeatman explained, “We could all draw this on a diagram and show a way now with modern technology and methods that we could do all of it today. The question is what’s the journey to get there to digitize all these things?” Respondents highlighted the emergence of “agentic AI,” where multiple models check and validate one another’s outputs, improving confidence and auditability.
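The cross-checking idea behind "agentic AI" can be illustrated with a minimal sketch. The extractor functions and agreement rule below are hypothetical stand-ins, not any respondent's implementation: two independent models process the same note, agreements are auto-accepted, and disagreements are routed to human review.

```python
# Minimal sketch of cross-model checking: two independent extractors
# process the same note; only fields on which they agree are
# auto-accepted, and disagreements are flagged for human review.
def extractor_a(note: str) -> dict:
    # Hypothetical model A: naive keyword lookup for tumor stage.
    return {"stage": "II"} if "stage ii" in note.lower() else {"stage": None}

def extractor_b(note: str) -> dict:
    # Hypothetical model B: a second, independent heuristic.
    return {"stage": "II"} if "t2n0m0" in note.lower() else {"stage": None}

def cross_validate(note: str) -> dict:
    a, b = extractor_a(note), extractor_b(note)
    accepted, flagged = {}, []
    for field in a:
        if a[field] == b[field] and a[field] is not None:
            accepted[field] = a[field]   # models agree: auto-accept
        else:
            flagged.append(field)        # disagreement: human review
    return {"accepted": accepted, "needs_review": flagged}

result = cross_validate("Pathology: stage II adenocarcinoma, T2N0M0.")
```

The point of the design is auditability: every auto-accepted value carries evidence of independent agreement, and every disagreement leaves a review trail.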
For radiology and pathology, multimodal AI combining text and images is advancing rapidly. Steve Tolle noted, “We are now at a point where some AI technology can do true volumetric measurements of lesions over time. That could change how tumor response is calculated and open imaging to automated use in trials.” Such integration could accelerate oncology studies by transforming imaging endpoints into validated digital evidence. However, scaling these models into regulatory-grade eSource workflows demands new infrastructure, rigorous documentation, and continuous monitoring to ensure transparency and traceability of algorithmic outputs.
4.2 Structured Domains and Interoperability Standards. Medications were widely cited as a practical early target for AI automation. Concomitant medication records often vary in format—ranging from dropdown menus to free text or scanned PDFs. Yeatman cautioned, “Even structured domains vary enormously by site. What looks clean in one hospital can still be scanned PDFs in another.” Despite this, medications are viewed as a feasible near-term success because drug names and dosages can be normalized against dictionaries such as ATC and RxNorm, allowing automated validation. Steve Tolle added, “…and also pick up concomitant medications and adverse events out of the EHR.” These features make medication data a “transitional domain,” bridging structured and unstructured sources as eSource scales.
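The normalization step that makes medications a feasible near-term target can be sketched as follows. The tiny lookup table stands in for a real RxNorm or ATC terminology service, and the codes are illustrative placeholders, not verified identifiers:

```python
import re

# Tiny stand-in for an RxNorm/ATC lookup service; the codes below are
# illustrative placeholders, not verified RxNorm identifiers.
DRUG_DICT = {
    "metformin": "RXCUI:0000001",
    "atorvastatin": "RXCUI:0000002",
}

def normalize_medication(raw: str) -> dict:
    """Parse a free-text medication entry into name, dose, unit, and code."""
    text = raw.strip().lower()
    match = re.search(r"([a-z]+)\s+(\d+(?:\.\d+)?)\s*(mg|g|ml)", text)
    if not match:
        return {"raw": raw, "code": None}  # route unparsed entries to review
    name, dose, unit = match.groups()
    return {
        "raw": raw,
        "name": name,
        "dose": float(dose),
        "unit": unit,
        "code": DRUG_DICT.get(name),  # None if the name is unknown
    }

rec = normalize_medication("Metformin 500 mg twice daily")
```

Entries that fail to parse or map keep a `None` code, which is exactly the property that makes automated validation possible: unmapped records are surfaced rather than silently dropped.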
Standards such as HL7 FHIR and the OMOP Common Data Model (CDM) provide the backbone for interoperability. FHIR enables real-time data exchange, while OMOP harmonizes datasets retrospectively for population-level analysis. Gabriel Maetzu cautioned, “I just see them as a vehicle. Too many people are focused on a certification or a standard as the end goal rather than capturing the real value of the underlying data.” Thomas Metcalfe added, “OMOP gives us the ability to normalize across hospitals, but unless sites have reliable mapping at source, we can’t trust downstream analysis.” These frameworks support modular extensions—including imaging and genomics—but achieving regulatory-grade outputs still requires local governance, provenance tracking, and quality assurance. As one respondent summarized, “Standards make it possible to talk to each other—but trust comes from what happens inside the walls of each institution.”
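The "reliable mapping at source" Metcalfe describes amounts to translating records between the two models. A minimal sketch of that step is below: field names follow the public FHIR MedicationStatement and OMOP drug_exposure specifications, but the concept lookup is a placeholder for a real OMOP vocabulary service.

```python
# Sketch of a FHIR-to-OMOP mapping step. Field names follow the public
# FHIR MedicationStatement and OMOP drug_exposure specs; the concept
# lookup is a placeholder for a real OMOP vocabulary service.
CONCEPT_LOOKUP = {"http://www.nlm.nih.gov/research/umls/rxnorm|0000001": 99999}

def fhir_to_omop(resource: dict) -> dict:
    coding = resource["medicationCodeableConcept"]["coding"][0]
    key = f"{coding['system']}|{coding['code']}"
    return {
        "person_id": int(resource["subject"]["reference"].split("/")[1]),
        "drug_concept_id": CONCEPT_LOOKUP.get(key, 0),  # 0 = unmapped in OMOP
        "drug_exposure_start_date": resource["effectivePeriod"]["start"],
        "drug_source_value": coding["code"],  # provenance: original code kept
    }

row = fhir_to_omop({
    "resourceType": "MedicationStatement",
    "subject": {"reference": "Patient/42"},
    "effectivePeriod": {"start": "2025-06-01"},
    "medicationCodeableConcept": {"coding": [{
        "system": "http://www.nlm.nih.gov/research/umls/rxnorm",
        "code": "0000001",
    }]},
})
```

Keeping the original code in `drug_source_value` is the provenance-preserving convention: downstream analysts can always trace a harmonized row back to what the site actually recorded.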
4.3 Local Variability and Scalable Hybrid Models. Even with strong standards and maturing AI, variability in documentation remains a critical barrier. Clinician shorthand, abbreviations, and note styles differ across hospitals and departments, complicating generalization. Felix Nensa remarked, “AI-driven solutions are already capable of transforming text into standardized formats, but the problem is that no two hospitals document the same way.” The challenge multiplies in multinational studies, where language diversity limits model portability. Most NLP systems are still trained on English corpora, though new initiatives are expanding multilingual medical datasets and benchmarks.15,16 Until such resources mature, scalability across global trial networks will remain constrained.
To address these differences, respondents envisioned hybrid pipelines combining AI extraction, standards-based mapping, and layered validation. In these models, AI identifies relevant signals from text or images; outputs are mapped to FHIR for trial operations and OMOP for analytical use; and rules, cross-model checks, and human review safeguard quality. Lars Fransson summarized, “We want to get rid of legacy workflows, not just add AI on top. The real progress comes when the workflow is redefined to capture structured and unstructured data together from the start.” Such hybrid, continuous-validation approaches promise scalable, reproducible data flows that reduce local variability while maintaining trust and regulatory compliance.
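The three layers described above—AI extraction, standards-based mapping, and rule-plus-confidence validation—can be sketched as a single pipeline. Everything here is illustrative: the stand-in extractor, the confidence threshold, and the range rule are assumptions for the sketch, not a production configuration.

```python
# Illustrative three-stage hybrid pipeline: AI extraction, standards
# mapping, and layered validation. Thresholds and rules are assumptions
# for the sketch, not a production configuration.
def ai_extract(note: str) -> dict:
    # Stand-in for an NLP model returning a value with a confidence score.
    return {"variable": "ecog_status", "value": 1, "confidence": 0.92}

def map_to_standard(finding: dict) -> dict:
    # Stand-in for FHIR/OMOP mapping; tags the target models.
    return {**finding, "targets": ["FHIR Observation", "OMOP measurement"]}

def validate(finding: dict, threshold: float = 0.90) -> dict:
    checks = [
        finding["value"] in range(0, 6),     # rule: ECOG status is 0-5
        finding["confidence"] >= threshold,  # model confidence gate
    ]
    finding["status"] = "auto-accepted" if all(checks) else "human-review"
    return finding

result = validate(map_to_standard(ai_extract("ECOG performance status 1")))
```

The design choice to express each safeguard as an explicit, inspectable check is what makes such pipelines auditable: a reviewer can see exactly which gate a flagged value failed.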
5. Validation, trust, and regulatory readiness
If technology can structure unstructured data, the central question is whether those outputs can meet the stringent validation standards required for randomized clinical trials. Unlike structured domains such as labs or vitals—where validation processes are well established—AI-derived outputs must demonstrate auditability, traceability, and reproducibility to the same degree. Respondents emphasized that while technical capability is advancing rapidly, building trust in these outputs will determine whether unstructured data can truly be considered regulatory-grade.
5.1 Roles and Regulatory Expectations. Respondents agreed that not all data serve the same regulatory purpose. Thomas Metcalfe observed, “If it’s central to the clinical hypothesis about the effectiveness and safety of a medicine, then it needs to be validated all the way back to source. But there’s a lot of secondary information which can be extremely useful without having that level of validity.” Exploratory analyses, feasibility work, and recruitment may tolerate more flexible validation than primary efficacy endpoints.
Both the FDA and EMA have begun to acknowledge the role of AI and NLP in deriving structured data from clinical sources. Yet formal validation frameworks remain nascent. Chris Harrison explained, “Whatever the agencies say in terms of regulating AI and NLP, validation steps have to ensure traceability. It’s still early days in determining what that path is going to look like for unstructured data.” The FDA’s Good Machine Learning Practice (GMLP) principles and the EMA’s Big Data Steering Group are early attempts to set expectations around documentation, transparency, and continuous monitoring.
These initiatives illustrate a shift toward a lifecycle view of validation—one that treats AI models as continuously learning systems rather than static tools. Institutions that document provenance, monitor algorithm performance, and embed human oversight will be best positioned to comply as these standards mature.
5.2 Validation Models Across Sites. A key debate concerned where validation should occur—centrally, locally, or through hybrid models. Lars Fransson cautioned, “If you need a human in the loop for every data point, the time savings may be limited. We want AI to be 100% reliable, but until then, we need mechanisms to validate results without adding more burden at the site.” Local review ensures contextual accuracy but can limit scalability, while fully centralized checks risk missing site-level variation.
Felix Nensa highlighted the ongoing challenge of inconsistency: “AI-driven solutions are already capable of transforming text into standardized formats, but variability in note-taking and imaging complicates automation.” As previously reported in the Applied Clinical Trials oncology study, “every site documents things differently. Standardizing EHR fields for key oncology data points would eliminate much of the manual effort needed for data abstraction.”
Pragmatic approaches often blend automation with human judgment. Richard Yeatman explained, “As long as we are in highly regulated industries, the only realistic way to use AI is with humans in the middle. The CRC’s role will evolve into mastering these tools and making the final call.” This “fighter-pilot with digital copilot” model encapsulates how human review will remain indispensable until AI achieves consistent, validated reliability across diverse hospital settings.
5.3 Continuous and Federated Validation Pathways. Safety monitoring emerged as the most practical early use case for AI-structured unstructured data. Felix Nensa noted, “Clinical notes are one of the few near real-time data sources we have, and they can capture side effects the moment they occur.” Because all safety signals are ultimately confirmed through pharmacovigilance, this domain faces fewer regulatory barriers and provides an immediate opportunity to prove value.
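The near-real-time safety use case can be illustrated with a deliberately naive keyword screen. A real system would use a validated NLP model and MedDRA coding rather than a term list; the sketch only shows the shape of the workflow—scan incoming notes, surface candidate adverse-event mentions for pharmacovigilance confirmation.

```python
# Deliberately naive illustration of scanning clinical notes for
# possible adverse-event mentions. A real system would use a validated
# NLP model and MedDRA coding, not a keyword list; every hit would be
# confirmed through pharmacovigilance.
AE_TERMS = {"nausea", "rash", "fatigue", "neutropenia"}

def screen_note(note: str) -> list:
    # Strip trailing punctuation and lowercase before matching.
    words = {w.strip(".,;").lower() for w in note.split()}
    return sorted(AE_TERMS & words)  # candidate terms found in the note

hits = screen_note("Patient reports grade 2 nausea and new rash on day 3.")
```

Because every candidate is routed into the existing pharmacovigilance confirmation step, the screen only needs to be sensitive, not definitive—which is what makes safety a low-barrier entry point.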
Looking ahead, several experts advocated for continuous validation embedded at the institutional level rather than ad hoc, trial-by-trial checks. Embedding AI-driven structuring in EHR workflows could generate “regulatory-ready” data as a by-product of care. Richard Yeatman observed that “the academic centers have the infrastructure to do this at scale, but that’s not the reality everywhere.” Chris Harrison added, “Smaller sites won’t be able to stand up continuous pipelines on their own—they will need something that comes packaged from vendors or networks.”
This vision points toward federated validation models, where algorithms and performance metrics are maintained centrally but deployed locally with shared oversight. Larger academic medical centers may pioneer continuous pipelines, while smaller hospitals participate through vendor-enabled or sponsor-supported solutions. Over time, such federated networks could harmonize quality standards, minimize redundant checks, and build a foundation of trust for AI-assisted data capture across the clinical-trial ecosystem.
5.4 Toward Continuous and Collaborative Validation. As eSource adoption expands, validation must evolve from static, study-specific checks toward continuous and federated approaches. Rather than verifying each AI or NLP pipeline in isolation, future models will depend on standardized monitoring of performance drift, provenance, and completeness across sites. Larger academic medical centers (AMCs) are likely to pioneer such continuous frameworks, given their data-science capacity and governance infrastructure.
However, smaller or community hospitals should not be excluded from participation. These organizations may leverage vendor-enabled or sponsor-supported validation models, in which algorithms and performance metrics are maintained centrally but deployed locally under shared oversight. This hybrid design ensures that regulatory-grade quality is maintained without imposing disproportionate resource demands.
In time, validation networks coordinated across sponsors, vendors, and healthcare systems could reduce redundancy, improve transparency, and enable enduring trust in AI-assisted data capture across all participating sites. Such an ecosystem would transform validation from a bottleneck into a collaborative, continuously learning process—laying the foundation for scalable, equitable, and trustworthy eSource implementation.
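The performance-drift monitoring that continuous validation depends on can be sketched as a comparison of a deployed extraction model's agreement-with-reviewer rate between a baseline window and the current window. The 5-point alert threshold is an assumption for illustration; real networks would tune it per variable and site.

```python
# Sketch of performance-drift monitoring for a deployed extraction
# model: compare agreement-with-reviewer rates between a baseline
# window and the current window. The 5-point drop threshold is an
# assumption for illustration only.
def agreement_rate(outcomes: list) -> float:
    # Fraction of extractions that matched the human reviewer.
    return sum(outcomes) / len(outcomes)

def check_drift(baseline: list, current: list, max_drop: float = 0.05) -> dict:
    base, cur = agreement_rate(baseline), agreement_rate(current)
    return {
        "baseline": round(base, 3),
        "current": round(cur, 3),
        "alert": (base - cur) > max_drop,  # flag if agreement fell too far
    }

report = check_drift(baseline=[True] * 95 + [False] * 5,
                     current=[True] * 85 + [False] * 15)
```

In a federated design, each site would compute such metrics locally and share only the aggregate rates, letting central oversight detect drift without moving patient-level data.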
6. Institutional readiness and operational challenges
Even if the technical capability to structure unstructured data exists, progress ultimately depends on the readiness of hospitals and research sites—their digital maturity, staffing capacity, and workflow alignment. Respondents repeatedly described this as the true bottleneck for scaling eSource in routine clinical research.
6.1 Institutional Capabilities and Readiness. Readiness remains uneven across the healthcare ecosystem. Large academic medical centers (AMCs) often have the data-science capacity, IT governance, and informatics expertise to implement structured pipelines and validation frameworks. Many smaller or community hospitals, however, lack such infrastructure. Their participation in eSource-enabled trials will depend on collaborative models—sponsor-supported integration or vendor-delivered validation services that can fit local workflows without heavy investment.
Respondents emphasized inclusivity. Sarah Burge stated, “If we want eSource to scale, we can’t make it a privilege of big centers. Smaller hospitals should be able to participate through standardized frameworks that don’t add complexity.” A sponsor representative agreed that “we need mechanisms that make it as easy for a 200-bed hospital to participate as a major academic site.”
Governance capacity also shapes readiness. Many institutions, particularly smaller ones, lack the resources to navigate data-protection reviews, regulatory templates, and contractual frameworks. Sponsors and vendors can help by providing standardized governance models and incremental onboarding support. Ensuring equity of participation will be critical so that eSource adoption does not deepen existing disparities in research capability.
6.2 Documentation Practices and Evolving Site Roles. Variability in documentation remains a central operational challenge. Gabriel Maetzu observed, “The most natural way to capture a patient’s story is through unstructured data, and if we want to really understand the patient, we have to read through it.” Yet this flexibility sacrifices consistency. Sarah Burge noted in the earlier Applied Clinical Trials study that “every site documents things differently.” Without harmonized workflows or templates, the validation burden falls heavily on trial staff.
At the same time, staff roles are evolving. Richard Yeatman explained, “The CRC’s job will shift to mastering these tools—like a fighter pilot with a digital copilot behind them.” Lars Fransson added a note of caution: “If you need a human in the loop for every data point, the time savings may be limited. We want AI to be reliable enough that review is the exception, not the rule.”
These perspectives underscore that workforce enablement must advance alongside technology. Training clinical research coordinators to collaborate effectively with AI tools—and redesigning documentation workflows to balance narrative richness with structure—will be essential for achieving sustainable, high-quality data capture.
6.3 Operational Governance and Change Enablement. Governance and change management form the connective tissue between readiness and execution. Even when the technology exists, deployment can stall due to lengthy compliance reviews, fragmented data-sharing agreements, or uncertainty about responsibility for validation oversight. Smaller institutions often lack dedicated legal and data-protection staff, delaying implementation.
Addressing these gaps requires streamlined governance frameworks and active change-enablement strategies. Sponsors and vendors can accelerate progress by offering pre-approved templates, shared validation networks, and support for workforce adaptation. Equally important is transparent communication with clinicians and administrators so that digital transformation is done with them, not to them.
Institutional readiness therefore depends on both capability and confidence. Hospitals must trust that eSource systems respect local governance, while sponsors must trust that outputs are reliable. Building this mutual assurance will turn compliance from a reactive burden into a proactive driver of innovation—creating the cultural and procedural conditions for unstructured data to flow securely and efficiently into research.
7. Collaboration across the ecosystem
Activating unstructured data for eSource cannot be achieved by any single stakeholder. Sponsors, sites, vendors, and regulators each control part of the pathway. Respondents consistently stressed that collaboration—especially in pre-competitive and standards-based settings—is essential to move from pilot projects to scalable, trusted adoption.
7.1 Collaborative Roles Across Sponsors, Vendors, and Sites. Pharmaceutical sponsors and technology vendors together form the operational backbone of eSource scale-up. Sponsors create demand, define regulatory requirements, and can invest in site readiness. Chris Harrison emphasized the shared responsibility: “Larger institutions may already have these solutions, but community sites often don’t. That’s where we as sponsors can step in and help provide the synergy to make it possible.”
Vendors translate this demand into practical tools, but success depends on close collaboration with healthcare sites. Too often, clinical software has been imposed rather than co-developed. Sarah Burge cautioned, “Hospitals are cautious about commercial systems, and IT is often done to clinicians rather than with them. For adoption to succeed, solutions must fit clinical workflows and be developed in partnership with staff.”
True progress therefore relies on shared design, not just technology transfer. Sponsors can co-fund implementation, vendors can embed clinician feedback loops, and sites can champion user-centered validation. Together, these aligned roles transform collaboration from transactional to co-creative—establishing the foundation for sustainable, site-ready eSource ecosystems.
7.2 Neutral Platforms and Shared Standards. Respondents repeatedly highlighted the importance of neutral, multi-stakeholder spaces where sponsors, vendors, hospitals, and regulators can work toward common frameworks. The i~HD eSource Task Force was cited as a model for such coordination—aligning technical standards with the operational realities of hospitals. Joeri Holtzem stressed, “This only works if we bring the EHR vendors into the discussion as well. Standards like FHIR and OMOP are important, but adoption depends on EHR systems allowing this in practice.”
Neutral conveners provide a safe environment to address interoperability challenges, validation protocols, and data-governance models without commercial bias. They also enable early harmonization between structured and unstructured domains so that pilots can scale beyond isolated institutions.
By fostering consensus on terminology, metrics, and validation workflows, initiatives like i~HD help transform fragmented innovation into reproducible best practice. Shared standards—when combined with transparent governance—allow sites of all sizes to contribute data confidently and consistently, reinforcing trust across the research ecosystem.
7.3 Building Regulatory Confidence Through Transparency. Collaboration must ultimately extend to regulators. Respondents agreed that early and open dialogue with authorities is crucial for legitimizing AI-enabled eSource workflows. Richard Yeatman explained, “As long as we are in highly regulated industries, the only realistic way to use AI is with humans in the middle. That means we need regulatory frameworks that recognize the human-in-the-loop model as valid.”
Engaging agencies early allows stakeholders to align on acceptable validation thresholds, documentation requirements, and performance-monitoring strategies. Transparency—about both algorithms and governance processes—builds the confidence regulators need to approve AI-derived evidence.
This collaborative transparency also benefits sites and sponsors: it clarifies expectations, reduces redundant audits, and accelerates approval timelines. When regulators, sponsors, and vendors jointly define what “trustworthy automation” looks like, the industry moves from experimentation toward standardized practice. Building regulatory confidence is therefore not the endpoint of collaboration—it is the proof that collaboration works.
8. Use cases and demonstrated impact
Respondents identified several domains where unstructured health data already add measurable value to clinical research. These use cases illustrate both immediate benefit and the pathway toward scalable adoption.
8.1 Recruitment and Feasibility Enhancement. Recruitment remains one of the largest cost drivers in clinical research, and unstructured data can help identify patients meeting complex eligibility criteria not captured in structured fields. Chris Harrison explained, “Anything that can help satisfy qualification for patients on trial is really important. Eligibility is clearly one of the immediate priorities for us.”
By surfacing comorbidities, prior treatments, and disease status in clinical notes, investigators can broaden recruitment pools and reduce screening failures. Beyond recruitment, unstructured sources also improve protocol feasibility by revealing real-world variation in practice. Adriano Garcez noted, “Screening failure is responsible for a big share of early discovery falling apart. Looking at unstructured data upstream, before the trial starts, can make feasibility assessments much more accurate.”
Integrating these insights early in trial design helps sponsors anticipate whether inclusion criteria are realistic, minimizing costly protocol amendments. Together, recruitment and feasibility use cases demonstrate how natural-language and AI tools can align scientific ambition with operational feasibility—saving time, expanding diversity, and improving patient matching.
8.2 Safety and Endpoint Automation. Safety monitoring and imaging endpoints are among the most promising early applications of unstructured data. Felix Nensa emphasized, “Clinical notes are one of the few near real-time data sources we have, and they can capture side effects the moment they occur.” Because all safety signals are ultimately confirmed through pharmacovigilance, this use case carries lower regulatory risk and provides immediate clinical value.
Imaging and pathology data are equally transformative. Steve Tolle observed, “We are now at a point where some AI technology can do true volumetric measurements of lesions over time. That could change how tumor response is calculated and open imaging to automated use in trials.” Automating extraction of lesion measurements and biomarkers could streamline oncology endpoints, improve consistency, and reduce manual abstraction.
These safety- and imaging-related domains show how AI can complement—not replace—clinical oversight. Early integration of validated automation into trial workflows enhances patient safety while accelerating endpoint evaluation.
8.3 Demonstrated Impact and Early Wins. Across all cases, the pattern is clear: unstructured data already deliver tangible gains in efficiency, data richness, and patient safety. Recruitment is faster and more inclusive, feasibility assessments are more reliable, safety monitoring becomes near real-time, and imaging endpoints are more consistent.
These early wins demonstrate that eSource is not a distant aspiration but an operational reality when technical and governance foundations align. Each successful pilot builds confidence among sites, sponsors, and regulators—laying the groundwork for broader adoption across therapeutic areas. Unstructured data, once viewed as an obstacle, are rapidly becoming a strategic asset for clinical research.
9. Strategic outlook: From pilot to scale
Respondents agreed that using unstructured data in eSource is no longer a question of if, but when and how. Pilots already show tangible value in recruitment, feasibility, safety monitoring, and imaging. The challenge now is scaling these isolated successes into sustainable, repeatable models across institutions, therapeutic areas, and regulatory environments.
9.1 Scaling Continuous and Hybrid Models. Large AMCs are likely to pioneer continuous validation pipelines, embedding AI-enabled structuring into clinical workflows. Yet this model will not be feasible everywhere. Smaller hospitals and community sites often lack dedicated informatics teams and must rely on vendor-supported or sponsor-enabled solutions. Federated validation services—where algorithms are maintained centrally but deployed locally—offer a scalable approach for diverse settings.
Richard Yeatman emphasized, “We’re not going to see one universal model. These tools will need to adapt to site-specific data and workflows, with federated approaches to ease the burden on hospitals.” Hybrid pipelines that combine AI/NLP extraction with FHIR for real-time exchange and OMOP for analytical harmonization represent the most pragmatic way forward.
This dual-track model—innovation led by AMCs, scalability supported by vendors—ensures that eSource progress is inclusive. It turns structuring unstructured data from a research experiment into an operational service model, balancing automation with human oversight while maintaining regulatory-grade traceability.
9.2 Governance, Regulation, and Collaboration. Scaling also requires clear regulatory guidance and transparent collaboration. Chris Harrison observed, “Validation steps have to ensure traceability, but we also need regulators to signal how far we can go with AI-derived outputs.” Early engagement with agencies will define acceptable documentation and performance monitoring for AI-assisted workflows.
Collaboration remains the cornerstone of trust. Joeri Holtzem stressed, “This only works if we bring the EHR vendors into the discussion as well. Standards are important, but adoption depends on EHR systems allowing this in practice.” Joint task forces like i~HD’s eSource initiative provide the neutral ground for regulators, vendors, and sponsors to align.
Safety monitoring stands out as an early demonstration of this alignment. Felix Nensa highlighted, “Clinical notes are one of the few near real-time data sources we have, and they can capture side effects the moment they occur.” Because all signals are confirmed through pharmacovigilance, safety use cases combine patient protection with technical feasibility—a “low-regret” entry point for AI validation in trials. Together, regulatory dialogue, vendor collaboration, and shared validation templates can transform today’s patchwork of pilots into a cohesive, trusted ecosystem.
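A toy sketch of the kind of note triage that could surface candidate adverse-event mentions in near real time is shown below. The term list and tokenization are illustrative assumptions only; validated systems use clinical NLP with negation and context handling, and, as the panelists stress, every flagged signal is still confirmed downstream by pharmacovigilance review.

```python
# Toy keyword triage for near real-time adverse-event (AE) surfacing in
# clinical notes. The watch list is an illustrative placeholder; flagged
# signals would always be routed to pharmacovigilance for confirmation.
AE_TERMS = {"rash", "nausea", "neutropenia"}

def flag_possible_ae(note: str) -> set:
    # Tokenize crudely, strip trailing punctuation, intersect with watch list.
    tokens = {word.strip(".,;:").lower() for word in note.split()}
    return AE_TERMS & tokens

flags = flag_possible_ae("Patient reports new rash and mild nausea.")
```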
9.3 The Road Ahead: From Pilots to Continuous Learning Systems. The next decade will determine whether eSource-enabled trials can evolve into continuous learning systems. Adriano Garcez explained, “Screening failure is responsible for a big share of early discovery falling apart. Looking at unstructured data upstream, before the trial starts, can make feasibility assessments much more accurate.”
Sarah Burge added, “We should prefer commodity AI that does 90% well, rather than chasing shiny novelty,” emphasizing scalability and alignment with clinical workflows. Beyond improving data flows, eSource creates synergies across feasibility, recruitment, and trial execution—linking study-design optimization, automated patient identification, and structured EHR-to-EDC data transfer into a single continuous learning cycle.
In the near term (three to five years), structuring of medication data and AI-based adverse-event detection will expand, supported by FHIR, OMOP, and RxNorm frameworks. By the early 2030s, continuous normalization and provenance tracking may reduce manual data abstraction entirely, bringing trial execution closer to real-time.
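As a minimal sketch of the medication-structuring step, the example below extracts a drug mention from free text and maps it to an RxNorm identifier via a local lookup. The regular expression, lookup table, and RxCUI value are illustrative assumptions; a production system would query the RxNorm API or a local vocabulary copy, and unmapped mentions would be routed to human review.

```python
import re

# Toy medication structuring: free-text mention -> dose-normalized record
# with an RxNorm code. The lookup table and RxCUI below are illustrative
# placeholders, not an authoritative mapping.
RXNORM_LOOKUP = {("metformin", "500", "mg"): "861007"}  # hypothetical entry

MED_PATTERN = re.compile(r"(?P<drug>[a-z]+)\s+(?P<dose>\d+)\s*(?P<unit>mg)\b", re.I)

def structure_medication(text: str):
    match = MED_PATTERN.search(text)
    if match is None:
        return None  # nothing extractable; falls back to human abstraction
    key = (match["drug"].lower(), match["dose"], match["unit"].lower())
    return {
        "drug": key[0],
        "dose": f"{key[1]} {key[2]}",
        "rxcui": RXNORM_LOOKUP.get(key),  # None -> route to human review
    }

med = structure_medication("Continue metformin 500 mg twice daily.")
```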
By 2035, eSource could operate as a continuous evidence ecosystem, where structured and unstructured data flow seamlessly between care and research. Unstructured data—once a barrier—will become a validated source of clinical insight, enabling trials that are faster, safer, and more representative of real-world care.
10. Conclusion
This study confirms that the transformation of unstructured health data into regulatory-grade, research-ready evidence is both technically feasible and institutionally dependent. Across stakeholders, there is broad consensus that AI-assisted structuring, combined with standards such as FHIR and OMOP, can substantially reduce manual effort and improve data quality. Early successes in safety monitoring, medication data, and imaging endpoints demonstrate immediate value while providing templates for broader application.
However, scaling eSource beyond pilots requires sustained investment in validation frameworks, site readiness, and cross-sector collaboration. Hospitals need governance and workflow support; sponsors and vendors must coordinate around transparent validation templates; and regulators should continue clarifying expectations for AI-derived data. These elements together will determine how quickly unstructured data can move from technical possibility to operational norm.
As validation frameworks mature, continuous and federated approaches—where performance monitoring, provenance tracking, and quality assurance occur in real time—will become central to trust. Ultimately, the integration of unstructured data will enable continuous evidence generation, linking care and research in a single feedback loop. When achieved, eSource will evolve from a digitization strategy into the foundation of a learning health ecosystem, accelerating discovery while improving patient outcomes.
Mats Sundgren, European Institute for Innovation through Health Data (i~HD); Sarah Burge, Cambridge University Hospitals NHS Foundation Trust; Lars Fransson, AstraZeneca; Adriano Garcez, ZS Associates; Joeri Holtzem, Johnson & Johnson; Gabi Maeztu, IOMED; Thomas Metcalfe, Evidentli; Steve Tolle, IgniteData; Felix Nensa, University Hospital of Essen; and Joe Lengfellner, MSK
Acknowledgements: The authors thank Chris Harrison (AstraZeneca) and Richard Yeatman (IgniteData) for their valuable time and insights during the interviews for this study.
References
1. Kalankesh, L.R.; Monaghesh, E. Utilization of EHRs for Clinical Trials: A Systematic Review. BMC Med. Res. Methodol. 2024. 24 (70). https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-024-02177-7
2. Sundgren, M.; Mistry, R.; Maeztu, G. Harnessing Unstructured Data and Hospital Interoperability. Applied Clinical Trials. 2024. 36 (9), 22-26. https://www.appliedclinicaltrialsonline.com/view/harnessing-unstructured-data-and-hospital-interoperability
3. Ammour, N.; Griffon, N.; Djadi-Prat, J.; et al. TransFAIR Study: A European Multicenter Experimental Comparison of EHR2EDC Technology to the Usual Manual Method for eCRF Data Collection. BMJ HCI. 2023. 30 (1). https://informatics.bmj.com/content/30/1/e100602
4. Tufts CSDD Impact Report. 2022. https://www.clinicaltrialvanguard.com/wp-content/uploads/2024/08/Mar-Apr-2022-Clinical-Trial-Budgets.pdf
5. Sedlakova J, Sloup P, Rohn K, Sedlak P. Challenges and best practices for digital unstructured data enrichment in health research. BMC Med Res Methodol. 2023; 23:222. doi:10.1186/s12874-023-02022-4.
6. Sundgren M, Andrews L, Burge S, Bush M, Fritsche A, Nensa F, and Lengfellner J. (2025) Scaling eSource-Enabled Clinical Trials: Challenges, Opportunities, and Strategic Outlook for Oncology Research Centers. Applied Clinical Trials. 37, 22-28. https://www.appliedclinicaltrialsonline.com/view/scaling-esource-enabled-clinical-trials-hospital-perspectives
7. Hamidi M, Eisenstein E.L, Garza M.Y, Morales K.J.T., Edwards E.M, Rocca M, et al. (2024). Source Data Verification (SDV) quality in clinical research: A scoping review. J Clin Transl Sci. May 21;8(1):e101. doi:10.1017/cts.2024.551
8. Ehidiamen, A.J.; Oladapo, O.O. Innovative Approaches to Risk Management in Clinical Research: Balancing Ethical Standards, Regulatory Compliance and Intellectual Property Concerns. World Journal of Biology Pharmacy and Health Sciences. 2024. 20 (1), 349-363. doi:10.30574/wjbphs.2024.20.1.0791
9. Adamson B, Waskom M., Blarre A, Kelly J, Krismer K, Nemeth S, et al. (2023) Approach to machine learning for extraction of real-world data variables from electronic health records. Frontiers in Pharmacology. 2023 Sep 15; 14:1180962. doi:10.3389/fphar.2023.1180962
10. Chakrabarty N, Mahajan A. (2024) Imaging Analytics using Artificial Intelligence in Oncology: A Comprehensive Review. Clinical Oncology (R Coll Radiol). Aug;36(8):498-513. DOI: 10.1016/j.clon.2023.09.013.
11. International Data Corporation (IDC). The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things. EMC/IDC Digital Universe Study. Framingham, MA: IDC; 2014.
12. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014; 2:3. doi:10.1186/2047-2501-2-3.
13. Zhao X, et al. Integrating real-world data to accelerate and guide drug development: A clinical pharmacology perspective. Clin Transl Sci. 2022 Oct;15(10):2293-2302. doi:10.1111/cts.13391.
14. Seinen, T.M.; Kors, J.A.; van Mulligen, E.M.; Rijnbeek, P.R. Using Structured Codes and Free-Text Notes to Measure Information Complementarity in Electronic Health Records: Feasibility and Validation Study. J Med Internet Res. 2025. 27, e66910. doi:10.2196/66910
15. Névéol, A., Dalianis, H., Velupillai, S., Savova, G., & Zweigenbaum, P. (2018). Clinical Natural Language Processing in languages other than English: opportunities and challenges. Journal of Biomedical Semantics, 9(1), 12. https://doi.org/10.1186/s13326-018-0179-8
16. Qiu, X., Zhang, H., Lin, J., & Gao, J. (2024). Towards Multilingual Clinical Natural Language Processing: Benchmarks, Resources, and Models. Journal of Biomedical Informatics, 150, 104720. https://doi.org/10.1016/j.jbi.2024.104720