Can Large Language Models Accurately Assess Risk of Bias in Randomized Clinical Trials?

June 10, 2024

Survey study shows the importance of balancing the timely delivery of systematic reviews of clinical trials with maintaining their quality.

Image Credit: © putilov_denis - stock.adobe.com

Systematic reviews are a key element of clinical research. While these reviews must be completed in a timely manner, there is also a crucial demand for quality. Balancing the two can be a challenge, especially when considering risk of bias (ROB) and the process of objectively assessing methodological flaws.¹

A survey study recently published in JAMA Network Open sought to determine whether ROB in randomized clinical trials (RCTs) can be assessed reliably and efficiently. In particular, the study evaluated how accurately large language models (LLMs) assess ROB.

“LLMs may facilitate the labor-intensive process of systematic reviews. However, the exact methods and reliability remain uncertain,” the study authors wrote.

The study used two LLMs, ChatGPT and Claude, to assess ROB in 30 RCTs selected from systematic reviews. Each RCT was assessed twice by both LLMs, and the results were compared with an assessment by three experts to gauge the validity of the LLMs' assessments. The survey study was conducted between August 10, 2023, and October 30, 2023.

The prompt given to the LLMs was built from a modified version of the Cochrane ROB tool developed by the CLARITY group at McMaster University. The tool has 10 domains: random sequence generation; allocation concealment; blinding of patients, health care clinicians, data collectors, outcome assessors, and data analysts; missing outcome data; selective outcome reporting; and other concerns.

Both models achieved high correct assessment rates. According to the authors, ChatGPT reached a mean correct assessment rate of 84.5% (95% CI, 81.5%-87.3%), and Claude reached a rate of 89.5% (95% CI, 87.0%-91.8%). The consistency rates between the two assessments of each RCT were 84.0% for ChatGPT and 87.3% for Claude.
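
To make these two metrics concrete, here is a minimal Python sketch. It assumes each judgment is a simple label per domain, that the correct assessment rate is the share of LLM judgments matching the expert reference, and that the consistency rate is the share of judgments on which the two runs of the same model agree. The data are made up, the function names are hypothetical, and the Wilson interval is only one common way to put a 95% CI on a proportion; the article does not say which method the authors used.

```python
from math import sqrt


def agreement_rate(first, second):
    """Share of positions where two lists of judgments carry the same label."""
    if len(first) != len(second) or not first:
        raise ValueError("judgment lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical judgments for 10 domains of a single RCT (illustrative only).
expert = ["low", "low", "high", "low", "high", "low", "low", "high", "low", "low"]
run_1 = ["low", "low", "high", "low", "low", "low", "low", "high", "low", "low"]
run_2 = ["low", "high", "high", "low", "low", "low", "low", "high", "low", "low"]

# Correct assessment rate: each LLM run compared with the expert reference.
correct_run_1 = agreement_rate(run_1, expert)
correct_run_2 = agreement_rate(run_2, expert)

# Consistency rate: the two runs of the same model compared with each other.
consistency = agreement_rate(run_1, run_2)

# Interval for run 1, using the count of matching judgments as successes.
matches_run_1 = sum(a == b for a, b in zip(run_1, expert))
ci_low, ci_high = wilson_interval(matches_run_1, len(expert))

print(f"correct rate, run 1: {correct_run_1:.1%}")
print(f"correct rate, run 2: {correct_run_2:.1%}")
print(f"consistency between runs: {consistency:.1%}")
print(f"95% Wilson interval, run 1: {ci_low:.1%} to {ci_high:.1%}")
```

Note that a model can be consistent without being correct: in this toy data the two runs agree on one judgment that both get wrong, which is why the study reports the two rates separately.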

“In this survey study, we established a structured and feasible prompt that was capable of guiding LLMs in assessing the ROB in RCTs. The LLMs used in this study produced assessments that were very close to those of experienced human reviewers,” the authors said of the study outcome. “Automated tools in systematic reviews exist but are underused due to difficult operation, poor user experience, and unreliable results. In contrast, both LLMs had high accessibility and user friendliness, demonstrating outstanding reliability and efficiency, thereby showing substantial potential for facilitating systematic review production.”

In most domains of the modified Cochrane tool, domain-specific correct assessment rates ranged from around 80% to 90%. However, sensitivity below 0.80 was observed in domains 1 (random sequence generation), 2 (allocation concealment), and 6 (other concerns).
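
Sensitivity is a different measure from the overall correct rate, which a short Python sketch can show. It assumes the judgments are reduced to two labels and treats one of them as the "positive" class; which label the authors counted as positive is not stated in this article, so the `positive` parameter, the function name, and the data below are all hypothetical.

```python
def sensitivity(predicted, reference, positive="low"):
    """True positives divided by all reference positives (TP / (TP + FN))."""
    if len(predicted) != len(reference):
        raise ValueError("judgment lists must be of equal length")
    true_pos = sum(p == positive and r == positive
                   for p, r in zip(predicted, reference))
    false_neg = sum(p != positive and r == positive
                    for p, r in zip(predicted, reference))
    if true_pos + false_neg == 0:
        raise ValueError("reference contains no positive cases")
    return true_pos / (true_pos + false_neg)


# Hypothetical judgments for one domain across six RCTs (illustrative only).
reference = ["low", "low", "low", "low", "high", "high"]
predicted = ["low", "low", "low", "high", "high", "low"]

domain_sensitivity = sensitivity(predicted, reference, positive="low")
print(f"sensitivity for this domain: {domain_sensitivity:.2f}")
```

In this toy data the model misses one of the four reference positives, so its sensitivity for the domain falls below the 0.80 threshold mentioned above.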

“To our knowledge, this study is the first to transparently explore the feasibility of applying LLMs to the assessment of ROB in RCTs. The study addressed multiple aspects of the feasibility of LLM use, including accuracy, consistency, and efficiency. A detailed and structured prompt was proposed and performed commendably in practical application,” the authors wrote. “Our findings preliminarily suggest that with an appropriate prompt, LLM 1 (ChatGPT) and LLM 2 (Claude) can be used alongside the modified Cochrane tool to assess the ROB of RCTs accurately and efficiently.”

Among the study's limitations, the authors noted that the LLMs may not be capable of completing the review independently. Giving the LLMs access to links to external sources may strengthen their ability, but this feature was available only in a beta testing version during the study, so the authors did not use it.

“In this survey study of the application of LLMs to the assessment of ROB in RCTs, we found that LLM 1 (ChatGPT) and LLM 2 (Claude) achieved commendable accuracy and consistency when directed by a structured prompt,” the authors concluded. “By scrutinizing the rationale provided and comparing multiple assessments across different models, researchers were able to efficiently identify and correct nearly all errors.”

Reference

1. Lai H, Ge L, Sun M, et al. Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models. JAMA Netw Open. 2024;7(5):e2412687. doi:10.1001/jamanetworkopen.2024.12687
