Can Large Language Models Accurately Assess Risk of Bias in Randomized Clinical Trials?


Survey study highlights the challenge of delivering timely systematic reviews of clinical trials without sacrificing quality.

Image Credit: © putilov_denis

Systematic reviews are a key element of clinical research. While these reviews must be completed in a timely manner, there is also a crucial demand for quality. Balancing the two can be a challenge, especially when considering risk of bias (ROB) and the process of objectively assessing methodological flaws.1

A survey study recently published in JAMA Network Open examined whether ROB in randomized clinical trials (RCTs) can be assessed reliably and efficiently. In particular, the study evaluated how accurately large language models (LLMs) assess ROB.

“LLMs may facilitate the labor-intensive process of systematic reviews. However, the exact methods and reliability remain uncertain,” the study authors wrote.

The study used two LLMs, ChatGPT and Claude, to assess ROB in 30 RCTs selected from systematic reviews. Each RCT was assessed twice by both LLMs. To gauge the validity of the LLMs, the results were then compared with assessments of the same RCTs by three experts. The survey study was conducted between August 10, 2023, and October 30, 2023.

A modified version of the Cochrane ROB tool, developed by the CLARITY group at McMaster University, was used to create the prompt given to the LLMs. The tool has 10 domains: random sequence generation; allocation concealment; blinding of patients, health care clinicians, data collectors, outcome assessors, and data analysts; missing outcome data; selective outcome reporting; and other concerns.

The results showed that both models achieved high correct assessment rates. According to the authors, ChatGPT reached a mean correct assessment rate of 84.5% (95% CI, 81.5%-87.3%), and Claude reached a rate of 89.5% (95% CI, 87.0%-91.8%). The consistency rates between the two assessments of each trial were 84.0% for ChatGPT and 87.3% for Claude.
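For readers who want to see how rates like these are typically calculated, the following is a minimal sketch, not the study authors' code. It assumes each judgment is a simple "low" or "high" label, that the correct assessment rate is the share of model judgments matching the expert judgment, and that the consistency rate is the share of judgments that agree between a model's two runs. The function names and the sample labels are hypothetical.

```python
def correct_rate(model_labels, expert_labels):
    """Share of judgments where the model matches the expert reference."""
    if len(model_labels) != len(expert_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(m == e for m, e in zip(model_labels, expert_labels))
    return matches / len(expert_labels)


def consistency_rate(first_run, second_run):
    """Share of judgments that are identical across two runs of one model."""
    if len(first_run) != len(second_run):
        raise ValueError("label lists must have the same length")
    same = sum(a == b for a, b in zip(first_run, second_run))
    return same / len(first_run)


# Hypothetical judgments for five domain-level items (illustrative only)
expert = ["low", "low", "high", "low", "high"]
run_1 = ["low", "low", "high", "high", "high"]
run_2 = ["low", "high", "high", "high", "high"]

print(correct_rate(run_1, expert))     # 0.8
print(correct_rate(run_2, expert))     # 0.6
print(consistency_rate(run_1, run_2))  # 0.8
```

The article does not describe how the confidence intervals were derived, so the sketch leaves them out.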

“In this survey study, we established a structured and feasible prompt that was capable of guiding LLMs in assessing the ROB in RCTs. The LLMs used in this study produced assessments that were very close to those of experienced human reviewers,” the authors said of the study outcome. “Automated tools in systematic reviews exist but are underused due to difficult operation, poor user experience, and unreliable results. In contrast, both LLMs had high accessibility and user friendliness, demonstrating outstanding reliability and efficiency, thereby showing substantial potential for facilitating systematic review production.”

In most domains of the Cochrane tool, domain-specific correct rates ranged from around 80% to 90%. However, sensitivity below 0.80 was observed in domains 1 (random sequence generation), 2 (allocation concealment), and 6 (other concerns).
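Sensitivity is conventionally the proportion of reference-positive cases that a method also labels positive. The sketch below shows that calculation under a stated assumption: a "high" risk judgment by the experts is treated as the positive class. The article does not say which class the study treated as positive, and the sample labels are hypothetical.

```python
def sensitivity(model_labels, expert_labels, positive="high"):
    """True positives divided by all reference-positive cases."""
    pairs = list(zip(model_labels, expert_labels))
    true_pos = sum(m == positive and e == positive for m, e in pairs)
    false_neg = sum(m != positive and e == positive for m, e in pairs)
    if true_pos + false_neg == 0:
        raise ValueError("no reference-positive cases to evaluate")
    return true_pos / (true_pos + false_neg)


# Hypothetical judgments for one domain across six trials (illustrative only)
expert = ["high", "high", "high", "high", "low", "low"]
model = ["high", "high", "high", "low", "low", "high"]

print(sensitivity(model, expert))  # 0.75, which would fall below a 0.80 threshold
```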

“To our knowledge, this study is the first to transparently explore the feasibility of applying LLMs to the assessment of ROB in RCTs. The study addressed multiple aspects of the feasibility of LLM use, including accuracy, consistency, and efficiency. A detailed and structured prompt was proposed and performed commendably in practical application,” the authors wrote. “Our findings preliminarily suggest that with an appropriate prompt, LLM 1 (ChatGPT) and LLM 2 (Claude) can be used alongside the modified Cochrane tool to assess the ROB of RCTs accurately and efficiently.”

Among the limitations of the study, the authors noted that the LLMs may not be capable of completing the review independently. However, giving the LLMs access to links to external sources may strengthen their ability. This feature was available only in a beta testing version during the study, so the authors did not use it.

“In this survey study of the application of LLMs to the assessment of ROB in RCTs, we found that LLM 1 (ChatGPT) and LLM 2 (Claude) achieved commendable accuracy and consistency when directed by a structured prompt,” the authors concluded. “By scrutinizing the rationale provided and comparing multiple assessments across different models, researchers were able to efficiently identify and correct nearly all errors.”
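The cross-checking the authors describe, comparing multiple assessments across different models, can be pictured as a simple disagreement filter. The sketch below is an illustration under stated assumptions, not the authors' workflow: it assumes each domain has a list of judgments collected from several models or runs, and it flags any domain where those judgments differ so that a human reviewer can check the stated rationale. The function name and sample data are hypothetical.

```python
def flag_for_review(assessments):
    """Return the domains whose judgments are not unanimous.

    `assessments` maps a domain name to the list of judgments
    gathered from different models or repeated runs.
    """
    return [
        domain
        for domain, judgments in assessments.items()
        if len(set(judgments)) > 1
    ]


# Hypothetical judgments: two models, two runs each (illustrative only)
assessments = {
    "random sequence generation": ["low", "low", "low", "low"],
    "allocation concealment": ["low", "high", "low", "low"],
    "missing outcome data": ["high", "high", "high", "high"],
}

print(flag_for_review(assessments))  # ['allocation concealment']
```

Unanimous judgments can still be wrong, so a filter like this narrows where reviewers look first and does not replace expert review.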


1. Lai H, Ge L, Sun M, et al. Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models. JAMA Netw Open. 2024;7(5):e2412687. doi:10.1001/jamanetworkopen.2024.12687

© 2024 MJH Life Sciences

All rights reserved.