This study evaluates the use of an AI-supported medical coding module.
Accurate and consistent reporting of medical terms in clinical trials is imperative to ensure data integrity and human subjects protection as required by global regulatory agencies. The safety and efficacy of a medical intervention are contingent upon accurate assessment of benefits and harms during a clinical study. To ensure uniformity, subject matter experts have historically coded medical terms manually in data capture systems, using dictionaries such as the Medical Dictionary for Regulatory Activities (MedDRA).1 However, given the increasing complexity of trial protocols and the volume and depth of data collected, manual coding has become untenable. Its challenges include the deep expertise required, variable linguistic definitions, the large MedDRA vocabulary and its frequent updates, inter-coder subjectivity, and a labor-intensive, time-consuming process. These factors are compounded by an aging population in which as many as four in ten adults have multiple chronic diseases, which amplifies the need for scalable, automated medical coding and adds to the complexity of applying medical codes accurately and efficiently. Since the transition to ICD-10 coding, a wealth of literature has emerged from both industry and academic sources on the implementation and utilization of computer-assisted coding (CAC).2
The integration of health information technology into clinical workflows has great potential to streamline processes and introduce efficiencies into manual review tasks such as medical coding. CAC has been integrated into clinical workflows since the mid-1990s.3 More widespread CAC applications have been available since the early 2000s, and their adoption has increased dramatically in recent years with the push toward electronic health records (EHRs) and the implementation of new coding nomenclatures such as ICD-10 and, most recently, ICD-11.4 CAC automates the coding process by assigning diagnoses, procedures, and adverse events from electronic sources of clinical documentation using natural language processing (NLP) and machine learning (ML).
Many studies and organizations have demonstrated the use and efficacy of CAC for coding clinical data efficiently and accurately. For example, one review of 38 journal articles, published dissertations, and case studies found that CAC demonstrated measurable value in improving clinical coding accuracy and data quality and in catching errors missed during manual medical coding.4-5 However, the results of these studies were difficult to generalize because no common ground truth was available to measure the effectiveness of CAC implementations across different clinical coding workflows.3 In another study, researchers conducted a systematic literature review of over 113 published tools to examine the utility and added value of clinical coding modules and found that these systems hold promise but must be considered in context, as their performance is relative to the complexity of the task and the desired outcomes.6 In one such system evaluation, an NLP approach was used to map SNOMED CT to ICD-10 diagnosis codes, achieving 54.1% sensitivity and 70.2% positive predictive value.7
CAC coding engines typically combine rules, statistical analyses, and dictionaries to extract clinical facts and surface insights from EHR data, which are then validated by coding professionals. CAC is usually implemented in a way that modifies the review process, giving coders the option to accept or reject codes suggested by the artificial intelligence (AI)-supported coding system. This computer-assisted approach to coding has become progressively more common and has been credited with measurable gains in data quality and coder productivity.8-9 CAC systems are typically deployed either as traditional software purchased from a vendor and installed on a dedicated server hosted by the client organization, or as “software as a service” (SaaS). SaaS deployment has become the more common and preferred approach due to lower operating costs.4
The goal of this study is to evaluate and report on the use of an AI-supported medical coding module that is integrated with an end-to-end clinical data management system (CDMS), using real-world data from a large North American contract research organization (CRO). Our group previously published on the advantages of an end-to-end CDMS and bottlenecks in clinical trial data management.10-11 The CRO conducts clinical trials in many disease areas and used the AI-supported coding system to code clinical trial adverse events (AEs).
The CDMS is IBM Clinical Development (IBM CD), a single-instance, multi-user SaaS platform that employs IBM’s AI-supported technology. The medical coding module draws on the MedDRA and World Health Organization Drug Global (WHODrug Global) dictionaries to suggest appropriate codes for terms that are not automatically coded by the user’s pre-existing coding rules. The Medical Coding with AI module formulates coding as a multi-class hierarchical text classification problem, because the MedDRA dictionary has hierarchical semantics. A deep learning multi-task convolutional neural network (CNN) with kernel-filtered convolutional layers, pooling layers, and a fully connected layer is embedded in the medical coding module to rank the predicted top-k lowest level terms (LLTs) with their associated probability scores. The predictive model comprises standard deep neural network (DNN) components, including neurons, hidden layers, drop-out rates, and the softmax activation function. The model is trained on a large and diverse collection of system-generated auto-coding rules to optimize the proposed coding suggestions. This study assesses the technical performance, usability/workflow, and impact of IBM CD’s Medical Coding with AI module, both in general and for a variety of quality and performance metrics specific to the task of coding AEs.
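The paper does not disclose the model’s filter sizes, embedding scheme, or training procedure, but the inference step it describes (convolution over an embedded verbatim, max-over-time pooling, a fully connected layer, and a softmax that ranks the top-k LLTs by probability) can be illustrated with a minimal NumPy sketch. All names, dimensions, and the example LLT labels below are hypothetical, not the production system’s implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over class scores.
    e = np.exp(z - z.max())
    return e / e.sum()

def conv1d_maxpool(x, kernels):
    # x: embedded verbatim, shape (seq_len, emb_dim).
    # kernels: list of filters, each of shape (width, emb_dim).
    # Each filter is slid over the sequence; max-over-time pooling
    # keeps the strongest response per filter.
    feats = []
    for w in kernels:
        k = w.shape[0]
        responses = [np.sum(x[i:i + k] * w) for i in range(len(x) - k + 1)]
        feats.append(max(responses))
    return np.array(feats)

def rank_top_k_llts(x, kernels, W, b, llt_names, k=3):
    """Score all LLT classes and return the top-k (name, probability) pairs."""
    h = conv1d_maxpool(x, kernels)   # pooled feature vector
    probs = softmax(h @ W + b)       # fully connected layer + softmax
    order = np.argsort(probs)[::-1][:k]
    return [(llt_names[i], float(probs[i])) for i in order]

# Toy example with random weights and hypothetical LLT labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))                         # embedded verbatim
kernels = [rng.normal(size=(n, 8)) for n in (2, 3, 4)]
W = rng.normal(size=(3, 5))
b = np.zeros(5)
llts = ["Headache", "Dizziness", "Nausea", "Fatigue", "Vertigo"]
top = rank_top_k_llts(x, kernels, W, b, llts, k=3)
```

In a trained system the weights would come from the multi-task training on auto-coding rules described above; here they are random, so only the ranking mechanics (not the suggestions) are meaningful.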
In this mixed-methods study, the CRO’s system usage data from April 9, 2020, through April 7, 2021, were retrospectively analyzed to compare both the number of searches required to arrive at an approved code and the code rejection rate, with and without AI coding system support. Semi-structured qualitative interviews were conducted with system users via recorded phone conference calls to assess perceived system accuracy and efficiency and how any time savings were repurposed. Informed consent was obtained from each interview participant a priori. Using an open coding approach, interviews were transcribed, de-identified, and analyzed. Thematic analysis combined deductive review of the direct responses to interview questions with inductive analysis to identify emerging patterns. Our study goal was to assess the utility of the coding module and its perceived efficiency, accuracy, and usability among trained clinical coders using the system.
We extracted data from IBM CD, analyzed the distribution of AI-supported coding module utilization, and provide descriptive statistics for the study period. We report the number of searches by coding method (manual versus AI-supported), the proportion of searches by coding method, and code rejection rates.
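The descriptive statistics above can be illustrated with a short sketch. The record structure and field names below are assumptions for illustration, not the actual IBM CD export schema.

```python
from collections import defaultdict

# Hypothetical coding-event records: one row per approved or rejected
# code, tagged with the coding method and the number of searches the
# coder performed before approval.
events = [
    {"method": "manual", "searches": 1, "rejected": False},
    {"method": "manual", "searches": 3, "rejected": False},
    {"method": "ai",     "searches": 0, "rejected": False},
    {"method": "ai",     "searches": 0, "rejected": True},
    {"method": "ai",     "searches": 2, "rejected": False},
]

def summarize(events):
    """Per-method share of codes, mean searches, and rejection rate."""
    stats = defaultdict(lambda: {"n": 0, "searches": 0, "rejected": 0})
    for e in events:
        s = stats[e["method"]]
        s["n"] += 1
        s["searches"] += e["searches"]
        s["rejected"] += int(e["rejected"])
    total = sum(s["n"] for s in stats.values())
    return {
        m: {
            "share_of_codes": s["n"] / total,
            "mean_searches": s["searches"] / s["n"],
            "rejection_rate": s["rejected"] / s["n"],
        }
        for m, s in stats.items()
    }

summary = summarize(events)
```

With real usage logs, the same aggregation yields the search-count distributions and rejection rates reported in the Results.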
The study team conducted semi-structured telephone interviews lasting 60 minutes. All interview participants had a minimum of 7 years of experience in clinical trial medical coding and data management, were currently managing data in one or more clinical trials, and also had a supervisory role over more junior data managers. Conversations were directed via a semi-structured interview guide (see Appendix) to explore participants’ workflow and responsibilities, experience using the tool, and perception of the tool’s impact on their work. Each 60-minute interview was conducted remotely via a secure online meeting platform and audio recorded for transcription into NVivo,12 a qualitative analysis software tool. The data were de-identified and examined using a grounded theory-informed approach to identify relevant themes.
This study was submitted for review by the Western Institutional Review Board (WIRB), which determined that the study is exempt from human subjects regulations under 45 C.F.R. § 46.104(d)(2).
During the study period, a total of 7,965 approved codes (32% AI-supported) from 14 endocrinology clinical trials (79% Phase I, 21% closed) conducted by a single CRO were assessed. Figure 1 below shows cumulative searches by coding method, with manual coding in blue and AI-supported coding in orange. One important feature of the AI-supported coding module is that codes for medical terms are proposed upon verbatim selection. This study assesses those verbatims that were not automatically coded by the system and required manual review. When cumulative searches by coding method were compared, codes for 4.12% of manually coded AEs were approved after a single search, while codes for 72.27% of AI-supported AEs were approved after 0 searches, rising to 81% after only 2 searches (Figure 1).
Figure 2 below shows the proportion of searches by coding method. Manually coded AEs required more searches, with 56% approved after 3 searches and 80% after 4. Code rejection rates for the two coding methods were similar (1.58% manual, 2.43% AI-supported).
Semi-structured interviews lasting 60 minutes were conducted via phone with coding system users as described above (n=3). Interviewed users stated the AI coding system improved efficiency and decreased work effort while maintaining requisite accuracy. Additionally, the coding module integration to the IBM CDMS was deemed intuitive and seamless.
Participants’ work responsibilities spanned many data-intensive steps, from creation and maintenance of databases to training and managing medical coding teams. Each participant manually reviewed the items which did not receive automatic coding for their studies on a weekly basis. Participants also identified data as “the critical element” of their work, with several directly stating accuracy as their paramount goal in the data management process, even at the cost of increased time and effort. As one participant noted:
“We want to be as accurate as possible. ‘Cause the regulatory agencies are the ultimate end users. And the sponsors that are trying to get their drugs approved.”
The premium placed on data quality directly benefits regulatory agencies, which rely heavily on the accuracy and validity of clinical trial evidence to inform their policy decisions. Given this priority on data quality, even the small portion of coders’ duties spent on manual coding and review can be time-intensive and mentally taxing. The focus needed to ensure completeness and representativeness of the coding, as well as the large number of MedDRA codes that must be considered simultaneously and holistically, contribute to their cognitive load.
“Take for example, surgery for a broken bone subsequent to a fall. Is the fall related to a prior, existing medical condition? Is the fall the important part or the broken bone? Or the surgery? Adverse event—Did the trial drug make them dizzy and they fell?”
The subsequent findings of this qualitative study are consistent with this overarching goal of data accuracy and further explain the priorities of the coding staff.
When asked to discuss their experiences using the AI-supported tool for manual coding, three overarching themes emerged: supportive interface, effort, and time.
All participants described the tool as intuitive to use and well-integrated into the existing system interface. They were able to review and refine the codes without the need to navigate to another window or a separate application.
This is consistent with our goal in building the interface to “suggest” rather than rank or direct the coder. AI-supported systems are sometimes incorrect, and it was our goal to maintain the user’s perception that they remained in control. The system provided suggestions rather than definitive codes in an effort to avoid being overly persuasive, which could lead to rework or reduce user trust in the system.
Participants also noted the tool helped them by providing a curated list of potential labels, which significantly decreased the effort needed to interpret the medical text and determine the correct MedDRA code. The high underlying accuracy made disagreements with the system exceptionally rare, with each participant recalling three or fewer instances of disagreement. Additionally, even when participants did disagree with the suggested code, the tool still provided the benefit of a limited search space, reducing the number of alternatives they had to consider.
Finally, participants discussed how the combination of intuitive usability and reliable accuracy resulted in substantial time savings. Every participant estimated the tool cut their time spent manually coding by half or more. This was a valuable saving, as one participant noted: “All the time we don’t spend coding is the value.” Again, the coders saw the accuracy of the data as their highest priority, so this capacity to save time is explicitly dependent on the system’s ability to perform as desired.
In this study we demonstrate user interaction with the system and its usability. We learned that users’ impression of system accuracy is highly relevant to system use and implementation. The interview data we present provide a real-world example of the implementation of AI-supported medical coding at one CRO. In future studies with other partner organizations, we will broaden our inclusion criteria for interviews and/or surveys to assess whether the findings presented here generalize to a less-senior workforce. In addition, the quantitative portion of the study focused solely on adverse events in endocrinology trials, narrowing the repertoire of verbatims to a specific disease area. However, it should be noted that the AI-supported medical coding module has been trained on broad use of IBM CD across over 1,500 clinical trials in a wide range of treatment settings and phases.
AI systems integrated into coding workflows can increase code approval efficiency while maintaining accuracy and user satisfaction. Our study suggests that an AI-supported medical coding module can aid in selecting the most appropriate codes for AE detection. As machine learning and NLP approaches improve CAC capabilities, the medical coding workflow will continue to evolve. In future studies we will evaluate implementation at a larger scale and across more disease conditions and clinical trials.
We acknowledge the generosity of the interview participants and support of the CRO we partnered with to obtain study data.
Brett R. South, MS, PhD; IBM Watson Health, Cambridge, MA, USA, Courtney B. VanHouten, MA; IBM Watson Health, Cambridge, MA, USA, Van Willis, PhD; IBM Watson Health, Cambridge, MA, USA, Walker Bradham; IBM Watson Health, Cambridge, MA, USA, Jennifer Duff, MBA; IBM Watson Health, Cambridge, MA, USA, Rezzan Hekmat, MS; IBM Watson Health, Cambridge, MA, USA, Jane L. Snowdon, PhD; IBM Watson Health, Cambridge, MA, USA, Dilhan Weeraratne, PhD; IBM Watson Health, Cambridge, MA, USA