A new review of randomized trials shows that while ChatGPT-style AI can improve education and trust in digestive care, hard evidence from real patients remains in short supply.
Study: Randomized controlled trials evaluating large language models in digestive diseases: a scoping review.
A new scoping review published in Gastroenterology & Endoscopy examines how randomized controlled trials are being used to test large language models, such as ChatGPT, in the diagnosis and management of digestive diseases, revealing both early promise and significant evidence gaps.
Why digestive disease is testing AI cautiously
Digestive disorders cause illness and even death for billions of people around the world. Their diagnosis often follows a long and complex path, and treatment is similarly intricate, integrating clinical, imaging, and tissue-level data obtained through biopsy. Artificial intelligence (AI) could shorten this diagnostic journey and streamline care, potentially improving accuracy and enhancing patient communication.
Large language models (LLMs), such as ChatGPT, are a type of AI that processes vast volumes of text data to produce human-like language outputs. Since the public release of ChatGPT in late 2022, LLMs have made extensive inroads into all forms of communication.
In healthcare, LLMs may help impart medical education and enhance the speed and quality of diagnosis, patient education, and treatment. They could also make documentation and administrative tasks easier.
However, LLMs currently suffer from potentially dangerous flaws, including hallucinations, incoherent or inaccurate text, and unreliable outputs; biased algorithms that lead to inequitable decision-making; and issues related to data privacy and security. This makes it all the more important to prove that their use in healthcare is safe, improves actual patient outcomes, and delivers better, less expensive, and safer care.
Unlike medical education or technical tasks such as classification, actual patient care demands hard evidence from randomized controlled trials (RCTs) before the contribution of LLMs can be judged.
The current study is a scoping review of RCTs that assessed LLMs in the diagnosis and treatment of digestive disorders using actual patient data. It examined which tasks were performed by which LLMs or algorithms, the trial designs used, and the kinds of results reported. In contrast, most earlier studies stopped at examining the role of LLMs in improving medical knowledge in this field, such as the score achieved on medical licensing examination questions.
A global snapshot of digestive AI trials
The study included 14 RCTs, either ongoing or published, of which only four involved real patients. The trials largely took place in China and the United States, and were mostly confined to single centers, with a median sample size of 258. Most dealt only with gastrointestinal diseases, and several with hepatobiliary conditions.
Five areas of healthcare were identified in this field:
- Making clinical decisions
- Patient communication
- Health-related communication
- Medical education
- Patient education
Natural language processing (NLP) tasks examined in this study included:
- Classification
- Conversations with the patient
- Answering questions
- Summarizing or simplifying information
The outcome most often measured related to managing the patient’s care, with several focusing on the patient experience or on professional competence.
AI boosts trust and education, not outcomes
The study found that a variety of models were used in this field, both general-purpose LLMs like ChatGPT and those designed for a specific domain. The latter included ScreenTalk, an AI application designed to promote colorectal cancer screening among individuals whose first-degree relatives had the condition, and the Voice-Assisted Remote Symptom Monitoring System (VARSMS), which helps patients undergoing surgery for gut cancers during their postoperative period.
Domain-specific LLMs may outperform general-purpose heavyweight models, such as GPT-4, in certain areas, notably tasks other than answering (especially open-ended) questions, generating summaries, or simplifying data.
This could lead to the development of more specific, computationally efficient, medical LLMs for each task rather than increasingly powerful general-purpose models. Even now, multimodal LLMs that make use of many different sources and types of patient data are being evaluated through RCTs to provide more well-rounded recommendations, promoting precision medicine.
The most frequently encountered study design compared LLM-assisted care with unassisted approaches in terms of the selected outcome. A few compared clinician care or routine clinical care with LLM-assisted care. Most ongoing trials focused on patient education and clinical decision-making.
The researchers found that LLMs were mostly used to help make clinical decisions and to educate patients. In an ongoing trial, an LLM called GutGPT was developed to assist in the care of patients with upper gastrointestinal bleeding.
GutGPT generated care recommendations based on accepted guidelines, combining risk-estimation modeling with LLM-based guidance on clinical decisions. It was tested in a two-phase RCT, whose interim results were included in the current review. Overall, LLMs improved patient trust in and acceptance of healthcare technology and enhanced understanding of medical educational content.
NLP was mainly used to answer questions. Further study should identify the utility of these tools for the other tasks.
The small number of studies on this topic, coupled with the preliminary nature of many of them, limits generalizability and relevance, emphasizing the need for future research. Additionally, several RCTs did not report following established reporting guidelines.
Bias risk was assessed to be significant due to randomization flaws, failure to follow protocol, imperfect outcome measurements, and result reporting bias.
However, prior studies suggest that RCTs are key to identifying the real value of LLMs in medicine. Beyond simply offering medical education and answering questions, LLMs could also handle tasks such as evaluating participants’ knowledge, summarizing documents, drafting responses, or identifying topics that require further research.
Future RCTs evaluating medical LLM use should address how LLM performance is to be systematically measured and when an LLM-based intervention is appropriate. They should also broaden the scope of LLMs in digestive disorders, reduce bias, secure proper ethical and regulatory approvals, and report real patient outcomes.
The optimal scenario is one where both patients and providers are satisfied, while workload is reduced and clinical outcomes are improved.
Multicenter trials needed before clinical adoption
LLMs hold promise for the management of digestive diseases. This promise should be validated by international multicenter RCTs that focus on actual patient outcomes.