ChatGPT shines in medical summary task, struggles with field-specific relevance

By Vijay Kumar Malesu. Reviewed by Susha Cheriyedath, M.Sc. Mar 28, 2024

In a recent study published in The Annals of Family Medicine, researchers evaluated how well Chat Generative Pretrained Transformer (ChatGPT) summarizes medical abstracts. The aim was to determine whether concise, accurate, and unbiased summaries could help physicians keep pace with rapidly expanding clinical knowledge despite limited review time.

Study: Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts. Image Credit: PolyPloiid / Shutterstock

Background

In 2020, nearly a million new journal articles were indexed by PubMed, and global medical knowledge is reported to double every 73 days. This growth, coupled with clinical models that prioritize productivity, leaves physicians little time to keep up with the literature, even in their own specialties. Artificial Intelligence (AI) and natural language processing offer promising tools to address this challenge. Large Language Models (LLMs) like ChatGPT, which can generate text, summarize, and predict, have gained attention for their potential to help physicians review medical literature efficiently. However, LLMs can "hallucinate", producing misleading, non-factual text, and may reflect biases from their training data, raising concerns about their responsible use in healthcare.

About the study

In the present study, researchers selected 10 articles from each of 14 journals, covering a broad range of medical topics, article structures, and journal impact factors. They aimed to include diverse study types while excluding non-research materials. All selected articles were published in 2022; because ChatGPT had been trained on data available until 2021, this ensured the model had no prior exposure to the content.

The researchers then tasked ChatGPT with summarizing these articles, self-assessing the summaries for quality, accuracy, and bias, and evaluating their relevance across ten medical fields. They limited summaries to 125 words and collected data on the model's performance in a structured database.
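The 125-word cap and the structured record kept for each response can be pictured with a short sketch. The record fields and the helper name `within_word_limit` are hypothetical, chosen only for illustration; the article does not describe the study's actual database schema or scoring scales.

```python
from dataclasses import dataclass

WORD_LIMIT = 125  # summary length cap reported in the article


@dataclass
class SummaryRecord:
    """Hypothetical record layout; the study's real schema is not given."""
    journal: str
    summary: str
    quality: int   # self-assessed score (scale assumed, not from the study)
    accuracy: int
    bias: int


def within_word_limit(text: str, limit: int = WORD_LIMIT) -> bool:
    """Return True when the whitespace-delimited word count is at most `limit`."""
    return len(text.split()) <= limit


record = SummaryRecord(
    journal="Example Journal",  # placeholder, not a journal from the study
    summary="This placeholder summary stands in for a model response.",
    quality=0, accuracy=0, bias=0,  # placeholder values, not study data
)
print(within_word_limit(record.summary))
```

Counting whitespace-separated tokens is only one way to define a "word"; the study may have counted differently.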

Physician reviewers independently evaluated the ChatGPT-generated summaries, assessing them for quality, accuracy, bias, and relevance with a standardized scoring system. Their review process was carefully structured to ensure impartiality and a comprehensive understanding of the summaries' utility and reliability.

The researchers conducted detailed statistical and qualitative analyses to compare ChatGPT's ratings of its own summaries against the physicians' assessments. This included examining the alignment between ChatGPT's article relevance ratings and those assigned by physicians, both at the journal and article levels.

Study results

The study used ChatGPT to condense 140 medical abstracts, most of them in structured formats, from 14 diverse journals. The abstracts averaged 2,438 characters, which ChatGPT reduced by 70% to 739 characters. Physicians rated these summaries highly for quality and accuracy and found minimal bias, a finding mirrored in ChatGPT's self-assessment. Notably, these ratings did not vary significantly across journals or between structured and unstructured abstract formats.
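As a quick check on the reported compression, the arithmetic can be reproduced from the two character counts given above. This is only a sketch using the averages quoted in this article; the variable names are illustrative.

```python
original_chars = 2438  # mean abstract length reported in the article
summary_chars = 739    # mean summary length reported in the article

# Percentage by which the summary is shorter than the original abstract.
reduction_pct = (1 - summary_chars / original_chars) * 100
print(f"{reduction_pct:.1f}% shorter")  # prints "69.7% shorter"
```

The result rounds to the 70% figure quoted in the text.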

Despite the high ratings, the team identified serious inaccuracies and hallucinations in a small fraction of the summaries. These errors ranged from omitted critical data to misinterpretations of study designs, potentially altering the interpretation of research findings. Minor inaccuracies were also noted, typically involving subtle aspects that did not drastically change the abstract's original meaning but could introduce ambiguity or oversimplify complex outcomes.

A key component of the study was examining ChatGPT's capability to recognize the relevance of articles to specific medical disciplines. The expectation was that ChatGPT could accurately identify the topical focus of journals, aligning with predefined assumptions about their relevance to various medical fields. This hypothesis held true at the journal level, with a significant alignment between the relevance scores assigned by ChatGPT and those by physicians, indicating ChatGPT's strong ability to grasp the overall thematic orientation of different journals.

However, when evaluating the relevance of individual articles to specific medical specialties, ChatGPT's performance was less impressive, showing only a modest correlation with human-assigned relevance scores. This discrepancy highlighted a limitation in ChatGPT's ability to pinpoint the relevance of individual articles to particular specialties, despite its generally reliable performance at the journal level.
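One way such agreement between model-assigned and physician-assigned relevance scores could be quantified is a rank correlation. The sketch below uses Spearman's coefficient as an assumption; the article does not say which statistic the researchers used, and the scores shown are invented for illustration, not study data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical relevance scores for six articles (not from the study).
model_scores = [5, 4, 4, 2, 1, 3]
physician_scores = [4, 5, 2, 3, 1, 1]
rho = spearman(model_scores, physician_scores)
print(f"Spearman rho = {rho:.2f}")
```

A coefficient near 1 would indicate that the model and the physicians rank articles in nearly the same order, while values closer to 0 indicate weak agreement.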

Further analyses, including sensitivity and quality assessments, revealed a consistent distribution of quality, accuracy, and bias scores across individual and collective human reviews as well as those conducted by ChatGPT. This consistency suggested effective standardization among human reviewers and aligned closely with ChatGPT's assessments, indicating a broad agreement on the summarization performance despite the challenges identified.

Conclusions

To summarize, the study's findings indicated that ChatGPT effectively produced concise, accurate, and low-bias summaries, suggesting its utility for clinicians in quickly screening articles. However, ChatGPT struggled to determine the relevance of articles to specific medical fields, limiting its potential as a digital agent for literature surveillance. Acknowledging limitations such as its focus on high-impact journals and structured abstracts, the study highlighted the need for further research. It suggested that future iterations of language models may offer improvements in summarization quality and relevance classification, and it advocated responsible AI use in medical research and practice.

Journal reference:

Joel Hake, Miles Crowley, Allison Coy, et al. Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts, The Annals of Family Medicine (2024), DOI: 10.1370/afm.3075, https://www.annfammed.org/content/22/2/113

Citation

Kumar Malesu, Vijay. (2024, March 28). ChatGPT shines in medical summary task, struggles with field-specific relevance. News-Medical. Retrieved on August 21, 2025 from https://www.news-medical.net/news/20240328/ChatGPT-shines-in-medical-summary-task-struggles-with-field-specific-relevance.aspx.