ChatGPT shines in medical summary task, struggles with field-specific relevance


In a recent study published in The Annals of Family Medicine, researchers evaluated how effectively Chat Generative Pretrained Transformer (ChatGPT) summarizes medical abstracts, with the aim of giving physicians concise, accurate, and unbiased summaries amid rapidly expanding clinical knowledge and limited time for review.

Study: Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts. Image Credit: PolyPloiid / Shutterstock


In 2020, nearly a million new journal articles were indexed by PubMed, reflecting a global body of medical knowledge now estimated to double every 73 days. This growth, coupled with clinical practice models that prioritize productivity, leaves physicians little time to keep up with the literature, even in their own specialties. Artificial intelligence (AI) and natural language processing offer promising tools to address this challenge. Large language models (LLMs) such as ChatGPT, which can generate, summarize, and predict text, have gained attention for their potential to help physicians review medical literature efficiently. However, LLMs can produce misleading, non-factual text ("hallucinations") and may reflect biases in their training data, raising concerns about their responsible use in healthcare.

About the study 

In the present study, researchers selected 10 articles from each of 14 journals, spanning a broad range of medical topics, article structures, and journal impact factors. They aimed to include diverse study types while excluding non-research materials. All selected articles were published in 2022, after ChatGPT's training-data cutoff in 2021, eliminating the possibility that the model had prior exposure to their content.

The researchers then tasked ChatGPT with summarizing these articles, having the model self-assess each summary for quality, accuracy, and bias, and rate its relevance across ten medical fields. Summaries were limited to 125 words, and data on the model's performance were collected in a structured database.
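The workflow described here — prompting the model for a summary capped at 125 words and checking the result against that cap — can be sketched roughly as follows. The prompt wording and the helper names (`build_summary_prompt`, `within_word_limit`) are illustrative assumptions, not the study's actual code.

```python
def build_summary_prompt(abstract: str, word_limit: int = 125) -> str:
    """Construct an instruction asking a language model for a bounded summary.

    The exact wording is a guess at the kind of prompt the study used;
    only the 125-word cap comes from the article.
    """
    return (
        f"Summarize the following medical abstract in no more than "
        f"{word_limit} words. Be concise, accurate, and unbiased.\n\n"
        f"Abstract:\n{abstract}"
    )


def within_word_limit(summary: str, word_limit: int = 125) -> bool:
    """Check a returned summary against the study's length cap."""
    return len(summary.split()) <= word_limit
```

In practice the prompt would be sent to the model's API and the returned summary validated with `within_word_limit` before being stored alongside the self-assessment scores.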

Physician reviewers independently evaluated the ChatGPT-generated summaries, assessing them for quality, accuracy, bias, and relevance with a standardized scoring system. Their review process was carefully structured to ensure impartiality and a comprehensive understanding of the summaries' utility and reliability.

The study conducted detailed statistical and qualitative analyses to compare the performance of ChatGPT summaries against human assessments. This included examining the alignment between ChatGPT's article relevance ratings and those assigned by physicians, both at the journal and article levels. 

Study results 

The study utilized ChatGPT to condense 140 medical abstracts from 14 diverse journals, predominantly featuring structured formats. The abstracts, on average, contained 2,438 characters, which ChatGPT successfully reduced by 70% to 739 characters. Physicians evaluated these summaries, rating them highly for quality and accuracy and demonstrating minimal bias, a finding mirrored in ChatGPT's self-assessment. Notably, the study observed no significant variance in these ratings when comparing across journals or between structured and unstructured abstract formats.
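The reported 70% reduction follows directly from the average character counts; a quick check using the figures from the article:

```python
# Averages reported in the study: abstract length vs. ChatGPT summary length.
avg_abstract_chars = 2438
avg_summary_chars = 739

reduction = 1 - avg_summary_chars / avg_abstract_chars
print(f"Average reduction: {reduction:.0%}")  # → Average reduction: 70%
```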

Despite the high ratings, the team did identify some instances of serious inaccuracies and hallucinations in a small fraction of the summaries. These errors ranged from omitted critical data to misinterpretations of study designs, potentially altering the interpretation of research findings. Additionally, minor inaccuracies were noted, typically involving subtle aspects that did not drastically change the abstract's original meaning but could introduce ambiguity or oversimplify complex outcomes.

A key component of the study was examining ChatGPT's capability to recognize the relevance of articles to specific medical disciplines. The expectation was that ChatGPT could accurately identify the topical focus of journals, aligning with predefined assumptions about their relevance to various medical fields. This hypothesis held true at the journal level, with a significant alignment between the relevance scores assigned by ChatGPT and those by physicians, indicating ChatGPT's strong ability to grasp the overall thematic orientation of different journals.

However, when evaluating the relevance of individual articles to specific medical specialties, ChatGPT's performance was less impressive, showing only a modest correlation with human-assigned relevance scores. This discrepancy highlighted a limitation in ChatGPT's ability to accurately pinpoint the relevance of singular articles within the broader context of medical specialties despite a generally reliable performance on a broader scale.

Further analyses, including sensitivity and quality assessments, revealed a consistent distribution of quality, accuracy, and bias scores across individual and collective human reviews as well as those conducted by ChatGPT. This consistency suggested effective standardization among human reviewers and aligned closely with ChatGPT's assessments, indicating a broad agreement on the summarization performance despite the challenges identified.


To summarize, the study's findings indicated that ChatGPT effectively produced concise, accurate, and low-bias summaries, suggesting its utility for clinicians in quickly screening articles. However, ChatGPT struggled with accurately determining the relevance of articles to specific medical fields, limiting its potential as a digital agent for literature surveillance. Acknowledging limitations such as its focus on high-impact journals and structured abstracts, the study highlighted the need for further research. It suggests that future iterations of language models may offer improvements in summarization quality and relevance classification, advocating for responsible AI use in medical research and practice.

Journal reference:
  • Joel Hake, Miles Crowley, Allison Coy, et al. Quality, Accuracy, and Bias in ChatGPT-Based Summarization of Medical Abstracts. The Annals of Family Medicine (2024). DOI: 10.1370/afm.3075

Written by

Vijay Kumar Malesu

Vijay holds a Ph.D. in Biotechnology and possesses a deep passion for microbiology. His academic journey has allowed him to delve deeper into understanding the intricate world of microorganisms. Through his research and studies, he has gained expertise in various aspects of microbiology, including microbial genetics, microbial physiology, and microbial ecology. Vijay has six years of scientific research experience at renowned research institutes such as the Indian Council of Agricultural Research and KIIT University. He has worked on diverse projects in microbiology, biopolymers, and drug delivery. His contributions to these areas have provided him with a comprehensive understanding of the subject matter and the ability to tackle complex research challenges.


Please cite this article as follows:

    Kumar Malesu, Vijay. (2024, March 28). ChatGPT shines in medical summary task, struggles with field-specific relevance. News-Medical.


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of News Medical.