AI outperforms doctors in summarizing health records, study shows


In a recent study published in the journal Nature Medicine, an international team of scientists identified the best-performing large language models and adaptation methods for summarizing large volumes of clinical electronic health record data and compared the performance of these models with that of medical experts.

Study: Adapted large language models can outperform medical experts in clinical text summarization. Image Credit: takasu / Shutterstock


A laborious but essential part of medical practice is documenting patient health records, which contain progress reports, diagnostic tests, and treatment history across specialists. Clinicians often spend a substantial portion of their time compiling vast amounts of textual data, and even for very experienced physicians, this process can introduce errors that lead to serious medical and diagnostic problems.

The transition from paper records to electronic health records seems only to have expanded the clinical documentation workload, and reports suggest that clinicians each spend approximately two hours documenting the clinical data from their interactions with a single patient. Nurses spend close to 60% of their time on clinical documentation, and the time demands of this process often cause considerable stress and burnout, decreasing job satisfaction among clinicians and eventually leading to worse patient outcomes.

Large language models are a promising option for summarizing clinical data. However, although these models have been evaluated on general natural language processing tasks, their efficiency and accuracy in summarizing clinical data have not been evaluated extensively.

About the study

In the present study, the researchers evaluated eight large language models across four clinical summarization tasks, namely, patient questions, radiology reports, dialogue between doctor and patient, and progress notes.

They first used quantitative natural language processing metrics to determine which model and adaptation method performed the best across the four summarization tasks. Ten physicians then conducted a clinical reader study where they compared the best summaries from the large language models with those from medical experts along parameters such as conciseness, correctness, and completeness.
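To make the idea of a quantitative summarization metric concrete, here is a minimal sketch of a ROUGE-L style score, which measures the longest common subsequence of words shared by a model summary and a reference summary. The article does not name the metrics the researchers used, so the choice of metric, the whitespace tokenization, and the function names `lcs_length` and `rouge_l_f1` are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L style F1 score using lowercase whitespace tokens.

    This is a simplified illustration, not the study's evaluation code.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate words matched in order
    recall = lcs / len(ref)      # share of reference words recovered in order
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    model_summary = "no acute process"
    expert_summary = "no acute cardiopulmonary process"
    print(round(rouge_l_f1(model_summary, expert_summary), 3))
```

Word-overlap scores of this kind reward matching phrasing rather than clinical correctness, which is one reason a reader study with physicians is a useful complement.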

Finally, the researchers carried out a safety analysis to identify challenges, such as the fabrication of information and the potential for medical harm, that arise when clinical data are summarized by either medical experts or large language models.

The eight large language models spanned two broad language-generation approaches: autoregressive and sequence-to-sequence (seq2seq) models. Seq2seq models use an encoder-decoder architecture that maps an input to an output, so training them requires paired datasets. These models perform efficiently in tasks such as summarization and machine translation.

Autoregressive models, on the other hand, do not require paired datasets and are suited to tasks such as dialogue, question answering, and text generation. The study evaluated open-source autoregressive and seq2seq large language models as well as some proprietary autoregressive models. It also evaluated two techniques for adapting general-purpose, pre-trained large language models to domain-specific tasks.
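The difference in training data between the two approaches can be sketched in a few lines. This is a simplified illustration that assumes whitespace tokenization; the function names `seq2seq_example` and `autoregressive_examples` are hypothetical and do not come from the study.

```python
def seq2seq_example(source, target):
    """A seq2seq training example needs an explicit (input, output) pair:
    the encoder reads the source and the decoder learns to emit the target."""
    return {
        "encoder_input": source.split(),
        "decoder_target": target.split(),
    }


def autoregressive_examples(text):
    """An autoregressive model learns from unpaired text: every prefix of
    the sequence becomes a context for predicting the next token."""
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    pair = seq2seq_example(
        source="mild cardiomegaly with no focal consolidation or effusion",
        target="mild cardiomegaly",
    )
    print(pair["encoder_input"], "->", pair["decoder_target"])

    for context, next_token in autoregressive_examples("no acute findings"):
        print(context, "->", next_token)
```

In practice, an autoregressive model can still be used for summarization by placing the source text and an instruction in the context and letting the model continue with the summary.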

The four tasks used to evaluate the large language models were: summarizing radiology reports from the detailed findings of radiology analyses; condensing questions from patients into short queries; producing a list of medical problems and diagnoses from progress notes; and summarizing interactions between doctor and patient into a paragraph on the assessment and plan.


The results showed that 45% of the summaries from the best-adapted large language models were equivalent to, and 36% were superior to, those from medical experts. Furthermore, in the clinical reader study, the large language model summaries scored higher than the medical expert summaries on all three parameters: conciseness, correctness, and completeness.

The scientists also found that ‘prompt engineering’, the process of tuning or modifying the input prompts, greatly improved the performance of the model. This was especially apparent for conciseness, where prompts instructing the model to summarize patient questions into queries of a specific word count helped condense the information meaningfully.
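As a rough illustration of a word-count instruction in a prompt, the sketch below builds a summarization prompt with an optional word limit and checks a returned summary against it. The prompt wording, the function names `build_summary_prompt` and `within_word_limit`, and the example text are hypothetical; the article does not reproduce the prompts used in the study.

```python
def build_summary_prompt(text, task_instruction, max_words=None):
    """Assemble a plain-text summarization prompt.

    The wording is an assumption for illustration, not the study's prompt.
    """
    parts = [task_instruction.strip()]
    if max_words is not None:
        if max_words <= 0:
            raise ValueError("max_words must be a positive integer")
        # An explicit limit gives the model a concrete conciseness target.
        parts.append(f"Use at most {max_words} words.")
    parts.append(f"Input: {text.strip()}")
    parts.append("Summary:")
    return "\n".join(parts)


def within_word_limit(summary, max_words):
    """Return True if the summary respects the requested word limit."""
    return len(summary.split()) <= max_words


if __name__ == "__main__":
    question = (
        "I have had a dry cough for two weeks and it gets worse at night. "
        "Should I be worried, and what can I take for it?"
    )
    prompt = build_summary_prompt(
        question,
        "Summarize the patient question as a short query.",
        max_words=15,
    )
    print(prompt)
    print(within_word_limit("What treats a two-week dry cough that worsens at night?", 15))
```

Leaving `max_words` unset produces a prompt with no length target, which mirrors the vaguer kind of prompt that offers the model no guidance on conciseness.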

Radiology reports were the one task where the large language model summaries were less concise than those of medical experts. The scientists suggested that this could be due to the vagueness of the input prompt, since the prompts for summarizing radiology reports did not specify a word limit. They also believe that incorporating checks from other large language models or model ensembles, as well as from human operators, can greatly improve the accuracy of this process.


Overall, the study found that large language models summarized data from patient health records as well as or better than medical experts did. Most of these large language models scored higher than human operators on the natural language processing metrics, summarizing the data concisely, correctly, and completely. With further modifications and improvements, this process could potentially be implemented to help clinicians save valuable time and improve patient care.

Journal reference:
  • Van Veen, D., Van Uden, C., Blankemeier, L., Delbrouck, J., Aali, A., Bluethgen, C., Pareek, A., Polacin, M., Reis, E. P., Seehofnerová, A., Rohatgi, N., Hosamani, P., Collins, W., Ahuja, N., Langlotz, C. P., Hom, J., Gatidis, S., Pauly, J., & Chaudhari, A. S. (2024). Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine. DOI: 10.1038/s41591-024-02855-5

Written by Dr. Chinta Sidharthan

Chinta Sidharthan is a writer based in Bangalore, India. Her academic background is in evolutionary biology and genetics, and she has extensive experience in scientific research, teaching, science writing, and herpetology. Chinta holds a Ph.D. in evolutionary biology from the Indian Institute of Science and is passionate about science education, writing, animals, wildlife, and conservation. For her doctoral research, she explored the origins and diversification of blindsnakes in India, as a part of which she did extensive fieldwork in the jungles of southern India. She has received the Canadian Governor General’s bronze medal and Bangalore University gold medal for academic excellence and published her research in high-impact journals.


Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Sidharthan, Chinta. (2024, February 28). AI outperforms doctors in summarizing health records, study shows. News-Medical. Retrieved on April 14, 2024.

  • MLA

    Sidharthan, Chinta. "AI outperforms doctors in summarizing health records, study shows". News-Medical. 14 April 2024.

  • Chicago

    Sidharthan, Chinta. "AI outperforms doctors in summarizing health records, study shows". News-Medical. (accessed April 14, 2024).

  • Harvard

    Sidharthan, Chinta. 2024. AI outperforms doctors in summarizing health records, study shows. News-Medical, viewed 14 April 2024.


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of News Medical.