Human researchers still outperform AI when it comes to writing trustworthy systematic reviews

Despite rapid advances in large language models, this study shows that human expertise remains critical for producing rigorous systematic reviews, with AI best suited as a supervised support tool rather than an independent author.


A recent study published in the journal Scientific Reports reveals that human researchers perform better than large language models (LLMs) in preparing systematic literature reviews.

What are LLMs?

LLMs are advanced artificial intelligence (AI) systems that use deep learning methods to analyze vast amounts of input data and generate human-like language. Since the introduction of OpenAI’s ChatGPT in 2022, LLMs have gained significant public attention for their ability to perform a wide range of everyday tasks, including text generation, language translation, and email drafting.

LLMs have become an integral part of the healthcare, education, and research sectors due to their ability to both interpret and generate text. In fact, several studies have demonstrated that LLMs such as GPT-4 and BERT can perform a wide range of medical tasks, including annotation of ribonucleic acid (RNA) sequencing data, content summarization, and medical report drafting.

In scientific research, LLMs have been utilized for literature screening and summarization, data analysis, and report generation. Despite their immense potential to accelerate scientific processes, the responsible integration of LLMs into the healthcare, education, and research domains requires a comprehensive analysis of potential challenges, including ensuring data consistency, mitigating biases, and maintaining transparency in their applications.

Study design

To elucidate the risks and benefits of integrating LLMs into key scientific areas, the current study investigated whether LLMs outperform human researchers in conducting systematic literature reviews. To this end, six different LLMs were used to perform literature searches, article screening and selection, data extraction and analysis, and the final drafting of the systematic review.

All outcomes were compared with the original systematic review written by human researchers on the same topic. The entire process was performed twice to evaluate between-version changes and improvements in the LLMs over time.

Key findings and significance  

In the first task, which involved literature search and selection, the LLM Gemini performed best, selecting 13 of the 18 scientific articles included in the original systematic review produced by human researchers. Nevertheless, significant limitations were observed in the LLMs' ability to perform key tasks, including literature search, data summarization, and final manuscript drafting.

These limitations likely reflect the limited access that many LLMs have to electronic databases of scientific articles. Additionally, the training datasets used for these models may contain relatively few original research articles, which further reduces their accuracy.

Despite unsatisfactory performance on the first task, LLMs extracted several appropriate articles more quickly than human researchers. Thus, the time-efficiency of LLMs could be leveraged for initial literature screening, alongside the standard cross-searching of databases and references by human researchers, along the lines of the sketch below.
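
As a concrete illustration of what such LLM-assisted screening might look like, here is a minimal sketch, assuming the OpenAI Python client; the model name, prompt wording, and inclusion criteria are illustrative placeholders rather than the study's actual protocol.

```python
# Minimal sketch of LLM-assisted title/abstract screening.
# Assumptions: the OpenAI Python client, a placeholder model name,
# and illustrative inclusion criteria -- none are from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = (
    "Include only peer-reviewed original research articles that "
    "report a medical intervention in human subjects."
)

def screen_abstract(title: str, abstract: str) -> bool:
    """Ask the model for a binary include/exclude decision."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat-capable model would do
        messages=[
            {"role": "system",
             "content": "You screen articles for a systematic review. "
                        f"Criteria: {CRITERIA} Answer INCLUDE or EXCLUDE."},
            {"role": "user",
             "content": f"Title: {title}\nAbstract: {abstract}"},
        ],
        temperature=0,  # favor reproducible decisions
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("INCLUDE")

# Every LLM decision would still be verified by a human reviewer,
# consistent with the study's conclusion that supervision is essential.
```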

In the second task, covering data extraction and analysis, the LLM DeepSeek performed best, with 93% of entries correct overall and fully correct entries for seven of the 18 original articles. Three LLMs performed satisfactorily on this task, although they required slow, complex prompts and multiple uploads to obtain results, suggesting low time-efficiency relative to human work.
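
The data-extraction step can be imagined along the following lines: the model is asked to return a predefined set of fields as JSON, which can then be compared against human-extracted reference values. The sketch below again assumes the OpenAI Python client; the field names, prompt, and scoring are hypothetical and not the protocol used in the study.

```python
# Illustrative sketch of LLM-based data extraction into structured fields.
# The field list, prompt, and scoring are hypothetical, not the study's protocol.
import json
from openai import OpenAI

client = OpenAI()

FIELDS = ["first_author", "publication_year", "sample_size", "primary_outcome"]

def extract_fields(article_text: str) -> dict:
    """Request the predefined fields as a JSON object and parse the reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                "Extract the following fields from the article below and "
                f"return ONLY a JSON object with keys {FIELDS}. "
                "Use null for anything not reported.\n\n" + article_text
            ),
        }],
        response_format={"type": "json_object"},  # constrain output to JSON
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

def field_accuracy(extracted: dict, reference: dict) -> float:
    """Share of fields matching the human-extracted reference values."""
    return sum(extracted.get(k) == reference.get(k) for k in FIELDS) / len(FIELDS)
```

Averaging such a per-field accuracy across all included articles would yield an overall correctness figure analogous to the 93% reported for DeepSeek.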

In the third task, final manuscript drafting, none of the tested LLMs achieved satisfactory performance. Specifically, the LLMs generated short, uninspiring manuscripts that did not fully adhere to the standard template for a systematic review.

At the same time, the tested LLMs generated articles in a well-structured format and with correct scientific language, a surface polish that could mislead non-expert readers into overestimating their rigor. Since systematic reviews and meta-analyses are considered the gold standard in evidence-based medicine, critical evaluation of the published literature by human experts remains essential to guide clinical practice effectively.

Conclusions

Modern LLMs cannot produce a medical systematic review without dedicated prompt-engineering strategies. Nevertheless, the observed improvements in the LLMs between the two evaluation rounds indicate that, with appropriate supervision, they can provide valuable support for certain aspects of the review process. In this context, recent evidence suggests that guided prompting strategies, such as knowledge-guided prompting, can enhance LLM performance on several review tasks, as illustrated below.
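
To make the idea of knowledge-guided prompting concrete, the sketch below prepends domain knowledge to the task prompt before it is sent to a model; the PICO summary and prompt structure are illustrative assumptions, and the strategies in the cited evidence may differ.

```python
# Sketch of knowledge-guided prompting: domain knowledge is injected
# into the prompt ahead of the task. The PICO summary is a hypothetical
# example, not taken from the study or the cited evidence.

DOMAIN_KNOWLEDGE = (
    "A systematic review question is typically structured with PICO: "
    "Population, Intervention, Comparator, and Outcome. Eligibility "
    "criteria must be applied consistently to every candidate article."
)

def build_prompt(task: str, article_text: str, guided: bool = True) -> str:
    """Optionally prepend domain knowledge to guide the model."""
    knowledge = f"Background knowledge:\n{DOMAIN_KNOWLEDGE}\n\n" if guided else ""
    return f"{knowledge}Task: {task}\n\nArticle:\n{article_text}"

# The same task, with and without the injected knowledge:
plain_prompt = build_prompt("Assess eligibility for the review.", "...", guided=False)
guided_prompt = build_prompt("Assess eligibility for the review.", "...", guided=True)
```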

The current study included a single systematic review in the medical domain as a reference for comparison, which may restrict the generalizability of these findings to other scientific domains. Thus, future studies are needed to evaluate multiple systematic reviews across diverse biomedical and non-biomedical domains to improve robustness and external validity.

Journal reference:
  • Sollini, M., Pini, C., Lazar, A., et al. (2025). Human researchers are superior to large language models in writing a medical systematic review in a comparative multitask assessment. Scientific Reports. DOI: 10.1038/s41598-025-28993-5. https://www.nature.com/articles/s41598-025-28993-5

Written by

Dr. Sanchari Sinha Dutta

Dr. Sanchari Sinha Dutta is a science communicator who believes in spreading the power of science to every corner of the world. She has a Bachelor of Science (B.Sc.) degree and a Master of Science (M.Sc.) in biology and human physiology. Following her Master's degree, Sanchari went on to pursue a Ph.D. in human physiology. She has authored more than 10 original research articles, all of which have been published in world-renowned international journals.

