Despite rapid advances in large language models, this study shows that human expertise remains critical for producing rigorous systematic reviews, with AI best suited as a supervised support tool rather than an independent author.
A recent study published in the journal Scientific Reports reveals that human researchers perform better than large language models (LLMs) in preparing systematic literature reviews.
What are LLMs?
LLMs are advanced artificial intelligence (AI) systems that use deep learning methods to analyze vast amounts of input data and generate human-like language. Since the introduction of OpenAI’s ChatGPT in 2022, LLMs have gained significant public attention for their ability to perform a wide range of everyday tasks, including text generation, language translation, email writing, and much more.
LLMs have become an integral part of the healthcare, education, and research sectors due to their ability to both interpret and generate text. In fact, several studies have demonstrated that LLMs such as GPT-4 and BERT can perform a wide range of medical tasks, including annotation of ribonucleic acid (RNA) sequencing data, content summarization, and medical report drafting.
In scientific research, LLMs have been utilized for literature screening and summarization, data analysis, and report generation. Despite their immense potential to accelerate scientific processes, the responsible integration of LLMs into the healthcare, education, and research domains requires a comprehensive analysis of potential challenges, including ensuring data consistency, mitigating biases, and maintaining transparency in their applications.
Study design
To elucidate the risks and benefits of integrating LLMs into key scientific areas, the current study investigated whether LLMs outperform human researchers in conducting systematic literature reviews. To this end, six different LLMs were used to perform literature searches, article screening and selection, data extraction and analysis, and the final drafting of the systematic review.
All outcomes were compared with the original systematic review written by human researchers on the same topic. The entire process was performed twice to evaluate changes and improvements in the LLMs between model versions over time.
Key findings and significance
In the first task, which comprised literature search and selection, Gemini performed best, retrieving 13 of the 18 scientific articles included in the original systematic review produced by human researchers. Nevertheless, significant limitations were observed in the LLMs' ability to perform key tasks, including literature search, data summarization, and final manuscript drafting.
These limitations likely reflect the lack of access that many LLMs have to electronic databases for scientific articles. Additionally, the training datasets used for these models may contain relatively few original research articles, which further reduces their accuracy.
Despite unsatisfactory performance on the first task, the LLMs retrieved several appropriate articles more quickly than human researchers. Their speed could therefore be leveraged for initial literature screening, alongside the standard cross-searching of databases and reference lists by human researchers, as sketched below.
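To make this screening role concrete, the following minimal sketch shows how an LLM, or any text-completion callable, could be used to flag records for subsequent human review. The inclusion criteria, the record fields, and the keyword-based stand-in model are all hypothetical and are not taken from the study; a real pipeline would plug in an actual LLM client and keep a human reviewer as the final arbiter.

```python
# Illustrative sketch only: a minimal title/abstract screening loop.
# Criteria, records, and the stand-in "model" are hypothetical.
from typing import Callable, Dict, List

INCLUSION_CRITERIA = (
    "Include only peer-reviewed studies reporting diagnostic imaging outcomes "
    "in adult patients; exclude reviews, editorials, and animal studies."
)

def build_screening_prompt(record: Dict[str, str]) -> str:
    """Assemble a single-record screening prompt from the fixed criteria."""
    return (
        f"Criteria: {INCLUSION_CRITERIA}\n\n"
        f"Title: {record['title']}\n"
        f"Abstract: {record['abstract']}\n\n"
        "Answer with exactly one word, INCLUDE or EXCLUDE."
    )

def screen_records(records: List[Dict[str, str]],
                   llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Return the subset of records the model marks INCLUDE, for human verification."""
    shortlisted = []
    for record in records:
        answer = llm(build_screening_prompt(record)).strip().upper()
        if answer.startswith("INCLUDE"):
            shortlisted.append(record)
    return shortlisted

if __name__ == "__main__":
    # Trivial stand-in "model" so the sketch runs without an API key:
    # it includes any record whose text mentions imaging.
    fake_llm = lambda prompt: "INCLUDE" if "imaging" in prompt.lower() else "EXCLUDE"
    demo = [
        {"title": "PET/CT in staging", "abstract": "A prospective imaging study in adults."},
        {"title": "Mouse model review", "abstract": "A narrative review of animal data."},
    ]
    print([r["title"] for r in screen_records(demo, fake_llm)])
```

In such a setup the model only produces a shortlist; the human cross-search of databases and reference lists described above remains the authoritative step.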
In the second task, data extraction and analysis, DeepSeek performed best, with 93% of entries correct overall and fully correct extraction for seven of the 18 original articles. Only three LLMs completed this task satisfactorily, and even these required lengthy, complex prompts and multiple uploads to produce results, suggesting low time-efficiency relative to human work.
In the third task, final manuscript drafting, none of the tested LLMs achieved satisfactory performance. The models generated short, uninspiring manuscripts that did not fully adhere to the standard template for a systematic review.
At the same time, the generated articles appeared well structured and used correct scientific language, a surface polish that could mislead non-expert readers. Since systematic reviews and meta-analyses are considered the gold standard in evidence-based medicine, critical evaluation of the published literature by human experts remains essential to guide clinical practice effectively.
Conclusions
Modern LLMs cannot produce a systematic review in the medical domain without dedicated prompt-engineering strategies. Nevertheless, the improvements observed between the two rounds of evaluation indicate that, with appropriate supervision, LLMs can provide valuable support for certain aspects of the review process. In this context, recent evidence suggests that guided prompting strategies, such as knowledge-guided prompting, can enhance LLM performance on several review tasks.
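As an illustration of what such guided prompting can look like in practice, the sketch below scaffolds a data-extraction prompt with explicit domain knowledge (here, PICO-style field definitions) so the model answers into a fixed schema rather than free text. The schema, field definitions, and parsing logic are hypothetical and do not reproduce the specific knowledge-guided method cited by the study.

```python
# Illustrative sketch of one possible guided-prompting pattern for data extraction.
# The field guide and JSON schema are hypothetical, not the study's method.
import json
from typing import Callable, Dict

FIELD_GUIDE = {
    "population": "Who was studied (e.g., adults with a suspected condition).",
    "intervention": "Index test or intervention evaluated.",
    "comparator": "Reference standard or control, if any.",
    "outcome": "Primary outcome measure reported.",
}

def build_guided_prompt(article_text: str) -> str:
    """Embed field definitions in the prompt so the model fills a fixed schema."""
    guide = "\n".join(f"- {name}: {definition}" for name, definition in FIELD_GUIDE.items())
    return (
        "Extract the following fields from the article. Use the definitions below "
        "and reply with a JSON object whose keys are exactly the field names.\n"
        f"{guide}\n\nArticle:\n{article_text}"
    )

def extract_fields(article_text: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Run one guided-extraction call and parse the JSON reply."""
    reply = llm(build_guided_prompt(article_text))
    return json.loads(reply)

if __name__ == "__main__":
    # Stand-in "model" returning a fixed JSON answer so the sketch runs offline.
    fake_llm = lambda prompt: json.dumps(
        {"population": "adults", "intervention": "PET/CT",
         "comparator": "histology", "outcome": "diagnostic accuracy"}
    )
    print(extract_fields("Example article text.", fake_llm))
```

Constraining the output to a predefined schema is one way to reduce the lengthy back-and-forth prompting reported in the extraction task, although the extracted entries would still need verification against the source articles by a human reviewer.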
The current study included a single systematic review in the medical domain as a reference for comparison, which may restrict the generalizability of these findings to other scientific domains. Thus, future studies are needed to evaluate multiple systematic reviews across diverse biomedical and non-biomedical domains to improve robustness and external validity.
Journal reference:
- Sollini, M., Pini, C., Lazar, A., et al. (2025). Human researchers are superior to large language models in writing a medical systematic review in a comparative multitask assessment. Scientific Reports. DOI: 10.1038/s41598-025-28993-5. https://www.nature.com/articles/s41598-025-28993-5