Large language models perform poorly on routine hospital tasks

A new study finds that large language models (LLMs), used with straightforward prompting, perform poorly on routine number-crunching tasks that hospital administrators depend on every day to track patients and allocate resources. The findings were published this week in the open-access journal PLOS Digital Health by Eyal Klang of the Icahn School of Medicine at Mount Sinai, New York, USA, and colleagues.

Hospitals rely on structured electronic health record (EHR) data to monitor patient counts and resources and to generate administrative reports. These tasks are currently handled by data analysts using programming languages, creating delays when staff need fast answers. AI tools known as large language models, such as GPT-4o and Llama, have been proposed to simplify that process.

In the new study, researchers evaluated nine leading LLMs on two basic administrative tasks: counting patients who met a given condition and filtering records on multiple criteria. The evaluation used data drawn from 50,000 real emergency department visits at the Mount Sinai Health System.
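For context, both task types amount to simple tabular queries of the kind a data analyst would write in a few lines. A minimal sketch in Python with pandas, using hypothetical column names (`disposition`, `age`, `triage_level`) since the study's actual EHR schema is not given here:

```python
import pandas as pd

# Hypothetical emergency-department visit table; the column names are
# illustrative, not the study's actual schema.
visits = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "age": [34, 71, 52, 19, 66],
    "disposition": ["admitted", "discharged", "admitted", "discharged", "admitted"],
    "triage_level": [2, 1, 3, 4, 2],
})

# Task 1: count patients meeting a condition.
admitted_count = int((visits["disposition"] == "admitted").sum())

# Task 2: filter records on multiple criteria.
urgent_seniors = visits[(visits["age"] >= 65) & (visits["triage_level"] <= 2)]

print(admitted_count)       # number of admitted patients
print(len(urgent_seniors))  # patients aged 65+ with triage level 1 or 2
```

Queries like these are trivial for conventional code but, as the study shows, surprisingly unreliable when an LLM is asked to compute them directly from the table text.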

The researchers found that straightforward prompting, in which the model is asked a plain question such as "How many patients in this table were admitted?", produced uniformly poor results across all models. Chain-of-thought reasoning, in which the model is prompted to show step-by-step work before giving an answer, offered only modest improvements that degraded sharply as table size increased. Even GPT-4o, the top-performing model, saw accuracy under chain-of-thought prompting drop from roughly 95% on the smallest datasets to below 60% on larger ones.

A tool-based approach, in which the models were asked to generate code that was then executed, substantially improved accuracy for the most capable models, with GPT-4o and Qwen-2.5-72B achieving near-perfect performance. However, distilled DeepSeek models, optimized for speed and efficiency, struggled even with this approach. One model, Llama-3.1-8B, failed to produce usable output in the majority of trials and was excluded from further analysis.
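The tool-based strategy can be pictured as a two-step loop: the model emits a short code snippet instead of a direct answer, and the host application runs that snippet against the table, so the arithmetic is done by deterministic code rather than by the model itself. A simplified sketch, with the model call stubbed out (the study's actual prompts and execution harness are not described in this article):

```python
import pandas as pd

def fake_llm(question: str) -> str:
    """Stand-in for an LLM call; returns a pandas expression as text.
    A real system would send the question plus the table schema to a
    model such as GPT-4o and receive generated code back."""
    return '(df["disposition"] == "admitted").sum()'

def answer_with_tool(df: pd.DataFrame, question: str):
    # Step 1: ask the model for code rather than a final answer.
    code = fake_llm(question)
    # Step 2: execute the generated code against the table.
    # (A production system would sandbox this step; eval() is for
    # illustration only.)
    return eval(code, {"df": df})

df = pd.DataFrame({"disposition": ["admitted", "discharged", "admitted"]})
print(answer_with_tool(df, "How many patients were admitted?"))  # 2
```

Because the count is computed by executed code, its accuracy no longer depends on the model reading every row correctly, which is why this approach scaled to large tables where direct prompting failed.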

"Our findings indicate that without using a tool-based strategy, current LLMs are unsuitable for standalone use even on minimally complex administrative tasks in clinical settings," says Benjamin Glicksberg. "Structured data tasks in clinical workflows will require agentic approaches that combine LLMs with code execution to ensure accuracy and consistency."

Journal reference:

Klang E, Sorin V, Korfiatis P, Sawant AS, Freeman R, Charney AW, et al. (2026) Large language models are poor clinical administrators: An evaluation of structured queries in real-world electronic health records. PLOS Digit Health 5(5): e0001326. https://doi.org/10.1371/journal.pdig.0001326

