Large language models perform poorly on routine hospital tasks

A new study finds that large language models (LLMs), used with straightforward prompting, perform poorly on routine number-crunching tasks that hospital administrators depend on every day to track patients and allocate resources. The findings were published this week in the open-access journal PLOS Digital Health by Eyal Klang of the Icahn School of Medicine at Mount Sinai, New York, USA, and colleagues.

Hospitals rely on structured electronic health record (EHR) data to monitor patient counts and resources and to generate administrative reports. These tasks are currently handled by data analysts using programming languages, creating delays when staff need fast answers. AI tools known as large language models, such as GPT-4o and Llama, have been proposed to simplify that process.

In the new study, researchers evaluated nine leading LLMs on two basic administrative tasks: counting patients who met a given condition and filtering records on multiple criteria. The evaluation used data drawn from 50,000 real emergency department visits at the Mount Sinai Health System.
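For context, both task types amount to simple tabular queries of the kind a data analyst would write in a few lines. A minimal sketch in Python with pandas, using hypothetical column names (`disposition`, `age`, `triage_level`) since the study's actual EHR schema is not given here:

```python
import pandas as pd

# Hypothetical emergency-department visit table; the column names are
# illustrative, not the study's actual schema.
visits = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "age": [34, 71, 52, 19, 66],
    "disposition": ["admitted", "discharged", "admitted", "discharged", "admitted"],
    "triage_level": [2, 1, 3, 4, 2],
})

# Task 1: count patients meeting a condition.
admitted_count = int((visits["disposition"] == "admitted").sum())

# Task 2: filter records on multiple criteria.
urgent_seniors = visits[(visits["age"] >= 65) & (visits["triage_level"] <= 2)]

print(admitted_count)       # number of admitted patients
print(len(urgent_seniors))  # patients aged 65+ with triage level 1 or 2
```

Queries like these are trivial for conventional code but, as the study shows, surprisingly unreliable when an LLM is asked to compute them directly from the table text.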

The researchers found that straightforward prompting, in which the model is asked a plain question such as "How many patients in this table were admitted?", produced uniformly poor results across all models. Chain-of-thought reasoning, in which the model is prompted to show step-by-step work before giving an answer, offered only modest improvements that degraded sharply as table size increased. Even GPT-4o, the top-performing model, saw accuracy under chain-of-thought prompting drop from roughly 95% on the smallest datasets to below 60% on larger ones.

A tool-based approach, in which the models were asked to generate code that was then executed, substantially improved accuracy for the most capable models, with GPT-4o and Qwen-2.5-72B achieving near-perfect performance. However, distilled DeepSeek models, optimized for speed and efficiency, struggled even with this approach. One model, Llama-3.1-8B, failed to produce usable output in the majority of trials and was excluded from further analysis.
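The tool-based strategy can be pictured as a two-step loop: the model emits a short code snippet instead of a direct answer, and the host application runs that snippet against the table, so the arithmetic is done by deterministic code rather than by the model itself. A simplified sketch, with the model call stubbed out (the study's actual prompts and execution harness are not described in this article):

```python
import pandas as pd

def fake_llm(question: str) -> str:
    """Stand-in for an LLM call; returns a pandas expression as text.
    A real system would send the question plus the table schema to a
    model such as GPT-4o and receive generated code back."""
    return '(df["disposition"] == "admitted").sum()'

def answer_with_tool(df: pd.DataFrame, question: str):
    # Step 1: ask the model for code rather than a final answer.
    code = fake_llm(question)
    # Step 2: execute the generated code against the table.
    # (A production system would sandbox this step; eval() is for
    # illustration only.)
    return eval(code, {"df": df})

df = pd.DataFrame({"disposition": ["admitted", "discharged", "admitted"]})
print(answer_with_tool(df, "How many patients were admitted?"))  # 2
```

Because the count is computed by executed code, its accuracy no longer depends on the model reading every row correctly, which is why this approach scaled to large tables where direct prompting failed.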

"Our findings indicate that without using a tool-based strategy, current LLMs are unsuitable for standalone use even on minimally complex administrative tasks in clinical settings," says Benjamin Glicksberg. "Structured data tasks in clinical workflows will require agentic approaches that combine LLMs with code execution to ensure accuracy and consistency."

Journal reference:

Klang E, Sorin V, Korfiatis P, Sawant AS, Freeman R, Charney AW, et al. (2026) Large language models are poor clinical administrators: An evaluation of structured queries in real-world electronic health records. PLOS Digit Health 5(5): e0001326. https://doi.org/10.1371/journal.pdig.0001326

