Artificial intelligence (AI)-powered chatbots respond to everyday health-related questions from general users with about 76% accuracy, which raises concerns about their trustworthiness in real-world client-facing applications, according to a new study led by Penn State researchers.
The researchers wanted to understand how the average person uses AI for health-related concerns and how accurately AI responds to everyday medical queries. They found that when it comes to healthcare, especially specialized areas like neurology and dermatology, AI tools may work best in the hands of trained physicians rather than patients. The team will present their findings at the 2026 Association for Computing Machinery Fairness, Accountability and Transparency (FAccT) conference in Montreal, Canada, June 25-28.
"Our work focuses explicitly on healthcare scenarios that the average internet user might ask AI, which is a perspective that prior research into large language models (LLMs) and healthcare hasn't covered. We wanted to understand that if people are using LLMs like ChatGPT as a symptom health checker, like historically we've used Google, how accurate is the LLM in answering those queries, and how harmful could those responses be?"
Amulya Yadav, study co-author, associate professor of informatics and intelligent systems in Penn State's College of Information Sciences and Technology (IST)
To understand how accurate or harmful health-related LLM responses could be for the average internet user, the researchers held an AI competition called a Diagnose-a-thon at Penn State. A total of 34 participants - comprising faculty, staff and undergraduate and graduate students - submitted 212 prompts and AI-generated responses to real and imaginary health concerns written from both patient and doctor perspectives. Participants were allowed to choose one of four LLMs to use for the contest: ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro and Llama3-8b.
"One of the strengths of our study is we're essentially trying to replicate real-world usage of LLMs by telling participants to choose the LLM of their choice and use it as they would on a normal day," said Bonam Mingole, lead author of the study and doctoral candidate in information sciences and technology. "This type of participatory research is so important for understanding how the public uses AI in their daily life."
The researchers then asked nine board-certified physicians to rate the accuracy of the AI-generated responses, and how harmful they might be, on a six-point scale ranging from very low to very high. A competition committee awarded prizes to the top eight submissions that generated the most medically accurate information and a prize to the submission that generated the response most likely to cause harm.
They found that overall, 76.2% of LLM-generated responses provided accurate information. Specialties such as obstetrics and gynecology and otolaryngology - the treatment of disorders that affect the ear, nose and throat - saw the best LLM performance, with high validity scores and low harm scores. Internal medicine, neurology and dermatology saw the worst AI performance, with low validity scores and higher harm scores, according to the researchers. They added that very specific prompts, and prompts between 60 and 250 characters, resulted in more accurate LLM outputs.
The researchers then took the base model of each LLM and trained it on medical textbooks, clinical guidelines and peer-reviewed research articles included in a medical school curriculum to see if additional training would increase response validity scores and decrease harm scores. They asked a panel of seven medical professionals and trainees - a board-certified physician, two second-year internal medicine residents, two fourth-year medical students and two third-year medical students - to assess the base LLM responses and responses from the augmented LLMs and determine which were more clinically appropriate. The researchers found that the panel preferred the responses from the Gemini and Llama base models over the augmented models, and showed no significant preference between the base and augmented ChatGPT models.
"We're entering a new age of healthcare, and AI is a significant part of it," said study co-author Jennifer Kraschnewski, director of the Penn State Clinical and Translational Science Institute and professor in internal medicine at the Penn State College of Medicine. "There's a real opportunity for healthcare to transform, to integrate these new tools so that clinicians like myself can use them to improve patient care."
The researchers also noted that despite the LLM validity scores, AI error rates still exceeded 20%, roughly double the error rate of human physicians. Those errors, they said, could potentially be harmful to patients.
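The two figures are consistent with each other: an error rate above 20% follows from the 76.2% overall accuracy reported in the study. A minimal sketch of that arithmetic, assuming for illustration that every response not rated accurate counts as an error (the study's own scoring may be more granular, and the variable names are hypothetical):

```python
# Reported in the study: share of LLM responses rated accurate (percent).
ACCURACY_PCT = 76.2

# Assumption for illustration: any response not rated accurate is an error.
error_rate_pct = round(100.0 - ACCURACY_PCT, 1)

# The article compares the error rate against a 20% mark.
exceeds_20_pct = error_rate_pct > 20.0

print(f"Implied error rate: {error_rate_pct}%")
print(f"Exceeds 20%: {exceeds_20_pct}")
```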
"I don't think AI will replace human physicians, but I do think there's a huge opportunity for us to help upskill today's physician in a way that's never been done before," said Kraschnewski, suggesting that current LLMs may prove better tools for medical professionals than patients.
Overall, the study highlights the potential beneficial and harmful impacts that AI may have on a key aspect of everyone's life, according to the researchers.
"Like it or not, people will continue to use AI for diagnosing their health problems," said study co-author S. Shyam Sundar, Evan Pugh University Professor and James P. Jimirro Professor of Media Effects at Penn State. "By understanding their use patterns and testing the validity of AI performance, our project helps advance literacy on the best and worst uses of AI for medical advice."
Aditya Majumdar and Firdaus Ahmed Choudhury, doctoral students in Penn State's College of IST, also contributed to the study. The Center for Socially Responsible Artificial Intelligence at Penn State hosted the Diagnose-a-thon competition.
Journal reference:
Mingole, B., et al. (2026) Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases. DOI: 10.48550/arXiv.2506.13805. https://arxiv.org/abs/2506.13805