Popular chatbots often provide inaccurate and incomplete medical information

A substantial amount of the medical information provided by 5 popular chatbots is inaccurate and incomplete, with half of the answers to clear, evidence-based questions "somewhat" or "highly" problematic, finds a study published in the open access journal BMJ Open. 

Continued deployment of these chatbots without public education and oversight risks amplifying misinformation, warn the researchers.

Generative AI chatbots have been rapidly adopted across research, education, business, marketing and medicine, with many people using them like search engines, including for everyday health and medical queries, explain the researchers. 

To gauge the level of accuracy provided in areas of health and medicine already prone to misinformation, and therefore with consequences for everyday health behaviour, the researchers probed 5 publicly available and popular generative AI chatbots in February 2025: Gemini (Google); DeepSeek (High-Flyer); Meta AI (Meta); ChatGPT (OpenAI); and Grok (xAI). 

Each chatbot was prompted with 10 open ended and closed questions in each of 5 categories: cancer, vaccines, stem cells, nutrition, and athletic performance. The prompts were designed to resemble common 'information-seeking' health and medical queries and misinformation tropes online and in academic discourse.  

And they were developed to 'strain' models towards misinformation or contraindicated advice, a strategy increasingly used for stress testing AI chatbots and picking up behavioural vulnerabilities, note the researchers. 

Closed prompts required chatbots to provide pre-defined responses, often with one correct answer, that aligned with the scientific consensus. Open ended prompts typically required chatbots to generate multiple responses in list form.

Responses were categorized as non-, somewhat, or highly problematic, using objective pre-defined criteria. A problematic response was defined as one that could plausibly steer lay users towards potentially ineffective treatment, or lead them to harm, if followed without professional guidance.  

The information was scored for accuracy and completeness, and particular attention was given to whether a chatbot presented a false balance between science and non-science based claims, regardless of the strength of the evidence. 

Each response was also graded on readability, ranging from easy, plain English to difficult, academic language, using the Flesch Reading Ease score. 
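For readers unfamiliar with the metric, the Flesch Reading Ease score is a simple arithmetic formula over sentence and word lengths. The sketch below is illustrative only, not the study's tooling; its syllable counter is a rough vowel-group heuristic rather than a dictionary lookup, so real implementations will score slightly differently.

```python
# Illustrative sketch of the Flesch Reading Ease formula.
# Higher scores mean easier text; roughly 0-30 corresponds to
# 'difficult' college-graduate material, 60-70 to plain English.
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels (crude heuristic);
    # every word counts as at least one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch Reading Ease formula.
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syllables / n_words)
```

A short sentence of one-syllable words ("The cat sat on the mat.") scores well above 90, while dense polysyllabic prose falls towards zero, which is the band the study's chatbot responses landed in.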

Half (50%) of the responses were problematic: 30% were somewhat problematic, and 20% were highly problematic.  

Prompt type was influential: open ended prompts, for example, produced 40 highly problematic responses (significantly more than expected) and 51 non-problematic responses (significantly fewer than expected). The opposite was true of closed prompts. 

While the quality of responses didn't differ significantly among the 5 chatbots, Grok generated significantly more highly problematic responses than would be expected (29/50; 58%). Gemini generated the fewest highly problematic responses and the most non-problematic ones. 

The chatbots performed best in the areas of vaccines and cancer, and worst in the areas of stem cells, athletic performance, and nutrition.  

Answers were consistently expressed with confidence and certainty, with few caveats or disclaimers. Out of the total 250 questions, there were only two refusals to answer, both of which came from Meta AI in response to queries about anabolic steroids and alternative cancer treatments. 

Reference quality was poor, with an average completeness score of 40%. Chatbot hallucinations and fabricated citations meant that no chatbot provided a fully accurate reference list.  

All readability scores were graded as 'difficult', equivalent in complexity to text suitable for a college graduate. 

The researchers acknowledge that they assessed only 5 chatbots and that commercial AI is rapidly evolving, so their findings might not be universally applicable. Nor are all real-world queries deliberately adversarial, so the adversarial approach they took may have overstated the prevalence of problematic content. 

Nevertheless, "Our findings regarding scientific accuracy, reference quality, and response readability highlight important behavioral limitations and the need to re-evaluate how AI chatbots are deployed in public-facing health and medical communication," they point out.  

"By default, chatbots do not access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences. They do not reason or weigh evidence, nor are they able to make ethical or value-based judgments," they explain. 

"This behavioral limitation means that chatbots can reproduce authoritative-sounding but potentially flawed responses."  

The data chatbots draw on also include Q&A forums and social media, while their scientific content is typically limited to open access or publicly available articles, which comprise only 30–50% of published studies. While this mix enhances conversational fluency, it may come at the cost of scientific accuracy, advise the researchers. 

"As the use of AI chatbots continues to expand, our data highlight a need for public education, professional training, and regulatory oversight to ensure that generative AI supports, rather than erodes, public health," they conclude.

Journal reference:

Tiller, N. B., et al. (2026). Generative artificial intelligence-driven chatbots and medical misinformation: an accuracy, referencing and readability audit. BMJ Open. DOI: 10.1136/bmjopen-2025-112695. https://bmjopen.bmj.com/content/16/4/e112695
