Leading AI models ace many vaccine questions but falter on clinical rules

New research shows that leading AI models can handle many vaccine questions, but their mistakes on schedules, contraindications, and eligibility highlight why medical oversight remains essential.

Figure: Vaccine knowledge base construction and LLM processing pipeline. Source: Evaluating large language models on multilingual vaccine knowledge: a benchmark study.

In a recent study published in npj Vaccines, a group of researchers evaluated how accurately large language models (LLMs) answer vaccine-related questions across different vaccines, languages, and prompting strategies.

Background

People increasingly use digital tools, including artificial intelligence (AI) chatbots, to seek health information, and many now ask LLMs about vaccines, from safety concerns to vaccination schedules. However, incorrect answers in this area could influence healthcare decisions and public trust.

Vaccines are one of the most effective public health interventions, but vaccine hesitancy is an increasing challenge to global immunization efforts. Therefore, it is important to determine whether AI can provide accurate and timely vaccine information across language barriers. 

About the Study

The researchers developed a multilingual vaccine knowledge benchmark, VaxEval, to assess the performance of contemporary LLMs. The benchmark contained 1,886 multiple-choice questions covering 14 vaccines and three United Nations languages: English, Spanish, and Chinese. Topics covered by these questions include vaccination schedules, efficacy, safety, adverse effects, debunking myths, access, and disease prevention.
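To make the benchmark's shape concrete, the sketch below shows one way a single multiple-choice item could be represented. The field names and the validation rules are illustrative assumptions; the article does not describe the schema the authors actually used, and the sample item contains placeholder text only.

```python
from dataclasses import dataclass

# Hypothetical schema: the field names below are illustrative assumptions,
# not the structure used by the study's authors.
LANGUAGES = {"en", "es", "zh"}  # English, Spanish, Chinese


@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    options: dict[str, str]  # option letter -> option text
    answer: str              # letter of the correct option
    vaccine: str             # e.g. "influenza"
    topic: str               # e.g. "vaccination schedule"
    language: str            # one of LANGUAGES

    def __post_init__(self) -> None:
        # Reject items whose language or answer key is inconsistent.
        if self.language not in LANGUAGES:
            raise ValueError(f"unsupported language: {self.language}")
        if self.answer not in self.options:
            raise ValueError("answer key must match one of the options")


# Placeholder content for illustration only, not an item from the benchmark.
item = BenchmarkItem(
    question="<question text>",
    options={"A": "<option A>", "B": "<option B>",
             "C": "<option C>", "D": "<option D>"},
    answer="B",
    vaccine="influenza",
    topic="vaccination schedule",
    language="en",
)
```

Tagging each item with its vaccine, topic, and language is what would allow accuracy to be broken down along those dimensions later.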

Data for the questions were taken from reputable health organizations, including the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), the United Nations Children's Fund (UNICEF), the Africa CDC, the American Medical Association (AMA), and Immunize.org. Additional material was obtained from peer-reviewed scientific literature. All questions underwent extensive quality checks, and answer keys were verified against trusted scientific sources.

Researchers assessed 13 LLMs, including Generative Pre-trained Transformer (GPT)-4.5, GPT-4o, GPT-4, GPT-3.5-Turbo, Claude 3 Opus, Gemini 1.5 Pro, Llama-4 Maverick, DeepSeek-V3, Grok-3, Qwen 2.5, General Language Model 4 (GLM-4), Reka Core, and Yi-Lightning. Each model was tested with three prompting methods: zero-shot (no examples), few-shot (a small number of worked examples), and chain-of-thought (step-by-step reasoning before answering).
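The three prompting methods differ only in what surrounds the question. The following is a minimal sketch of how such prompts might be assembled; the instruction wording and the function names `format_question` and `build_prompt` are assumptions made for illustration, since the article does not reproduce the study's actual prompts.

```python
# Illustrative prompt templates; the exact wording used in the study is not
# given in the article, so every string below is an assumption.

def format_question(question: str, options: dict[str, str]) -> str:
    """Render a question followed by its lettered options, one per line."""
    lines = [question] + [f"{letter}. {text}" for letter, text in options.items()]
    return "\n".join(lines)


def build_prompt(question, options, method="zero-shot", examples=()):
    """Return a prompt for one multiple-choice question.

    examples: iterable of (question, options, answer_letter) tuples, used
    only by the few-shot method.
    """
    target = format_question(question, options)
    if method == "zero-shot":
        # No examples: the model sees only the question and an instruction.
        return f"{target}\nAnswer with the letter of the correct option."
    if method == "few-shot":
        # Solved examples come first, then the unanswered target question.
        shots = [
            f"{format_question(q, opts)}\nAnswer: {ans}"
            for q, opts, ans in examples
        ]
        return "\n\n".join(shots + [f"{target}\nAnswer:"])
    if method == "chain-of-thought":
        # The model is asked to reason before committing to a letter.
        return (
            f"{target}\nThink through the question step by step, "
            "then give the letter of the correct option on the final line."
        )
    raise ValueError(f"unknown method: {method}")
```

Keeping the question formatting identical across methods means any accuracy difference can be attributed to the added examples or reasoning instruction rather than to layout.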

Models' responses were scored on whether they selected the correct answer option. Statistical analysis, including mixed-effects logistic regression, was then used to identify factors associated with correct and incorrect answers and to compare performance across languages, vaccine types, and model groups.
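Scoring a multiple-choice benchmark comes down to pulling an option letter out of each response and comparing it with the answer key. The sketch below assumes four options labeled A to D and takes the last standalone letter in the response as the model's choice; this extraction rule is an assumption, not the procedure described by the authors.

```python
import re

# Assumption: the model is asked to end its reply with a single option letter
# (A-D). The study's actual answer-extraction rule is not described here.
_CHOICE = re.compile(r"\b([A-D])\b")


def extract_choice(response: str):
    """Return the last standalone option letter in the response, or None."""
    matches = _CHOICE.findall(response)
    return matches[-1] if matches else None


def accuracy(responses, answer_key):
    """Fraction of responses whose extracted letter matches the key."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have equal length")
    correct = sum(
        extract_choice(r) == key for r, key in zip(responses, answer_key)
    )
    return correct / len(answer_key)
```

A rule this simple can misread free-form replies, for example when a sentence after the answer begins with the word "A", so a real pipeline would likely constrain the output format more tightly.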

Study Results

The benchmark included 1,340 English, 250 Spanish, and 296 Chinese questions. The average accuracy across all models was 86.0% for English, 83.7% for Spanish, and 80.0% for Chinese. This indicates that LLMs have substantial vaccine-related knowledge across the three languages, though performance varies by language.
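Because English questions make up most of the benchmark, a question-weighted average sits closer to the English figure than a simple mean of the three languages does. The short calculation below uses only the counts and percentages reported above; the pooled value itself is derived here for illustration and is not a figure from the study.

```python
# Question counts and mean accuracies (percent) as reported in the article.
counts = {"English": 1340, "Spanish": 250, "Chinese": 296}
mean_accuracy = {"English": 86.0, "Spanish": 83.7, "Chinese": 80.0}

total = sum(counts.values())
assert total == 1886  # matches the stated benchmark size

# Question-weighted average: each language contributes in proportion to its
# share of the benchmark. This pooled figure is derived here, not reported.
pooled = sum(counts[lang] * mean_accuracy[lang] for lang in counts) / total

# Unweighted mean of the three language averages, for comparison.
unweighted = sum(mean_accuracy.values()) / len(mean_accuracy)

print(f"pooled: {pooled:.1f}%  unweighted: {unweighted:.1f}%")
```

The weighted figure comes out near 84.8%, against roughly 83.2% for the unweighted mean, which shows how much the English subset dominates any single headline number.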

Among the evaluated systems, GPT-4o achieved the highest overall accuracy at 90.3%, closely followed by Llama-4 Maverick at 90.2% and DeepSeek-V3 at 89.6%. As a group, newer flagship models outperformed earlier-generation models.

Statistical analysis showed that flagship models had 57% higher odds of providing correct answers than older systems, although GPT-4o, which was classified as an earlier model in this study, still achieved the highest overall accuracy.

Prompting strategy also affected performance. Few-shot prompts gave the best results, increasing the likelihood of correct responses by 17% compared with zero-shot prompting.

Chain-of-thought prompts had the opposite of the expected effect: they were associated with 21% lower odds of a correct answer than zero-shot prompts. This suggests that encouraging models to generate step-by-step reasoning may not always improve factual accuracy in structured vaccine-related tasks.
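Odds are not the same as accuracy, so "57% higher odds" or "21% lower odds" translates into a much smaller shift in percentage points when baseline accuracy is already high. The sketch below converts an odds ratio into an implied accuracy; the 85% baseline is a hypothetical value chosen for illustration, not a number from the study.

```python
def apply_odds_ratio(baseline_accuracy: float, odds_ratio: float) -> float:
    """Convert a baseline probability to the probability implied by an odds ratio."""
    if not 0.0 < baseline_accuracy < 1.0:
        raise ValueError("baseline_accuracy must be strictly between 0 and 1")
    odds = baseline_accuracy / (1.0 - baseline_accuracy)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline chosen for illustration; not a figure from the study.
baseline = 0.85

# Odds ratios implied by the article: 57% higher odds -> 1.57,
# 21% lower odds -> 0.79.
for label, ratio in [("flagship vs. earlier models", 1.57),
                     ("chain-of-thought vs. zero-shot", 0.79)]:
    shifted = apply_odds_ratio(baseline, ratio)
    print(f"{label}: {baseline:.1%} -> {shifted:.1%}")
```

Under that assumed baseline, the two odds ratios correspond to accuracies of roughly 90% and 82%, a shift of a few percentage points in either direction rather than tens of points.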

Performance differed considerably across vaccine types. The highest accuracy rates were observed for influenza (90.5%), hepatitis A (89.5%), human papillomavirus (HPV) (88.4%), and coronavirus disease 2019 (COVID-19) vaccines (85.3%).

Vaccines for respiratory syncytial virus (RSV) (80.6%), meningococcal disease (81.7%), pneumococcal disease (77.7%), and dengue (76.4%) were among the lower-performing vaccine categories. These findings indicate that models performed better on widely discussed vaccines that are heavily represented in public health communication.

Models achieved the highest accuracy for misconceptions and corrections (93.0%), prevention-related questions (90.0%), and regulatory or monitoring systems (87.2%). Lower performance was observed for vaccine types and basic information (82.5%), effectiveness and benefits (86.3%), cost and accessibility (82.6%), and dosing or recommendation questions (82.5%).

Language analyses showed that Spanish and Chinese questions were less likely to be answered correctly than English questions. Additional analysis of semantically aligned multilingual questions showed that many of these differences were linked to variations in dataset composition rather than inherent language bias.

The authors also noted that the Spanish and Chinese datasets were independently constructed rather than direct translations of the English questions, which may have contributed to differences in item difficulty, topic distribution, and source composition.

Error analysis highlighted model weaknesses: nearly half of a sampled set of 150 incorrect responses resulted from overgeneralization, in which models supplied broad statements without considering vaccine-specific requirements.

Other common errors included incorrect dosing intervals, misidentification of contraindications, incorrect recommendations for age-based eligibility, and inability to distinguish between vaccine types. These types of errors were of particular concern because they relate to practical guidance that may affect vaccination decisions.

Conclusions

The findings show that modern LLMs possess strong knowledge of vaccines and can accurately answer most vaccine questions across multiple languages.

Newer flagship models substantially outperformed earlier systems at the group level, and few-shot prompting improved performance. However, significant weaknesses remain in areas requiring explicit clinical guidance, such as schedules, contraindications, and eligibility.

In addition, accuracy across different vaccines and languages remains inconsistent. Although these systems show promise for supporting vaccine education and public health communication, their remaining error rates highlight the need for careful oversight, continuous evaluation, and structured safeguards before widespread deployment in health-related settings. 

The authors also emphasized that multiple-choice accuracy does not establish clinical reliability or readiness for real-world vaccine counseling without prospective validation and context-specific safety evaluation.

Further studies are needed to assess the accuracy, safety, and real-world effectiveness of AI-supported health communication.

Journal reference:
  • Chen, S., Wass, L., Wu, Z., Garay, L., Vizoso, J., Leung, K., Wu, J., & Lin, L. (2026). Evaluating large language models on multilingual vaccine knowledge: A benchmark study. npj Vaccines. DOI: 10.1038/s41541-026-01500-1, https://www.nature.com/articles/s41541-026-01500-1

Written by

Vijay Kumar Malesu

Vijay holds a Ph.D. in Biotechnology and possesses a deep passion for microbiology. His academic journey has allowed him to delve deeper into understanding the intricate world of microorganisms. Through his research and studies, he has gained expertise in various aspects of microbiology, which includes microbial genetics, microbial physiology, and microbial ecology. Vijay has six years of scientific research experience at renowned research institutes such as the Indian Council for Agricultural Research and KIIT University. He has worked on diverse projects in microbiology, biopolymers, and drug delivery. His contributions to these areas have provided him with a comprehensive understanding of the subject matter and the ability to tackle complex research challenges.    

Citations

Kumar Malesu, Vijay. (2026, June 11). Leading AI models ace many vaccine questions but falter on clinical rules. News-Medical. Retrieved on June 12, 2026 from https://www.news-medical.net/news/20260611/Leading-AI-models-ace-many-vaccine-questions-but-falter-on-clinical-rules.aspx.
