New research shows that leading AI models can handle many vaccine questions, but their mistakes on schedules, contraindications, and eligibility highlight why medical oversight remains essential.

Vaccine knowledge base: construction and LLM processing pipeline. Paper: Evaluating large language models on multilingual vaccine knowledge: a benchmark study
In a recent study published in npj Vaccines, a group of researchers evaluated how accurately large language models (LLMs) answer vaccine-related questions across different vaccines, languages, and prompting strategies.
Background
Many people increasingly use digital tools, including artificial intelligence (AI) chatbots, to seek health information. Many people now ask LLMs questions about vaccines, from safety concerns to vaccination schedules. However, incorrect answers in this area could influence healthcare decisions and public trust.
Vaccines are one of the most effective public health interventions, but vaccine hesitancy is an increasing challenge to global immunization efforts. Therefore, it is important to determine whether AI can provide accurate and timely vaccine information across language barriers.
About the Study
The researchers developed a multilingual vaccine knowledge benchmark, VaxEval, to assess the performance of contemporary LLMs. The benchmark contained 1,886 multiple-choice questions covering 14 vaccines and three United Nations languages: English, Spanish, and Chinese. Topics covered by these questions include vaccination schedules, efficacy, safety, adverse effects, debunking myths, access, and disease prevention.
Data for the questions were taken from reputable health organizations, including the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), the United Nations Children's Fund (UNICEF), the Africa CDC, the American Medical Association (AMA), and Immunize.org. Additional material was obtained from peer-reviewed scientific literature. All questions underwent extensive quality checks, and answer keys were verified against trusted scientific sources.
Researchers assessed 13 LLMs, including Generative Pre-trained Transformer (GPT)-4.5, GPT-4o, GPT-4, GPT-3.5-Turbo, Claude 3 Opus, Gemini 1.5 Pro, Llama-4 Maverick, DeepSeek-V3, Grok-3, Qwen 2.5, General Language Model 4 (GLM-4), Reka Core, and Yi-Lightning. Models used three prompting methods: zero-shot, few-shot, and chain-of-thought.
Models' responses were assessed for their ability to submit the correct answer option. Subsequently, statistical analysis, including mixed-effects logistic regression, was performed to identify characteristics of correct and incorrect answers and to compare the models' performance across languages, vaccine types, and model groups.
Study Results
The benchmark included 1,340 English, 250 Spanish, and 296 Chinese questions. The average accuracy across all models was 86.0% for English, 83.7% for Spanish, and 80.0% for Chinese. This indicates that LLMs have substantial vaccine-related knowledge across the three languages, though performance varies by language.
Among the evaluated systems, GPT-4o achieved the highest overall accuracy at 90.3%, closely followed by Llama-4 Maverick at 90.2% and DeepSeek-V3 at 89.6%. As a group, newer flagship models outperformed earlier-generation models.
Statistical analysis showed that flagship models had 57% higher odds of providing correct answers than older systems, although GPT-4o, which was classified as an earlier model in this study, still achieved the highest overall accuracy.
The type of prompting was also a factor in how well a model performed. The few-shot prompts gave the best results, increasing the likelihood of correct responses by 17% compared with zero-shot prompting.
The use of chain-of-thought prompts had an opposite effect than expected; they were associated with 21% lower odds of answer correctness than zero-shot prompts. This suggests that encouraging models to generate step-by-step reasoning may not always improve factual accuracy in structured vaccine-related tasks.
Performance differed considerably across vaccine types. The highest accuracy rates were observed for influenza (90.5%), hepatitis A (89.5%), human papillomavirus (HPV) (88.4%), and Coronavirus disease 2019 (COVID-19) vaccines (85.3%).
Vaccines for respiratory syncytial virus (RSV) (80.6%), meningococcal disease (81.7%), pneumococcal disease (77.7%), and dengue (76.4%) were among the lower-performing vaccine categories. These findings indicate that models performed better on widely discussed vaccines that are heavily represented in public health communication.
Models achieved the highest accuracy for misconceptions and corrections (93.0%), prevention-related questions (90.0%), and regulatory or monitoring systems (87.2%). Lower performance was observed for vaccine types and basic information (82.5%), effectiveness and benefits (86.3%), cost and accessibility (82.6%), and dosing or recommendation questions (82.5%).
Language analyses showed that Spanish and Chinese questions were less likely to be answered correctly than English questions. Additional analysis of semantically aligned multilingual questions showed that many of these differences were linked to variations in dataset composition rather than inherent language bias.
The authors also noted that the Spanish and Chinese datasets were independently constructed rather than direct translations of the English questions, which may have contributed to differences in item difficulty, topic distribution, and source composition.
Error analysis highlighted model weaknesses: nearly half of a sampled set of 150 incorrect responses resulted from overgeneralization, in which models supplied broad statements without considering vaccine-specific requirements.
Other common errors included incorrect dosing intervals, misidentification of contraindications, incorrect recommendations for age-based eligibility, and inability to distinguish between vaccine types. These types of errors were of particular concern because they relate to practical guidance that may affect vaccination decisions.
Conclusions
The findings show that modern LLMs possess strong knowledge of vaccines and can accurately answer most vaccine questions across multiple languages.
Newer flagship models substantially outperformed earlier systems at the group level, and few-shot prompting improved performance. However, many significant weaknesses remain in areas requiring explicit clinical guidance.
In addition, accuracy across different vaccines and languages remains inconsistent. Although these systems show promise for supporting vaccine education and public health communication, their remaining error rates highlight the need for careful oversight, continuous evaluation, and structured safeguards before widespread deployment in health-related settings.
The authors also emphasized that multiple-choice accuracy does not establish clinical reliability or readiness for real-world vaccine counseling without prospective validation and context-specific safety evaluation.
Further studies are needed to assess the accuracy, safety, and real-world effectiveness of AI-supported health communication.
Download your PDF copy by clicking here.
Journal reference:
- Chen, S., Wass, L., Wu, Z., Garay, L., Vizoso, J., Leung, K., Wu, J., & Lin, L. (2026). Evaluating large language models on multilingual vaccine knowledge: A benchmark study. npj Vaccines. DOI: 10.1038/s41541-026-01500-1, https://www.nature.com/articles/s41541-026-01500-1