Prompt engineering boosts AI's adherence to medical guidelines, study shows

In a recent study published in the journal npj Digital Medicine, a group of researchers examined whether prompt engineering improves the reliability of large language models (LLMs) and the consistency of their responses with evidence-based clinical guidelines in medicine.

Study: Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs


LLMs have advanced significantly in natural language processing (NLP) and show promise for medical applications such as diagnosis and guideline adherence. However, their accuracy and reliability in the medical field vary, particularly in complex cases and in consistency with guidelines. Prompt engineering, which refines prompts to elicit better responses from LLMs, is a promising strategy for improving their performance in medical contexts. Further research is needed to enhance LLMs' accuracy, reliability, and relevance in medical settings in support of clinical decision-making and patient care.

About the study 

The present study tested LLMs' consistency with the American Academy of Orthopedic Surgeons (AAOS) evidence-based osteoarthritis (OA) guidelines. The AAOS is the largest global association of musculoskeletal specialists, and its OA guidelines are supported by research evidence and cover management recommendations ranging from treatments to patient education, making them an authoritative resource in the field.

The study used four types of prompts: Input-Output (IO) prompting, Zero-Shot Chain of Thought (0-COT) prompting, Prompted Chain of Thought (P-COT) prompting, and Return on Thought (ROT) prompting. The objective was to examine the LLMs' adherence to the AAOS guidelines and the reliability of their responses upon repeated inquiries. Each prompt was designed to elicit responses that could be evaluated against the AAOS guidelines' recommendations.
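As a rough illustration of how such prompt styles differ, the sketch below builds three of them in Python. The wording of every template, the rating task itself, and the function names are assumptions made for illustration; the study's actual prompts are not reproduced in this article. ROT prompting is left out because its structure is not described here.

```python
# Hypothetical prompt builders. The wording is illustrative only and is
# NOT taken from the study; the rating task is also an assumption.

BASE_TASK = (
    "Recommendation: {statement}\n"
    "Rate the strength of this recommendation as one of: "
    "strong, moderate, limited, consensus."
)


def io_prompt(statement: str) -> str:
    """Input-Output: ask directly for the answer, with no reasoning request."""
    return BASE_TASK.format(statement=statement) + "\nReply with the rating only."


def zero_shot_cot_prompt(statement: str) -> str:
    """Zero-shot chain of thought: add a generic cue to reason first."""
    return (
        BASE_TASK.format(statement=statement)
        + "\nLet's think step by step, then give the rating."
    )


def prompted_cot_prompt(statement: str) -> str:
    """Prompted chain of thought: spell out the reasoning steps to follow."""
    steps = [
        "Summarize the evidence you know for this recommendation.",
        "Judge the quality and quantity of that evidence.",
        "State the rating that best fits your judgment.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return BASE_TASK.format(statement=statement) + "\nFollow these steps:\n" + numbered


if __name__ == "__main__":
    # Placeholder text stands in for a real guideline statement.
    for build in (io_prompt, zero_shot_cot_prompt, prompted_cot_prompt):
        print(build("<guideline statement>"))
        print("-" * 40)
```

The three builders share one task description and differ only in how much reasoning they request, which is the variable a prompt comparison of this kind isolates.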

Nine different LLMs were used, accessed either through web interfaces or application programming interfaces (APIs), with fine-tuning performed according to the protocols described on the OpenAI platform. Statistical analysis, conducted using SPSS and Python, focused on measuring the consistency and reliability of the LLMs' responses. Consistency was defined by the instances in which the LLMs' recommendations exactly matched those of the AAOS guidelines, while reliability was measured by the repeatability of responses to the same questions, assessed using the Fleiss kappa test.
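The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code: the function names and the toy ratings are hypothetical, and each repeated query of the same question is treated as one "rater" for the Fleiss kappa calculation.

```python
from collections import Counter


def consistency_rate(responses, guideline):
    """Share of responses that exactly match the guideline rating."""
    if len(responses) != len(guideline):
        raise ValueError("responses and guideline must have the same length")
    matches = sum(r == g for r, g in zip(responses, guideline))
    return matches / len(guideline)


def fleiss_kappa(ratings):
    """Fleiss kappa for a list of items, each a list of category labels.

    Every item must carry the same number of labels (one per repeated query).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: mean over items of the pairwise agreement P_i.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum((sum(c[k] for c in counts) / total) ** 2 for k in categories)

    if p_expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Toy data (hypothetical): three questions, each asked five times.
    repeats = [
        ["strong"] * 5,
        ["moderate"] * 5,
        ["limited", "limited", "limited", "strong", "strong"],
    ]
    print(fleiss_kappa(repeats))
    print(consistency_rate(["strong", "moderate", "limited"],
                           ["strong", "limited", "limited"]))
```

A model can score well on one metric and poorly on the other: it may repeat the same answer every time (high kappa) while that answer disagrees with the guideline (low consistency).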

Study results 

The study's findings highlighted generative pre-trained transformer (gpt)-4-Web as the superior model in terms of consistency, with rates between 50.6% and 62.9% across different prompts. Other models, such as gpt-3.5-ft-0 and gpt-4-API-0, demonstrated lower consistency rates with specific prompts. The highest consistency was observed with ROT prompting in gpt-4-Web, suggesting that this combination aligns most effectively with clinical guidelines. Across models and prompts, consistency rates ranged broadly: gpt-4 models achieved up to 62.9% and gpt-3.5 models, including fine-tuned versions, up to 55.3%, while Bard models ranged from 19.4% to 44.1%, indicating that the effectiveness of prompts varies across LLMs.

Subgroup analysis was conducted based on the AAOS's categorization of recommendation levels from strong to consensus. This analysis aimed to discern whether the strength of evidence impacted consistency rates. It was found that at moderate evidence levels, no significant differences in consistency rates were observed within gpt-4-Web. However, notable differences emerged at the limited evidence level, where ROT and IO prompting significantly outperformed P-COT prompting in gpt-4-Web. Despite these findings, consistency levels in other models generally remained below 70%.

Reliability, assessed using the Fleiss kappa test, varied widely among the models and prompts, with kappa values ranging from -0.002 to 0.984. This variability indicates differing levels of repeatability in responses to the same questions across models and prompts. Notably, IO prompting in gpt-3.5-ft-0 and gpt-3.5-API-0 demonstrated almost perfect reliability, while P-COT prompting in gpt-4-API-0 showed substantial reliability. However, the overall reliability of other prompts and models was moderate or lower.
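The labels "almost perfect", "substantial", and "moderate" match the widely used Landis and Koch bands for interpreting kappa values. Assuming that scale was the one applied (the article does not say so explicitly), the mapping can be sketched as follows; the function name is hypothetical.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to its Landis and Koch agreement label."""
    if kappa > 1.0:
        raise ValueError("kappa cannot exceed 1")
    if kappa < 0.0:
        return "poor"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa is not a valid number")


if __name__ == "__main__":
    # The two extremes reported in the study.
    for value in (-0.002, 0.984):
        print(value, landis_koch_label(value))
```

On this scale, the lowest reported value (-0.002) means agreement no better than chance, and the highest (0.984) means nearly identical answers on every repeat.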

Invalid data were categorized and processed according to specific procedures, with a significant portion of responses to certain prompts being considered invalid, particularly in gpt-3.5-API-0. This contrasted with gpt-4-Web, which had a relatively low rate of invalid responses. 


To summarize, the study highlights the impact of prompt engineering on the accuracy of LLMs in medical responses, particularly noting the superior performance of gpt-4-Web with ROT prompting in adhering to clinical guidelines for OA. It underscores the importance of combining prompt engineering, parameter settings, and fine-tuning to enhance LLM utility in clinical medicine. The findings advocate for further exploration into prompt engineering strategies and the development of evaluation frameworks involving healthcare professionals and patients, aiming to improve LLM effectiveness and reliability in medical settings.

Journal reference:
Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. npj Digital Medicine.

Written by

Vijay Kumar Malesu

Vijay holds a Ph.D. in Biotechnology and possesses a deep passion for microbiology. His academic journey has allowed him to delve deeper into understanding the intricate world of microorganisms. Through his research and studies, he has gained expertise in various aspects of microbiology, which includes microbial genetics, microbial physiology, and microbial ecology. Vijay has six years of scientific research experience at renowned research institutes such as the Indian Council for Agricultural Research and KIIT University. He has worked on diverse projects in microbiology, biopolymers, and drug delivery. His contributions to these areas have provided him with a comprehensive understanding of the subject matter and the ability to tackle complex research challenges.    


Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Kumar Malesu, Vijay. (2024, February 21). Prompt engineering boosts AI's adherence to medical guidelines, study shows. News-Medical. Retrieved on April 17, 2024.

  • MLA

    Kumar Malesu, Vijay. "Prompt engineering boosts AI's adherence to medical guidelines, study shows". News-Medical. 17 April 2024.

  • Chicago

    Kumar Malesu, Vijay. "Prompt engineering boosts AI's adherence to medical guidelines, study shows". News-Medical. (accessed April 17, 2024).

  • Harvard

    Kumar Malesu, Vijay. 2024. Prompt engineering boosts AI's adherence to medical guidelines, study shows. News-Medical, viewed 17 April 2024.


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of News Medical.