Can generative AI truly transform healthcare into a more personalized experience?

In a recent article published in npj Digital Medicine, researchers explored the current literature on large language model (LLM)-based evaluation metrics for healthcare chatbots.

They developed a set of evaluation metrics covering language processing, real-world clinical impact, and conversational effectiveness to assess healthcare chatbots from an end-user perspective.

Further, they discussed the challenges in implementing these metrics and offered future directions for an effective evaluation framework.

Study: Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. Image Credit: olya osyunina/Shutterstock.com


Artificial intelligence (AI), especially in the form of healthcare chatbots, is revolutionizing patient care by enabling interactive, personalized, and proactive assistance across various medical tasks and services.

Therefore, establishing comprehensive evaluation metrics is crucial for enhancing the chatbots' performance and ensuring the delivery of reliable and accurate medical services. However, the existing metrics lack standardization and fail to capture essential medical concepts, hindering their effectiveness.

Further, the current metrics fail to consider important user-centered aspects of chatbot interactions, including emotional connection, empathy, ethical implications, safety concerns such as hallucinations, and computational efficiency.

Addressing these gaps, researchers in the present article introduced user-centered evaluation metrics for healthcare chatbots and discussed the challenges and significance associated with their implementation.

Existing evaluation metrics for LLMs

The evaluation of language models involves intrinsic and extrinsic methods, either of which may be automatic or manual. Intrinsic metrics assess a model's proficiency in generating coherent sentences, while extrinsic metrics gauge its performance in a real-world context.

Existing intrinsic metrics, such as BLEU (short for bilingual evaluation understudy) and ROUGE (short for recall-oriented understudy for gisting evaluation), lack semantic understanding, leading to inaccuracies in assessing healthcare chatbots.
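To illustrate why overlap-based metrics can mislead in a clinical setting, here is a minimal sketch. It uses clipped unigram precision, a simplified stand-in for BLEU-1 without the brevity penalty, not the full BLEU or ROUGE definitions. The function name and the example sentences are hypothetical and were made up for illustration; they do not come from the study.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the share of candidate words found in the reference.

    This is a simplified illustration, not the full BLEU metric.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate word is credited at most as often as it appears in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(cand_tokens)


# Hypothetical example sentences (assumptions, for illustration only).
reference = "take one tablet twice daily with food"
paraphrase = "swallow a single pill two times each day alongside meals"
unsafe = "take one tablet twice hourly with food"

paraphrase_score = unigram_precision(paraphrase, reference)
unsafe_score = unigram_precision(unsafe, reference)

print(f"correct paraphrase: {paraphrase_score:.2f}")
print(f"unsafe near-copy:   {unsafe_score:.2f}")
```

In this sketch the medically equivalent paraphrase shares no words with the reference and scores zero, while the near-copy that changes "daily" to "hourly" scores high, which is the kind of inaccuracy the paragraph above describes.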

Extrinsic metrics, including general-purpose and health-specific ones, offer subjective assessments from human perspectives. However, the current evaluations fail to consider crucial aspects like empathy, reasoning, and up-to-dateness.

Multi-metric approaches such as HELM (short for holistic evaluation of language models) provide comprehensive evaluations but fail to capture all essential elements required for assessing healthcare chatbots thoroughly. Therefore, there's a need for more inclusive and user-centered evaluation metrics in this domain.

Essential metrics for evaluating healthcare chatbots

In the present paper, the researchers outlined a comprehensive set of metrics for the user-centered evaluation of LLM-based healthcare chatbots, aiming to distinguish this approach from existing studies.

The evaluation process involves interacting with chatbots and assigning scores to various metrics, considering user perspectives. Three essential confounding variables are user type, domain type, and task type.

User type encompasses patients, healthcare providers, etc., influencing safety and privacy considerations. Domain type determines the breadth of topics covered, while task type influences metric scoring based on specific functions like diagnosis or assistance.

Metrics are categorized into four groups: Accuracy, trustworthiness, empathy, and performance. Accuracy metrics assess grammar, semantics, and structure, adapted to domains and tasks.

Trustworthiness metrics encompass safety, privacy, bias, and interpretability, which are crucial for responsible AI.

Empathy metrics evaluate emotional support, health literacy, fairness, and personalization tailored to user needs. Performance metrics address usability and latency, taking into account memory efficiency, floating point operations, token limit, and the number of model parameters.

These metrics collectively provide a comprehensive framework for evaluating healthcare chatbots from diverse perspectives, enhancing their reliability and effectiveness in real-world applications.


The challenges in assessing healthcare chatbots are categorized into three groups: Metrics association, evaluation methods, and model prompt techniques and parameters.

Metrics association involves within-category and between-category relations, impacting metric correlations. For instance, within accuracy metrics, up-to-dateness positively correlates with groundedness.

Between-category relations also occur: trustworthiness and empathy metrics may be correlated because empathy requires personalization, which can compromise privacy. Performance metrics also influence other categories; for example, the number of model parameters affects accuracy, trustworthiness, and empathy.

Evaluation methods encompass automatic and human-based approaches. Benchmark selection is crucial for a comprehensive evaluation and must take the confounding variables into account. Human-based methods are prone to subjectivity and require annotators with diverse domain expertise for accurate scoring.

Model prompt techniques and parameters significantly affect chatbot responses. Various prompting methods and parameter adjustments influence chatbot behavior and metric scores. For example, modifying beam search or temperature parameters impacts the safety and other metric scores.
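As a rough illustration of why the temperature parameter matters, the sketch below applies temperature scaling to a softmax over next-token scores. The logit values and the function name are assumptions chosen for illustration; a real chatbot would compute this over its full vocabulary inside the model's decoding loop.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens (assumed values).
logits = [2.0, 1.0, 0.1]

sharp = softmax_with_temperature(logits, 0.5)  # low temperature
flat = softmax_with_temperature(logits, 2.0)   # high temperature

print("T=0.5:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in flat])
```

A lower temperature concentrates probability on the top-scoring token, making responses more repeatable, while a higher temperature spreads probability toward less likely tokens. Two evaluations of the same chatbot at different temperatures can therefore yield different metric scores, so the setting should be reported alongside the results.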

These challenges highlight the complexity of healthcare chatbot evaluation, necessitating careful consideration of metric associations, evaluation methods, and model parameters for accurate assessment and leaderboard representation.

Towards an effective evaluation framework

To ensure effective evaluation and comparison of different healthcare chatbot models, it is crucial for healthcare researchers to carefully consider all the configurable environments introduced, including confounding variables, prompt techniques and parameters, and evaluation methods.

While the “interface” enables users to configure the environment, the “interacting users” (evaluators and healthcare research teams) utilize the framework for assessment and model development.

Further, the “leaderboard” feature allows users to rank and compare chatbot models based on specific criteria.
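The ranking idea behind such a leaderboard can be sketched as follows. This is only an illustration under stated assumptions: the class, the model names, and the scores are hypothetical placeholders, and the paper's actual framework may aggregate and present scores differently.

```python
from dataclasses import dataclass


@dataclass
class ChatbotScores:
    """Per-category scores for one chatbot (hypothetical 0-1 scale)."""
    name: str
    accuracy: float
    trustworthiness: float
    empathy: float
    performance: float


def rank_models(models, criterion):
    """Return the models sorted from best to worst on the chosen category."""
    return sorted(models, key=lambda m: getattr(m, criterion), reverse=True)


# Placeholder entries; names and numbers are invented for illustration.
models = [
    ChatbotScores("model_a", accuracy=0.8, trustworthiness=0.6, empathy=0.7, performance=0.9),
    ChatbotScores("model_b", accuracy=0.7, trustworthiness=0.9, empathy=0.8, performance=0.5),
]

by_trust = [m.name for m in rank_models(models, "trustworthiness")]
by_performance = [m.name for m in rank_models(models, "performance")]

print("by trustworthiness:", by_trust)
print("by performance:    ", by_performance)
```

The ordering changes with the criterion selected, which is why the choice of ranking criteria and the evaluation configuration need to be stated when models are compared.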


In conclusion, the paper proposed tailored evaluation metrics for healthcare chatbots, categorizing them into accuracy, trustworthiness, empathy, and computing performance to enhance patient care quality.

In the future, studies implementing the present assessment framework through benchmarks and case studies across medical domains could help address the challenges associated with healthcare chatbots and ultimately improve healthcare delivery.


Written by

Dr. Sushama R. Chaphalkar

Dr. Sushama R. Chaphalkar is a senior researcher and academician based in Pune, India. She holds a PhD in Microbiology and has extensive experience in research and education in Biotechnology. In a career spanning three and a half decades, she has held prominent leadership positions in academia and industry. As the Founder-Director of a renowned Biotechnology institute, she worked extensively on high-end research projects of industrial significance, fostering a stronger bond between industry and academia.


Citation (APA):

    Chaphalkar, Sushama R. (2024, April 02). Can generative AI truly transform healthcare into a more personalized experience? News-Medical. Retrieved on April 12, 2024.


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of News Medical.