Methods Inf Med 2021; 60(05/06): 171-179
DOI: 10.1055/s-0041-1736664
Original Article

Evaluation Metrics for Health Chatbots: A Delphi Study

Kerstin Denecke
1   School of Engineering and Computer Science, Institute for Medical Informatics, Bern University of Applied Sciences, Biel, Switzerland
Alaa Abd-Alrazaq
2   Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
Mowafa Househ
2   Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
Jim Warren
3   Faculty of Science, School of Computer Science, University of Auckland, Auckland, New Zealand
Funding None.


Background In recent years, an increasing number of health chatbots has been published in app stores and described in research literature. Given the sensitive data they are processing and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of those systems are reported inconsistently and without using a standardized set of evaluation metrics. Missing standards in health chatbot evaluation prevent comparisons of systems, and this may hamper acceptability since their reliability is unclear.

Objectives This paper aims to take a step toward a health-specific chatbot evaluation framework by finding expert consensus on relevant metrics.

Methods We used an adapted Delphi study design to verify and select potential metrics that we had initially retrieved from a scoping review. Over three survey rounds, we invited researchers, health professionals, and health informaticians to score each metric for inclusion in the final evaluation framework. We distinguished metrics scored as relevant with high, moderate, and low consensus. The initial set comprised 26 metrics, categorized as global metrics and metrics related to response generation, response understanding, and aesthetics.
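
As an illustration of how per-metric scores could be mapped to the three consensus levels, the following minimal Python sketch classifies a metric by the share of experts who rated it relevant. The cut-offs (80% and 60%), the function name `consensus_level`, and the example votes are all assumptions made for illustration; the abstract does not state the thresholds or the scoring scale used in the study.

```python
# Hypothetical cut-offs: the abstract does not report the thresholds used.
HIGH_CONSENSUS = 0.80
MODERATE_CONSENSUS = 0.60


def consensus_level(votes, high=HIGH_CONSENSUS, moderate=MODERATE_CONSENSUS):
    """Classify one metric from a list of booleans (True = rated relevant)."""
    if not votes:
        raise ValueError("at least one vote is required")
    share = sum(votes) / len(votes)  # fraction of experts rating it relevant
    if share >= high:
        return "high"
    if share >= moderate:
        return "moderate"
    return "low"


# Made-up votes for placeholder metrics; these are NOT study data.
example_votes = {
    "metric_a": [True] * 21 + [False] * 1,
    "metric_b": [True] * 14 + [False] * 8,
    "metric_c": [True] * 9 + [False] * 13,
}
levels = {name: consensus_level(v) for name, v in example_votes.items()}
print(levels)
```

Under these assumed thresholds, a metric moves between levels only when the share of "relevant" votes crosses a cut-off, so the classification can be recomputed after each survey round.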

Results Twenty-eight experts joined the first round and 22 (75%) persisted to the third round. Twenty-four metrics achieved high consensus and three metrics achieved moderate consensus. The core set for our framework comprises mainly global metrics (e.g., ease of use, security, content accuracy), metrics related to response generation (e.g., appropriateness of responses), and metrics related to response understanding. Metrics on aesthetics (font type and size, color) are less well agreed upon; only moderate or low consensus was achieved for those metrics.

Conclusion The results indicate that experts largely agree on metrics and that the consensus set is broad. This implies that health chatbot evaluation must be multifaceted to ensure acceptability.

Authors' Contributions

J.W. and K.D. developed the study concept and protocol. A.A.-A. and K.D. conducted the study with the guidance of M.H. and J.W. A.A.-A. and K.D. drafted the manuscript; A.A.-A. summarized the study results; J.W. and K.D. interpreted the results and drew conclusions. The manuscript was revised critically for important intellectual content by all the authors. All authors approved the manuscript for publication and agreed to be accountable for all the aspects of the work.

Publication History

Received: 27 June 2021

Accepted: 10 September 2021

Article published online: 31 October 2021

© 2021. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
