Evaluation Metrics for Health Chatbots: A Delphi Study

Kerstin Denecke; Alaa Abd-Alrazaq; Mowafa Househ; Jim Warren

doi:10.1055/s-0041-1736664

Methods Inf Med 2021; 60(05/06): 171-179
DOI: 10.1055/s-0041-1736664

Original Article

Authors

Kerstin Denecke
¹School of Engineering and Computer Science, Institute for Medical Informatics, Bern University of Applied Sciences, Biel, Switzerland

Alaa Abd-Alrazaq
²Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar

Mowafa Househ
²Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar

Jim Warren
³Faculty of Science, School of Computer Science, University of Auckland, Auckland, New Zealand

Funding None.

Abstract

Background In recent years, a growing number of health chatbots have been published in app stores and described in the research literature. Given the sensitive data they process and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of these systems are reported inconsistently and without a standardized set of evaluation metrics. The lack of standards in health chatbot evaluation prevents comparison of systems and may hamper acceptability, since their reliability remains unclear.

Objectives This paper aims to take a step toward a health-specific chatbot evaluation framework by finding consensus on relevant metrics.

Methods We used an adapted Delphi study design to verify and select potential metrics that we had initially retrieved from a scoping review. Over three survey rounds, we invited researchers, health professionals, and health informaticians to score each metric for inclusion in the final evaluation framework. We distinguished metrics scored as relevant with high, moderate, and low consensus. The initial set comprised 26 metrics (categorized as global metrics, metrics related to response generation, metrics related to response understanding, and metrics related to aesthetics).
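The consensus grading described in the Methods can be illustrated with a small sketch. The abstract does not state the rating scale or the cut-offs used, so everything below is an assumption made for illustration: a 5-point scale on which scores of 4 or 5 count as "relevant", and hypothetical thresholds of 80% and 60% agreement for high and moderate consensus. The example scores are invented and are not study data.

```python
from typing import Dict, List

# Hypothetical values for illustration only; the abstract does not report them.
RELEVANT_MIN_SCORE = 4   # assumes a 5-point scale where 4 and 5 mean "relevant"
HIGH_CUTOFF = 0.80       # assumed share of experts needed for high consensus
MODERATE_CUTOFF = 0.60   # assumed share of experts needed for moderate consensus


def consensus_level(scores: List[int]) -> str:
    """Grade one metric by the share of experts who scored it as relevant."""
    if not scores:
        raise ValueError("at least one expert score is required")
    relevant = sum(1 for s in scores if s >= RELEVANT_MIN_SCORE)
    share = relevant / len(scores)
    if share >= HIGH_CUTOFF:
        return "high"
    if share >= MODERATE_CUTOFF:
        return "moderate"
    return "low"


def classify_metrics(ratings: Dict[str, List[int]]) -> Dict[str, str]:
    """Apply the grading to every metric in a survey round."""
    return {metric: consensus_level(scores) for metric, scores in ratings.items()}


if __name__ == "__main__":
    # Invented scores from six imaginary experts, not results from the study.
    example_round = {
        "ease of use": [5, 5, 4, 4, 5, 4],
        "appropriateness of responses": [5, 4, 4, 4, 2, 1],
        "font type and size": [3, 2, 4, 1, 3, 2],
    }
    for metric, level in classify_metrics(example_round).items():
        print(f"{metric}: {level} consensus")
```

Keeping the cut-offs as named constants makes it easy to substitute the thresholds a given Delphi protocol actually defines.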

Results Twenty-eight experts joined the first round and 22 (75%) persisted to the third round. Twenty-four metrics achieved high consensus and three metrics achieved moderate consensus. The core set for our framework comprises mainly global metrics (e.g., ease of use, security, content accuracy), metrics related to response generation (e.g., appropriateness of responses), and metrics related to response understanding. Metrics on aesthetics (font type and size, color) are less well agreed upon; only moderate or low consensus was achieved for those metrics.

Conclusion The results indicate that experts largely agree on metrics and that the consensus set is broad. This implies that health chatbot evaluation must be multifaceted to ensure acceptability.

Keywords

health chatbots - conversational agents - performance measures - evaluation framework - Delphi study

Authors' Contributions

J.W. and K.D. developed the study concept and protocol. A.A.-A. and K.D. conducted the study with the guidance of M.H. and J.W. A.A.-A. and K.D. drafted the manuscript; A.A.-A. summarized the study results; J.W. and K.D. interpreted the results and drew conclusions. All authors critically revised the manuscript for important intellectual content, approved it for publication, and agreed to be accountable for all aspects of the work.

Supplementary Material

Supplementary Material (PDF)

Publication History

Received: 27 June 2021

Accepted: 10 September 2021

Article published online: 31 October 2021

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

