DOI: 10.1055/s-0045-1804267
Large Language Models Provide Impressive Answers to Complex Questions from Parents in Pediatric Cardiology and Pediatric Cardiac Surgery, But Detecting Errors is Challenging
Background: The complexity of pediatric cardiology cases challenges clinicians not only in decision-making but also in communication with parents. Parents, as nonexperts, rely heavily on detailed discussions with health care providers to participate in decisions. Additionally, parents use the internet to find specific information regarding their child’s health condition. The introduction of generative pretrained transformer (GPT)-based large language models (LLMs) adds a new way to explain complex medical information to different audiences. This study evaluates the quality of GPT LLMs in answering complex medical questions from parents and introduces a potential measurement framework for evaluating future LLMs in pediatrics.
Methods: Four expert pediatric cardiologists and pediatric cardiac surgeons generated 19 typical questions as frequently posed by parents. We prompted these questions to GPT-3.5, GPT-4, and GPT-4 Turbo Preview. The GPT-4 Turbo Preview model was refined by incorporating the guidelines of the German Society for Pediatric Cardiology via a retrieval function. We prompted the LLMs to provide reliable and empathetic answers tailored to parents as nonexperts. The responses were evaluated on relevance, factual accuracy, severity of possible harm, completeness, superfluous content, age-related appropriateness, degree of empathy, and understandability, each rated from 0 (very bad) to 7 (very good).
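For readers who want to set up a comparable prompting workflow, the following minimal Python sketch poses a parent question to the three model variants via the OpenAI chat completions API. The system prompt wording, the model identifiers, and the inline guideline excerpt (a simple stand-in for the study's retrieval function) are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a prompting setup similar to the one described above.
# Assumes the OpenAI Python SDK (>=1.0); prompt wording and the inline
# guideline excerpt are illustrative assumptions, not the study's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a pediatric cardiologist. Answer the parent's question reliably "
    "and empathetically, in language a medical layperson can understand."
)

def answer_parent_question(question: str, guideline_excerpt: str | None = None,
                           model: str = "gpt-4-turbo-preview") -> str:
    """Return the model's answer, optionally grounded in guideline text."""
    system = SYSTEM_PROMPT
    if guideline_excerpt:
        # Stand-in for the retrieval step: supply relevant guideline passages as context.
        system += "\n\nRelevant guideline excerpts:\n" + guideline_excerpt
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Example: the same question posed to the three model variants.
question = "Our daughter has tetralogy of Fallot. When will she need surgery?"
for m in ["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo-preview"]:
    print(m, "->", answer_parent_question(question, model=m)[:200])
```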
Results: Most answers were detailed, extensive, and appeared convincing. Average ratings (ARs) across all LLMs were 5.9 for relevance, 5.4 for factual accuracy, 5.7 for severity of possible harm, 4.9 for completeness, and 5.8 for superfluous content. Regarding audience-specific tailoring, empathy received an AR of 5.2 and understandability an AR of 6.1. All models showed notable difficulties addressing age-related aspects of the questions (AR 3.7). Concerning potential danger to patients, 5 out of 57 answers received a rating below 4 for factual accuracy (meaning “somewhat incorrect” or worse), and 4 out of 57 answers received a rating below 4 for the severity of potential harm (meaning “moderately harmful” or worse). The inclusion of guideline knowledge did not appear to yield noticeably better answers.
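The reported aggregates can be illustrated with a small, hypothetical example of the rating framework: average scores per dimension and a count of answers falling below the threshold of 4 on factual accuracy or severity of potential harm. The data structure and values below are invented for demonstration and do not reproduce the study's ratings.

```python
# Hypothetical aggregation of expert ratings on the 0 (very bad) to 7 (very good) scale.
# The ratings shown here are made-up examples, not the study's data.
from statistics import mean

DIMENSIONS = ["relevance", "factual_accuracy", "severity_of_harm", "completeness",
              "superfluous_content", "age_appropriateness", "empathy", "understandability"]

# ratings[answer_id][dimension] -> expert score for that answer
ratings = {
    "q01_gpt35": {"relevance": 6, "factual_accuracy": 5, "severity_of_harm": 6,
                  "completeness": 5, "superfluous_content": 6,
                  "age_appropriateness": 4, "empathy": 5, "understandability": 6},
    "q01_gpt4":  {"relevance": 6, "factual_accuracy": 3, "severity_of_harm": 3,
                  "completeness": 5, "superfluous_content": 6,
                  "age_appropriateness": 3, "empathy": 5, "understandability": 7},
}

# Average rating per dimension across all answers
averages = {d: mean(r[d] for r in ratings.values()) for d in DIMENSIONS}

# Answers flagged as potentially problematic (below 4 on accuracy or harm severity)
flagged = [a for a, r in ratings.items()
           if r["factual_accuracy"] < 4 or r["severity_of_harm"] < 4]

print(averages)
print("flagged answers:", flagged)
```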
Conclusion: This study highlights the potential and limitations of GPT LLMs in addressing complex questions that parents might ask their physician or look up on the internet. The answers mostly appeared very convincing, making incorrect information harder for nonexperts to detect. With further development, LLMs might be helpful not only in clinical decision support but also as a useful additional tool for patient education.
The authors declare that there is no conflict of interest.
Publication History
Article published online:
February 11, 2025
© 2025. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany