DOI: 10.1055/a-2551-2131
Solving Complex Pediatric Surgical Case Studies: A Comparative Analysis of Copilot, ChatGPT-4, and Experienced Pediatric Surgeons' Performance

Abstract
Introduction
The emergence of large language models (LLMs) has led to notable advancements across multiple sectors, including medicine. Yet their impact in pediatric surgery remains largely unexplored. This study aims to assess the ability of the artificial intelligence (AI) models ChatGPT-4 and Microsoft Copilot to propose diagnostic procedures, primary diagnoses, and differential diagnoses, and to answer clinical questions, using complex clinical case vignettes of classic pediatric surgical diseases.
Methods
We conducted the study in April 2024. We evaluated the performance of the LLMs on 13 complex clinical case vignettes of pediatric surgical diseases and compared their responses with those of a cohort of experienced pediatric surgeons. Additionally, pediatric surgeons rated the diagnostic recommendations of the LLMs for completeness and accuracy. Statistical analyses were performed to identify differences in performance.
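The abstract does not describe the prompting protocol in detail; a minimal sketch of such a vignette evaluation loop, assuming the OpenAI Python client and hypothetical placeholder vignettes (the study's 13 cases are not reproduced here), might look as follows:

```python
# Hypothetical sketch of a vignette evaluation loop; the study's actual
# prompting protocol is not specified in the abstract.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder vignettes standing in for the 13 study cases.
vignettes = [
    "A 3-week-old boy presents with projectile non-bilious vomiting ...",
    # ... further cases
]

PROMPT = (
    "For the following pediatric case, list (1) recommended diagnostic "
    "procedures, (2) the most likely primary diagnosis, and "
    "(3) differential diagnoses.\n\nCase: {case}"
)

responses = []
for case in vignettes:
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(case=case)}],
    )
    responses.append(completion.choices[0].message.content)
```

The collected responses could then be scored against a predefined answer key and rated by clinicians, as described above.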
Results
ChatGPT-4 achieved a higher overall test score (52.1%) than Copilot (47.9%), but both scored lower than the pediatric surgeons (68.8%). The overall differences in performance between ChatGPT-4, Copilot, and the pediatric surgeons were statistically significant (p < 0.01). ChatGPT-4 outperformed Copilot in generating differential diagnoses (p < 0.05). No statistically significant differences were found between the AI models regarding suggestions for diagnostics and primary diagnosis. Overall, pediatric surgeons rated the recommendations of the LLMs as average.
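The specific statistical tests are not named in the abstract; one plausible way to compare per-case scores across the three groups is a Kruskal-Wallis omnibus test with pairwise follow-up, sketched here with SciPy on invented placeholder scores:

```python
# Sketch of a three-group comparison on per-vignette scores. The data
# below are invented placeholders; the abstract reports only aggregate
# percentages and p-values.
from scipy import stats

# Hypothetical fraction-of-points-achieved per vignette (13 cases).
chatgpt4 = [0.6, 0.4, 0.5, 0.7, 0.5, 0.4, 0.6, 0.5, 0.5, 0.4, 0.6, 0.5, 0.6]
copilot  = [0.5, 0.4, 0.4, 0.6, 0.5, 0.4, 0.5, 0.5, 0.4, 0.4, 0.5, 0.5, 0.6]
surgeons = [0.8, 0.6, 0.7, 0.8, 0.7, 0.6, 0.7, 0.7, 0.6, 0.7, 0.7, 0.6, 0.8]

# Omnibus test for any difference among the three groups.
h_stat, p_omnibus = stats.kruskal(chatgpt4, copilot, surgeons)
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_omnibus:.4f}")

# Pairwise follow-up, e.g. ChatGPT-4 vs. Copilot.
u_stat, p_pair = stats.mannwhitneyu(chatgpt4, copilot, alternative="two-sided")
print(f"ChatGPT-4 vs Copilot: U={u_stat:.1f}, p={p_pair:.4f}")
```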
Conclusion
This study reveals significant limitations in the performance of AI models in pediatric surgery. Although LLMs exhibit potential across various areas, their reliability and accuracy in handling clinical decision-making tasks are limited. Further research is needed to improve AI capabilities and to establish their usefulness in the clinical setting.
Keywords
large language models - natural language processing - case studies - pediatric surgery - artificial intelligence

Publication History
Received: 11 February 2025
Accepted: 04 March 2025
Accepted Manuscript online: 05 March 2025
Article published online: 02 April 2025
© 2025. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany