Eur J Pediatr Surg
DOI: 10.1055/a-2722-3871
Review Article

The Pediatric Surgeon's AI Toolbox: How Large Language Models Like ChatGPT Are Simplifying Practice and Expanding Global Access

Abstract

Introduction

Pediatric surgeons face a substantial administrative workload. Large language models (LLMs) may streamline documentation, family communication, rapid reference, and education, but they raise concerns about accuracy, bias, and privacy. This review summarizes practical, near-term uses under clinician oversight.

Materials and Methods

This is a narrative review of LLMs in pediatric surgical workflows and scholarly writing. Sources included MEDLINE/PubMed, Scopus, Embase, Google Scholar, and policy documents (WHO, FDA, EU). Searches covered January 2015 to August 2025 and were limited to English. Peer-reviewed and multicenter studies were prioritized; selected high-signal preprints were included and labeled as such. Screening and data extraction were performed by the author, and findings were synthesized qualitatively.

Results

Across studies, LLMs reduced drafting time for discharge letters and operative note registries while maintaining clinician-rated quality; they improved the readability of consent forms and postoperative instructions and supported patient education. For decision support, general models performed well on structured medical questions, with stronger results when grounded by retrieval. Common limitations included poor medical coding performance, weak case-nuance and temporal reasoning, variable translation quality outside high-resource languages, and citation fabrication in the absence of curated sources. Privacy risks stemmed from logging, rare-string memorization, and poorly scoped tool connections. Recommended controls included a clinician-in-the-loop “review and release” workflow, privacy-preserving deployments, version pinning, and ongoing monitoring aligned with early-evaluation guidance.

Conclusion

When outputs are grounded in structured EHR data or curated retrieval and briefly reviewed by clinicians, LLMs can responsibly reduce administrative burden and support communication and education. Early adoption should target high-volume, low-risk, auditable tasks. Future priorities must include multicenter pediatric datasets, transparent benchmarks (accuracy, calibration, equity, time saved), and prospective studies linked to safety outcomes.



Publication History

Received: 24 September 2025

Accepted: 13 October 2025

Accepted Manuscript online: 14 October 2025

Article published online: 03 November 2025

© 2025. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

 