
DOI: 10.1055/a-2637-2163
GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model

Abstract
Background and study aims
Current general-purpose artificial intelligence (AI) large language models (LLMs) demonstrate limited efficacy in clinical medicine, often constrained to question-answering, documentation, and literature summarization roles. We developed GastroGPT, a proof-of-concept specialty-specific, multi-task, clinical LLM, and evaluated its performance against leading general-purpose LLMs across key gastroenterology tasks and diverse case scenarios.
Methods
In this structured analysis, GastroGPT was compared with three state-of-the-art general-purpose LLMs (LLM-A: GPT-4, LLM-B: Bard, LLM-C: Claude). Models were assessed on seven clinical tasks and overall performance across 10 simulated gastroenterology cases varying in complexity, frequency, and patient demographics. Standardized prompts facilitated structured comparisons. A blinded expert panel rated model outputs per task on a 10-point Likert scale, judging clinical utility. Comprehensive statistical analyses were conducted.
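The abstract does not report the statistical workflow in detail. The sketch below is purely illustrative: the rating data are simulated and the choice of a Kruskal-Wallis omnibus test with pairwise Mann-Whitney comparisons is an assumption, not the authors' reported method. It shows one plausible way per-task 10-point Likert ratings from a blinded panel could be compared across the four models.

```python
# Illustrative sketch only: the data and the test choice are assumptions,
# not the analysis reported in the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical blinded-panel ratings (10-point Likert) for one clinical task,
# one array per model (e.g., 80 ratings each).
ratings = {
    "GastroGPT": rng.integers(6, 11, size=80),
    "GPT-4":     rng.integers(2, 9, size=80),
    "Bard":      rng.integers(2, 10, size=80),
    "Claude":    rng.integers(4, 10, size=80),
}

# Omnibus comparison across the four models (nonparametric, suited to ordinal scores).
h, p = stats.kruskal(*ratings.values())
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.4f}")

# Pairwise comparisons of GastroGPT against each general-purpose model.
for name in ("GPT-4", "Bard", "Claude"):
    u, p_pair = stats.mannwhitneyu(ratings["GastroGPT"], ratings[name],
                                   alternative="two-sided")
    print(f"GastroGPT vs {name}: U = {u:.0f}, p = {p_pair:.4f}")
```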
Results
A total of 2,240 expert ratings were obtained. GastroGPT achieved significantly higher mean overall scores (8.1 ± 1.8) than GPT-4 (5.2 ± 3.0), Bard (5.7 ± 3.3), and Claude (7.0 ± 2.7) (all P < 0.001). It outperformed the comparators in six of seven tasks (P < 0.05), the exception being follow-up planning. GastroGPT demonstrated superior score consistency (variance 34.95) versus the general-purpose models (97.4–260.35) (P < 0.001). Its performance remained consistent across case complexities and frequencies, unlike the comparators (P < 0.001). Multivariate analysis revealed that model type significantly predicted performance (P < 0.001).
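As a further illustration, again with simulated data and an assumed model specification rather than the authors' reported analysis, the variance comparison and the multivariate finding that model type predicts performance could be examined roughly as follows.

```python
# Illustrative sketch with simulated data; the specification is an assumption,
# not the analysis described in the abstract.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)
models = {"GastroGPT": 8.1, "GPT-4": 5.2, "Bard": 5.7, "Claude": 7.0}
complexities = ["low", "medium", "high"]

rows = []
for model, base in models.items():
    for complexity in complexities:
        # Hypothetical scores centered on each model's mean, clipped to the 1-10 scale.
        scores = np.clip(rng.normal(base, 2.0, size=60), 1, 10)
        rows += [{"model": model, "complexity": complexity, "score": s} for s in scores]
df = pd.DataFrame(rows)

# Homogeneity of variances across models (one way to quantify score consistency).
w, p_var = stats.levene(*[df.loc[df.model == m, "score"] for m in models])
print(f"Levene test: W = {w:.2f}, p = {p_var:.4f}")

# Linear model with model type and case complexity as categorical predictors.
fit = smf.ols("score ~ C(model) + C(complexity)", data=df).fit()
print(fit.summary().tables[1])
```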
Conclusions
This study pioneered the development of a specialty-specific, clinically oriented AI model and its comparison with general-purpose LLMs. GastroGPT demonstrated superior utility overall and on key gastroenterology tasks, highlighting the potential of tailored, task-focused AI models in medicine.
Keywords
Endoscopy Upper GI Tract - Reflux disease - Endoscopy Small Bowel - Inflammatory bowel disease - Neoplasia - Non-variceal bleeding - Pancreatobiliary (ERCP/PTCD)
Publication History
Received: 03 January 2025
Accepted after revision: 07 May 2025
Accepted Manuscript online: 16 June 2025
Article published online: 06 August 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/).
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany
Cem Simsek, Mete Ucdal, Enrique de-Madaria, Alanna Ebigbo, Petr Vanek, Omar Elshaarawy, Theodor Alexandru Voiosu, Giulio Antonelli, Román Turró, Javier P Gisbert, Olga P. Nyssen, Cesare Hassan, Helmut Messmann, Rajiv Jalan. GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model. Endosc Int Open 2025; 13: a26372163.
DOI: 10.1055/a-2637-2163