Open Access
CC BY 4.0 · Endosc Int Open 2025; 13: a26372163
DOI: 10.1055/a-2637-2163
Original article

GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model

Cem Simsek
1   Gastroenterology & Hepatology, Johns Hopkins Medical Institutions Campus, Baltimore, United States (Ringgold ID: RIN1501)
,
Mete Ucdal
2   Internal Medicine, Hacettepe University Faculty of Medicine, Ankara, Turkey (Ringgold ID: RIN64005)
,
Enrique de-Madaria
3   Dr Balmis General University Hospital, Alicante, Spain (Ringgold ID: RIN16802)
,
Alanna Ebigbo
4   Division of Gastroenterology, Universitätsklinikum Augsburg, Augsburg, Germany (Ringgold ID: RIN39694)
,
Petr Vanek
5   Palacky University Olomouc, Olomouc, Czech Republic (Ringgold ID: RIN48207)
,
Omar Elshaarawy
6   Liverpool University Hospitals NHS Foundation Trust, Liverpool, United Kingdom of Great Britain and Northern Ireland (Ringgold ID: RIN4595)
7   National Liver Institute, Shebeen El-Kom, Egypt (Ringgold ID: RIN68873)
,
Theodor Alexandru Voiosu
8   Gastroenterology, Colentina Hospital, Bucharest, Romania
,
Giulio Antonelli
9   Digestive and Liver Disease Unit, Azienda Ospedaliera Sant'Andrea, Sapienza University of Rome, Rome, Italy (Ringgold ID: RIN117698)
,
Román Turró
10   Endoscopy Unit, Teknon Medical Center, Barcelona, Spain
,
Javier P Gisbert
11   Division of Gastroenterology, Faculty of Medicine, Hospital Universitario de la Princesa, Madrid, Spain (Ringgold ID: RIN16517)
,
Olga P. Nyssen
12   Hospital Universitario de la Princesa, Madrid, Spain (Ringgold ID: RIN16517)
,
Cesare Hassan
13   Digestive Endoscopy Unit, Department of Gastroenterology, Humanitas Research Hospital, Milan, Italy (Ringgold ID: RIN551905)
,
Helmut Messmann
4   Division of Gastroenterology, Universitätsklinikum Augsburg, Augsburg, Germany (Ringgold ID: RIN39694)
,
Rajiv Jalan
14   University College Hospital London Medical School, London, United Kingdom of Great Britain and Northern Ireland (Ringgold ID: RIN9687)

Abstract

Background and study aims

Current general-purpose artificial intelligence (AI) large language models (LLMs) demonstrate limited efficacy in clinical medicine, often constrained to question-answering, documentation, and literature summarization roles. We developed GastroGPT, a proof-of-concept specialty-specific, multi-task, clinical LLM, and evaluated its performance against leading general-purpose LLMs across key gastroenterology tasks and diverse case scenarios.

Methods

In this structured analysis, GastroGPT was compared with three state-of-the-art general-purpose LLMs (LLM-A: GPT-4, LLM-B: Bard, LLM-C: Claude). Each model was assessed on seven clinical tasks, and on overall performance, across 10 simulated gastroenterology cases varying in complexity, frequency, and patient demographics. Standardized prompts ensured like-for-like comparisons. A blinded expert panel rated each model's output on every task using a 10-point Likert scale of clinical utility. Comprehensive statistical analyses were then conducted.
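The paper does not publish its analysis code. As a minimal illustrative sketch of how such blinded panel ratings could be aggregated and compared, the Python snippet below assumes a hypothetical table of ratings (file name panel_ratings.csv and columns model, case_id, task, and rating are assumptions, not the authors' materials). A Kruskal-Wallis test is one standard nonparametric choice for comparing ordinal Likert scores across more than two groups.

```python
# Minimal sketch only -- not the authors' analysis code.
# Assumes a hypothetical CSV of blinded panel ratings with columns:
# model, case_id, task, rating (10-point Likert clinical-utility score).
import pandas as pd
from scipy import stats

ratings = pd.read_csv("panel_ratings.csv")  # hypothetical file name

# Mean, SD, and variance of ratings per model, as summarized in the Results.
print(ratings.groupby("model")["rating"].agg(["mean", "std", "var"]))

# Omnibus comparison of the four models with a Kruskal-Wallis test,
# a common nonparametric test for ordinal (Likert) data.
groups = [g.to_numpy() for _, g in ratings.groupby("model")["rating"]]
h_stat, p_value = stats.kruskal(*groups)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.3g}")
```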

Results

A total of 2,240 expert ratings were obtained. GastroGPT achieved a significantly higher mean overall score (8.1 ± 1.8) than GPT-4 (5.2 ± 3.0), Bard (5.7 ± 3.3), and Claude (7.0 ± 2.7) (all P < 0.001). It outperformed the comparators on six of seven tasks (P < 0.05); the exception was follow-up planning. GastroGPT also demonstrated superior score consistency (variance 34.95) versus the general-purpose models (97.4–260.35) (P < 0.001), and its performance remained stable across case complexities and frequencies, unlike that of the comparators (P < 0.001). Multivariate analysis confirmed model type as a significant predictor of performance (P < 0.001).
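The consistency claim is a homogeneity-of-variance comparison. As a hedged sketch of one standard way to test it, reusing the hypothetical ratings table from the sketch above, Levene's test with median centering (the Brown-Forsythe variant) is robust to the non-normality typical of Likert ratings; the model labels below are assumptions.

```python
# Sketch of a homogeneity-of-variance check; model names are assumed.
from scipy import stats

by_model = {m: g.to_numpy()
            for m, g in ratings.groupby("model")["rating"]}

# Median-centered Levene (Brown-Forsythe) test: robust for
# non-normal, ordinal rating data.
w_stat, p_value = stats.levene(*by_model.values(), center="median")
print(f"Levene W = {w_stat:.2f}, p = {p_value:.3g}")
```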

Conclusions

This study pioneered the development of a specialty-specific, clinically oriented AI model and its comparison with general-purpose LLMs. GastroGPT demonstrated superior utility overall and on key gastroenterology tasks, highlighting the potential for tailored, task-focused AI models in medicine.



Publication History

Received: 03 January 2025

Accepted after revision: 07 May 2025

Accepted Manuscript online: 16 June 2025

Article published online: 06 August 2025

© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/).

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

Bibliographical Record
Cem Simsek, Mete Ucdal, Enrique de-Madaria, Alanna Ebigbo, Petr Vanek, Omar Elshaarawy, Theodor Alexandru Voiosu, Giulio Antonelli, Román Turró, Javier P Gisbert, Olga P. Nyssen, Cesare Hassan, Helmut Messmann, Rajiv Jalan. GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model. Endosc Int Open 2025; 13: a26372163.
DOI: 10.1055/a-2637-2163
 