
DOI: 10.1055/a-2637-2163
GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model

Abstract
Background and study aims
Current general-purpose artificial intelligence (AI) large language models (LLMs) demonstrate limited efficacy in clinical medicine, often constrained to question-answering, documentation, and literature summarization roles. We developed GastroGPT, a proof-of-concept specialty-specific, multi-task, clinical LLM, and evaluated its performance against leading general-purpose LLMs across key gastroenterology tasks and diverse case scenarios.
Methods
In this structured analysis, GastroGPT was compared with three state-of-the-art general-purpose LLMs (LLM-A: GPT-4, LLM-B: Bard, LLM-C: Claude). Models were assessed on seven clinical tasks and overall performance across 10 simulated gastroenterology cases varying in complexity, frequency, and patient demographics. Standardized prompts facilitated structured comparisons. A blinded expert panel rated model outputs per task on a 10-point Likert scale, judging clinical utility. Comprehensive statistical analyses were conducted.
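The abstract does not report the statistical workflow in detail. The sketch below is purely illustrative: the rating data are simulated and the choice of a Kruskal-Wallis omnibus test with pairwise Mann-Whitney comparisons is an assumption, not the authors' reported method. It shows one plausible way per-task 10-point Likert ratings from a blinded panel could be compared across the four models.

```python
# Illustrative sketch only: the data and the test choice are assumptions,
# not the analysis reported in the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical blinded-panel ratings (10-point Likert) for one clinical task,
# one array per model (e.g., 80 ratings each).
ratings = {
    "GastroGPT": rng.integers(6, 11, size=80),
    "GPT-4":     rng.integers(2, 9, size=80),
    "Bard":      rng.integers(2, 10, size=80),
    "Claude":    rng.integers(4, 10, size=80),
}

# Omnibus comparison across the four models (nonparametric, suited to ordinal scores).
h, p = stats.kruskal(*ratings.values())
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.4f}")

# Pairwise comparisons of GastroGPT against each general-purpose model.
for name in ("GPT-4", "Bard", "Claude"):
    u, p_pair = stats.mannwhitneyu(ratings["GastroGPT"], ratings[name],
                                   alternative="two-sided")
    print(f"GastroGPT vs {name}: U = {u:.0f}, p = {p_pair:.4f}")
```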
Results
A total of 2,240 expert ratings were obtained. GastroGPT achieved significantly higher mean overall scores (8.1 ± 1.8) than GPT-4 (5.2 ± 3.0), Bard (5.7 ± 3.3), and Claude (7.0 ± 2.7) (all P < 0.001). It outperformed the comparators in six of seven tasks (P < 0.05), the exception being follow-up planning. GastroGPT demonstrated superior score consistency (variance 34.95) versus the general-purpose models (97.4–260.35) (P < 0.001). Its performance remained consistent across case complexities and frequencies, unlike the comparators (P < 0.001). Multivariate analysis revealed that model type significantly predicted performance (P < 0.001).
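As a further illustration, again with simulated data and an assumed model specification rather than the authors' reported analysis, the variance comparison and the multivariate finding that model type predicts performance could be examined roughly as follows.

```python
# Illustrative sketch with simulated data; the specification is an assumption,
# not the analysis described in the abstract.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)
models = {"GastroGPT": 8.1, "GPT-4": 5.2, "Bard": 5.7, "Claude": 7.0}
complexities = ["low", "medium", "high"]

rows = []
for model, base in models.items():
    for complexity in complexities:
        # Hypothetical scores centered on each model's mean, clipped to the 1-10 scale.
        scores = np.clip(rng.normal(base, 2.0, size=60), 1, 10)
        rows += [{"model": model, "complexity": complexity, "score": s} for s in scores]
df = pd.DataFrame(rows)

# Homogeneity of variances across models (one way to quantify score consistency).
w, p_var = stats.levene(*[df.loc[df.model == m, "score"] for m in models])
print(f"Levene test: W = {w:.2f}, p = {p_var:.4f}")

# Linear model with model type and case complexity as categorical predictors.
fit = smf.ols("score ~ C(model) + C(complexity)", data=df).fit()
print(fit.summary().tables[1])
```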
Conclusions
This study pioneered the development of a specialty-specific, clinically oriented AI model and its comparison with general-purpose LLMs. GastroGPT demonstrated superior utility overall and on key gastroenterology tasks, highlighting the potential of tailored, task-focused AI models in medicine.
Keywords
Endoscopy Upper GI Tract - Reflux disease - Endoscopy Small Bowel - Inflammatory bowel disease - Neoplasia - Non-variceal bleeding - Pancreatobiliary (ERCP/PTCD)
Publication History
Received: 03 January 2025
Accepted after revision: 07 May 2025
Accepted Manuscript online: 16 June 2025
Article published online: 06 August 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/).
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany
Cem Simsek, Mete Ucdal, Enrique de-Madaria, Alanna Ebigbo, Petr Vanek, Omar Elshaarawy, Theodor Alexandru Voiosu, Giulio Antonelli, Román Turró, Javier P Gisbert, Olga P. Nyssen, Cesare Hassan, Helmut Messmann, Rajiv Jalan. GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model. Endosc Int Open 2025; 13: a26372163.
DOI: 10.1055/a-2637-2163