DOI: 10.1055/a-2835-4634
Comparing Large Language Models' Performances on Otolaryngology Knowledge Assessment Questions
Abstract
Objectives
This study evaluates the performance of multiple large language models (LLMs) on specialized otolaryngology knowledge, comparing OpenAI's GPT-4 Turbo with 10 commercially available models to assess their potential utility in otolaryngology medical education.
Methods
A total of 1,075 questions from OTO QUEST, the official self-assessment resource of the American Academy of Otolaryngology–Head and Neck Surgery, were administered to GPT-4 Turbo using a zero-shot approach. Accuracy was analyzed using logistic regression, adjusting for question difficulty, year, and subspecialty. Performance was then compared with 10 other commercial models (including Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o) on the same 1,075-question dataset, using Cochran's Q test (p < 0.001) and pairwise McNemar comparisons.
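For readers who want to replicate this kind of benchmark, the sketch below illustrates the two steps the Methods describe: one zero-shot query per question, followed by a Cochran's Q test and a pairwise McNemar comparison across models. This is a minimal sketch under stated assumptions, not the authors' code: the question format, the `model_correctness.csv` file, the model identifier string, and the naive single-letter answer parsing are all hypothetical placeholders.

```python
# Sketch of a zero-shot multiple-choice benchmark plus the model-comparison
# tests named in the Methods. Assumptions (not from the paper): questions
# arrive as (stem, options) pairs, and per-model grading results are stored
# as a 0/1 correctness matrix with one row per question, one column per model.
import numpy as np
from openai import OpenAI  # official OpenAI Python client
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_zero_shot(stem: str, options: dict[str, str]) -> str:
    """Send one item with no worked examples in the prompt (zero-shot)."""
    prompt = stem + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # model identifier is an assumption
        messages=[
            {"role": "system",
             "content": "Answer with the single letter of the best option."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()[0]  # naive letter parse

# correctness: shape (n_questions, n_models); 1 = correct, 0 = incorrect.
correctness = np.loadtxt("model_correctness.csv", delimiter=",")  # hypothetical file

# Cochran's Q: do the models differ overall on the same paired items?
q = cochrans_q(correctness)
print(f"Cochran's Q = {q.statistic:.1f}, p = {q.pvalue:.2g}")

# Pairwise McNemar between two models, using the 2x2 table of paired outcomes.
a, b = correctness[:, 0].astype(bool), correctness[:, 1].astype(bool)
table = [[np.sum(a & b), np.sum(a & ~b)],
         [np.sum(~a & b), np.sum(~a & ~b)]]
print(mcnemar(table, exact=False, correction=True))
```

Note that with 11 models there are 55 possible pairwise McNemar tests, so in practice the p-values would need a multiple-comparison adjustment (e.g., Bonferroni).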
Results
GPT-4 Turbo achieved an overall accuracy of 72.09% (95% confidence interval [CI]: 69.3–74.7%) across the 1,075 questions. It performed best on Practice Management questions (odds ratio [OR] = 3.93, 95% CI: 1.12–13.73, p = 0.032), and its accuracy declined on questions of moderate and hard difficulty (OR = 0.21, 95% CI: 0.16–0.29, p < 0.001 and OR = 0.04, 95% CI: 0.01–0.10, p < 0.001, respectively). In the comparative analysis, Grok-3 ranked highest with 76.3% accuracy (95% CI: 73.6–78.7%), followed by Claude-3.5-Sonnet (73.0%, 95% CI: 70.3–75.6%) and GPT-4o (69.9%, 95% CI: 67.1–72.5%), with GPT-4 Turbo (accessed via its application programming interface) ranking fourth.
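As a sanity check on the headline figure, the reported interval is consistent with a standard score interval for a binomial proportion. The snippet below is an illustration, not the authors' analysis code: the paper does not state which CI method was used, and Wilson's score interval is assumed here because it reproduces the reported bounds to rounding.

```python
# Reproducing GPT-4 Turbo's headline accuracy and its 95% CI.
# Assumptions: 72.09% of 1,075 implies 775 correct answers, and the CI
# method (Wilson's score interval) is a guess that matches 69.3-74.7%.
from statsmodels.stats.proportion import proportion_confint

n_total = 1075
n_correct = round(0.7209 * n_total)   # 775 correct answers
accuracy = n_correct / n_total
low, high = proportion_confint(n_correct, n_total, alpha=0.05, method="wilson")
print(f"accuracy = {accuracy:.2%}, 95% CI: {low:.1%} to {high:.1%}")
# -> accuracy = 72.09%, 95% CI: 69.3% to 74.7%
```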
Conclusion
This comprehensive model comparison reveals that while major commercial LLMs show promising capabilities in specialized medical knowledge assessment, they demonstrate an apparent accuracy plateau around 73 to 76%. These findings suggest current general-purpose LLMs may require specialized training approaches to advance beyond this performance threshold in medical domains.
Keywords
artificial intelligence - large language models - medical education - otolaryngology training - benchmark comparison
Protection of Human and Animal Subjects
This study was performed in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects and was reviewed by the Albert Einstein College of Medicine Institutional Review Board.
Publication History
Received: 01 July 2025
Accepted after revision: 16 March 2026
Article published online: 06 April 2026
© 2026. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany
