Appl Clin Inform 2026; 17(02): 194-203
DOI: 10.1055/a-2835-4634
Research Article

Comparing Large Language Models' Performances on Otolaryngology Knowledge Assessment Questions

Authors

  • Ryan Cook¹
  • Abner Kahan¹
  • Thomas Scharfenberger¹
  • Jason Tasoulas²
  • Noah Hawks-Ladds¹
  • Robert Chouake³
  • Sunit P. Jariwala⁴
  • Shitij Arora⁴

Affiliations

  1   Albert Einstein College of Medicine, Bronx, New York, United States
  2   Department of Otolaryngology-Head and Neck Surgery, Thomas Jefferson University, Philadelphia, Pennsylvania, United States
  3   Department of Otolaryngology-Head and Neck Surgery, Montefiore Einstein, Bronx, New York, United States
  4   Department of Medicine, Montefiore Einstein, Bronx, New York, United States

Abstract

Objectives

This study evaluates the performance of multiple large language models (LLMs) on specialized otolaryngology knowledge, comparing OpenAI's GPT-4 Turbo with 10 other commercially available models to assess their potential utility in otolaryngology medical education.

Methods

A total of 1,075 questions from OTO QUEST, the official self-assessment resource of the American Academy of Otolaryngology–Head and Neck Surgery, were administered to GPT-4 Turbo using a zero-shot prompting approach. Accuracy was modeled with logistic regression, adjusting for question difficulty, year, and subspecialty. GPT-4 Turbo's performance was then compared with that of 10 other commercial models (including Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o) on the same 1,075-question dataset, with differences across models assessed by Cochran's Q test (p < 0.001) and pairwise McNemar tests.
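The abstract does not include the authors' prompts or analysis code, but the statistical pipeline described above is straightforward to reproduce. The following minimal Python sketch using statsmodels illustrates the three steps on simulated data; the column names, category labels, and correctness matrix are hypothetical placeholders, not the study's actual data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(42)
n_questions, n_models = 1075, 11

# Step 1: logistic regression on one model's per-question accuracy,
# adjusted for difficulty, year, and subspecialty (all values simulated).
df = pd.DataFrame({
    "correct": rng.integers(0, 2, n_questions),
    "difficulty": rng.choice(["easy", "moderate", "hard"], n_questions),
    "year": rng.choice(["2021", "2022", "2023"], n_questions),
    "subspecialty": rng.choice(["rhinology", "otology", "practice_mgmt"], n_questions),
})
fit = smf.logit("correct ~ C(difficulty) + C(year) + C(subspecialty)", data=df).fit(disp=0)
print(np.exp(fit.params))  # exponentiated coefficients = odds ratios

# Step 2: Cochran's Q across all models on a binary correctness matrix
# (rows = questions, columns = models; 1 = answered correctly).
results = rng.integers(0, 2, size=(n_questions, n_models))
q = cochrans_q(results)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.3g}")

# Step 3: pairwise McNemar test between two models (columns 0 and 1),
# built from the 2x2 table of paired correct/incorrect outcomes.
a, b = results[:, 0], results[:, 1]
table = [[((a == 1) & (b == 1)).sum(), ((a == 1) & (b == 0)).sum()],
         [((a == 0) & (b == 1)).sum(), ((a == 0) & (b == 0)).sum()]]
m = mcnemar(table, exact=False, correction=True)
print(f"McNemar chi2 = {m.statistic:.2f}, p = {m.pvalue:.3g}")
```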

Results

GPT-4 Turbo achieved an overall accuracy of 72.09% (95% confidence interval [CI]: 69.3–74.7%) across the 1,075 questions. It performed best on Practice Management questions (odds ratio [OR] = 3.93, 95% CI: 1.12–13.73, p = 0.032), and accuracy declined on questions of moderate and hard difficulty relative to easy questions (OR = 0.21, 95% CI: 0.16–0.29, p < 0.001 and OR = 0.04, 95% CI: 0.01–0.10, p < 0.001, respectively). In the comparative analysis, Grok-3 ranked highest with 76.3% accuracy (95% CI: 73.6–78.7%), followed by Claude-3.5-Sonnet (73.0%, 95% CI: 70.3–75.6%) and GPT-4o (69.9%, 95% CI: 67.1–72.5%), with GPT-4 Turbo, accessed via the application programming interface, ranking fourth.
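The abstract does not state which interval method was used for the reported CIs. As a quick check, a Wilson score interval, a common choice for binomial proportions, reproduces the reported bounds for the overall accuracy, assuming 775 of 1,075 correct responses (which matches 72.09%). A minimal sketch:

```python
from statsmodels.stats.proportion import proportion_confint

# 775/1075 correct ≈ 72.09%; a Wilson score interval recovers the reported
# 95% CI of 69.3-74.7% (an assumption: the paper's exact method is not stated).
low, high = proportion_confint(count=775, nobs=1075, alpha=0.05, method="wilson")
print(f"accuracy = {775 / 1075:.2%}, 95% CI: {low:.1%}-{high:.1%}")
```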

Conclusion

This comprehensive model comparison reveals that while major commercial LLMs show promising capabilities in specialized medical knowledge assessment, they demonstrate an apparent accuracy plateau at approximately 73 to 76%. These findings suggest that current general-purpose LLMs may require specialized training approaches to advance beyond this performance threshold in medical domains.

Protection of Human and Animal Subjects

This study was performed in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects and was reviewed by the Albert Einstein College of Medicine Institutional Review Board.

Publication History

Received: 01 July 2025

Accepted after revision: 16 March 2026

Article published online: 06 April 2026

© 2026. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany