Abstract
Objectives The main objective of this study is to evaluate the ability of the Large Language
Model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer United
States Medical Licensing Examination (USMLE) board-style medical ethics questions
compared with medical knowledge questions. This study has the additional objectives
of comparing the overall accuracy of GPT-3.5 with that of GPT-4 and assessing the variability
of the responses given by each version.
Methods Using AMBOSS, a third-party USMLE Step Exam test prep service, we selected one group
of 27 medical ethics questions and a second group of 27 medical knowledge questions
matched on question difficulty for medical students. We ran 30 trials in which we posed these
questions to GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability
regression model evaluated accuracy, and a Shannon entropy calculation evaluated response
variation.
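As a minimal sketch of the variability measure, the following illustrates how Shannon entropy can be computed for one question from the answer choices a model gives across repeated trials; the base-2 logarithm, the function name, and the example answer tallies are illustrative assumptions, not the study's exact procedure.

    from collections import Counter
    from math import log2

    def shannon_entropy(answers):
        # Shannon entropy (in bits) of the answer choices given to one
        # question across repeated trials: 0 means every trial returned
        # the same choice; higher values mean more response variability.
        counts = Counter(answers)
        total = len(answers)
        return -sum((n / total) * log2(n / total) for n in counts.values())

    # Hypothetical example: 30 trials of one question, 27 answers of "B" and 3 of "D"
    trial_answers = ["B"] * 27 + ["D"] * 3
    print(round(shannon_entropy(trial_answers), 2))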
Results Both versions of ChatGPT demonstrated worse performance on medical ethics questions
than on medical knowledge questions. GPT-4 performed 18 percentage points (p < 0.05) worse on medical ethics questions than on medical knowledge questions,
and GPT-3.5 performed 7 percentage points (p = 0.41) worse. GPT-4 outperformed GPT-3.5 by 22 percentage points (p < 0.001) on medical ethics and by 33 percentage points (p < 0.001) on medical knowledge. GPT-4 also exhibited lower overall Shannon entropy
for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than
GPT-3.5 (0.59 and 0.55, respectively), indicating lower response variability.
Conclusion Both versions of ChatGPT performed more poorly on medical ethics questions
than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 in overall
accuracy and exhibited significantly lower variability in its answer choices.
These findings underscore the need for ongoing assessment of ChatGPT versions for medical education.
Keywords
ChatGPT - large language model - artificial intelligence - medical education - USMLE
- ethics