
DOI: 10.1055/s-0045-1813040
Human Expertise Outperforms Artificial Intelligence in Medical Education Assessments: MCQ Creation Highlights the Irreplaceable Role of Teachers
Abstract
Introduction
Multiple-choice questions (MCQs) are vital assessment tools in education because they allow direct measurement of knowledge, skills, and competencies across a wide range of disciplines. While artificial intelligence (AI) holds promise as a supplementary tool in medical education, particularly for generating large volumes of practice questions, it cannot yet replace the nuanced, expert-driven process of question creation that human educators provide. This study addresses that gap by directly comparing AI-generated and expert-generated MCQs with regard to difficulty index, discrimination index, and distractor efficiency.
Materials and Methods
A total of 50 medical students answered a set of 50 randomized, blinded MCQs that had been validated by human physiology experts. Of these, 25 were generated by AI and the remaining 25 were written by qualified, experienced professors. Within the item response theory (IRT) framework, we calculated key metrics including item reliability, difficulty index, discrimination index, and distractor functionality.
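The abstract does not specify the exact scoring pipeline, but the three item-analysis metrics it names have conventional classical definitions: the difficulty index is the proportion of examinees answering an item correctly, the discrimination index is the difference in that proportion between the top and bottom 27% of examinees ranked by total score, and distractor efficiency is the share of incorrect options chosen by at least 5% of examinees. A minimal Python sketch with simulated responses (all data and array names here are illustrative assumptions, not the study's actual data) shows how these quantities can be computed:

```python
import numpy as np

# Hypothetical response matrix: rows = students, columns = items.
# responses[s, i] is the option (0-3) chosen by student s on item i;
# key[i] is the index of the correct option for item i.
rng = np.random.default_rng(0)
n_students, n_items, n_options = 50, 50, 4
responses = rng.integers(0, n_options, size=(n_students, n_items))
key = rng.integers(0, n_options, size=n_items)

scored = (responses == key).astype(int)   # 1 = correct, 0 = incorrect
total_scores = scored.sum(axis=1)

# Difficulty index: proportion of students answering each item correctly.
difficulty = scored.mean(axis=0)

# Discrimination index: difference in item difficulty between the top 27%
# and bottom 27% of students ranked by total score (classical item analysis).
k = int(round(0.27 * n_students))
order = np.argsort(total_scores)
low, high = order[:k], order[-k:]
discrimination = scored[high].mean(axis=0) - scored[low].mean(axis=0)

# Distractor efficiency: share of incorrect options ("distractors")
# chosen by at least 5% of students (i.e., functional distractors).
def distractor_efficiency(item):
    counts = np.bincount(responses[:, item], minlength=n_options)
    distractors = np.delete(counts, key[item])
    functional = (distractors / n_students >= 0.05).sum()
    return functional / len(distractors)

de = np.array([distractor_efficiency(i) for i in range(n_items)])
print(difficulty[:5], discrimination[:5], de[:5])
```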
Results
The difficulty index of AI-generated MCQs (mean = 0.62, SD = 0.14) was comparable to that of expert-generated questions, with no statistically significant difference (p = 0.45). However, significant differences emerged in other key quality metrics. The discrimination index, which reflects a question's ability to distinguish between high- and low-performing students, was notably higher for expert-created MCQs (mean = 0.48, SD = 0.12) than for AI-generated items (mean = 0.32, SD = 0.10), indicating a moderate-to-large effect (p = 0.0082, chi-square = 11.7, df = 3). Similarly, distractor efficiency (DE), which evaluates the effectiveness of the incorrect answer options, was significantly greater for expert-authored questions (mean = 0.24, SD = 7.2) than for AI-generated items (mean = 0.4, SD = 8.1), with a moderate effect size (p = 0.0001, chi-square = 26.2, df = 2). These findings suggest that while AI can replicate human-level difficulty, expert involvement remains crucial for ensuring high-quality discrimination and distractor performance in MCQ design.
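The abstract reports chi-square statistics but not the underlying contingency tables. A hedged sketch of how such a comparison could be run, using an invented 2 x 4 table of AI versus expert items binned by discrimination-index band (the counts are purely hypothetical; only the table shape, giving df = 3, is taken from the reported result):

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of items per discrimination band
# (poor, acceptable, good, excellent) for each author type.
table = [
    [9, 8, 5, 3],   # AI-generated items (invented counts)
    [3, 5, 8, 9],   # expert-generated items (invented counts)
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, df = {dof}, p = {p:.4f}")
```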
Conclusion
The findings suggest that AI holds promise, particularly in generating questions of appropriate difficulty, but human expertise remains essential in crafting high-quality assessments that effectively differentiate between levels of student performance and challenge students' critical thinking. As AI technology continues to evolve, ongoing research and careful implementation will be essential in ensuring that AI contributes positively to medical education.
Publication History
Article published online:
November 19, 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)
Thieme Medical and Scientific Publishers Pvt. Ltd.
A-12, 2nd Floor, Sector 2, Noida-201301 UP, India
