DOI: 10.1055/a-2818-1611
Evaluation of ChatGPT and Gemini in Answering Patient Questions after Gynecologic Surgery
Funding Information
No grants or industry funding was used in the preparation of this manuscript.
Abstract
Objective
This study aimed to evaluate the performance of the ChatGPT version 4.0 (GPT-4) and Gemini Advanced (Gemini) large language models (LLMs) in addressing common patient questions after gynecologic surgery with regard to accuracy, relevance, helpfulness to the average patient, and readability.
Methods
In this cross-sectional study, the two LLMs were prompted to answer questions simulating common patient concerns after gynecologic surgery, developed from expert opinion and compiled from anonymous posts on Reddit (r/endometriosis). Questions focused on six topics: endometriosis, vaginal bleeding, bowel/bladder function, incision care, resumption of activities, and sexual function. Each question was submitted in a systematic three-step process, with the model's memory reset after each query. Responses were then blinded and independently assessed for accuracy and relevance on a 5-point Likert scale by four board-certified gynecologic surgeons with fellowship training in gynecologic surgery; responses were also evaluated by three clinic nurses. Readability of the answers was calculated with the Flesch–Kincaid grade-level calculator.
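The abstract does not specify which calculator implementation the authors used, but the Flesch–Kincaid grade-level formula itself is standard: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The following is a minimal Python sketch, assuming a crude vowel-group syllable heuristic (published calculators typically use dictionary lookups, so scores may differ slightly); the helper names are illustrative, not from the study:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count vowel groups, dropping a silent trailing "e".
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_kincaid_grade(text: str) -> float:
    # FKGL = 0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / max(len(sentences), 1))
            + 11.8 * (syllables / max(len(words), 1))
            - 15.59)

# Example: a short, plain postoperative instruction scores low (easy to read).
print(round(flesch_kincaid_grade(
    "Keep the incision clean and dry. Call the clinic if you see redness."), 1))
```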
Results
A total of 41 questions were posed to GPT-4 and Gemini three times each. Responses were independently evaluated by four surgeons and three nurses, yielding a total of 1,968 evaluations of accuracy, relevance, helpfulness to the average patient, and readability. Surgeons and nurses graded Gemini responses as more accurate (4.23 vs. 4.03, p = 0.015) and more helpful (4.37 vs. 4.21, p = 0.025) than GPT-4 responses. Responses from both models were similarly rated as relevant or very relevant (4.45 vs. 4.36, p = 0.2). Most responses by GPT-4 (85%) and Gemini (87%) were consistent across all questions. The average reading levels of GPT-4 and Gemini responses were 11th and 10th grade, respectively, above the recommended 6th-grade reading level for patient information.
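The abstract reports group means with p-values but does not name the statistical test used. For ordinal Likert ratings, a nonparametric comparison such as the Mann–Whitney U test is one plausible choice; the sketch below, assuming invented placeholder scores rather than study data, shows how such a comparison could be run:

```python
from scipy.stats import mannwhitneyu

# Placeholder 5-point Likert accuracy ratings (illustration only, not study data).
gemini_scores = [5, 4, 4, 5, 4, 3, 5, 4, 4, 5]
gpt4_scores = [4, 4, 3, 4, 5, 3, 4, 4, 3, 4]

# Two-sided nonparametric comparison of the two rating distributions.
stat, p = mannwhitneyu(gemini_scores, gpt4_scores, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```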
Conclusion
GPT-4 and Gemini provided overall accurate, relevant, and helpful responses to common postoperative patient questions after gynecologic surgery. Gemini outperformed GPT-4 in both accuracy and helpfulness and produced objectively more readable responses.
Keywords
large language models - patient questions - postoperative instructions - patient–provider communication
Declaration of GenAI Use
No generative artificial intelligence (AI) or manuscript preparation assistance was used in the writing of this work.
Protection of Human and Animal Subjects
This study was reviewed by the Institutional Review Board of the Northwestern University Feinberg School of Medicine and determined to be exempt. Human and/or animal subjects were not included in this study.
Note
No additional persons contributed to the work reported in the manuscript outside of the manuscript authors.
This study was presented as a Rapid Oral Poster at the Society of Gynecologic Surgeons Annual Meeting 2025, March 30–April 2, 2025, in Palm Springs, California, United States.
Publication History
Received: 04 August 2025
Accepted after revision: 19 February 2026
Accepted Manuscript online: 24 February 2026
Article published online: 30 March 2026
© 2026. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany
