Obstet Gynecol Sci. 2024 Nov;67(6):550-556. doi: 10.5468/ogs.24211.

Efficacy of large language models and their potential in Obstetrics and Gynecology education

Affiliations
  • 1. Department of Obstetrics and Gynecology, Yongin Severance Hospital, Yonsei University College of Medicine, Yongin, Korea
  • 2. Department of Obstetrics and Gynecology, Institute of Women’s Medical Life Science, Severance Hospital, Yonsei University College of Medicine, Seoul, Korea
  • 3. Department of Obstetrics and Gynecology, Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, Korea

Abstract


Objective
The performance of large language models (LLMs) and their potential utility in obstetric and gynecological education remain under debate. This study aimed to contribute to that debate by examining recent advances in LLM technology and their transformative potential in artificial intelligence.
Methods
This study assessed the performance of generative pre-trained transformer (GPT)-3.5 and GPT-4 in understanding clinical information, as well as their potential implications for obstetric and gynecological education. Obstetrics and gynecology residents at three hospitals took an annual promotional examination. Of the 170 questions set over 4 years (2020-2023), the 54 that contained images were excluded and the remaining 116 were analyzed. The scores achieved by GPT-3.5, GPT-4, and the 100 residents were compared.
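The question-selection step described above can be sketched in Python. This is a minimal illustration, not the authors' pipeline: the `Question` record, its `has_image` flag, and the `percent_score` helper are hypothetical names, and the placeholder pool only reproduces the counts reported in the abstract (170 questions, 54 with images, 116 analyzed).

```python
from dataclasses import dataclass


@dataclass
class Question:
    qid: int
    has_image: bool  # hypothetical flag; the abstract only says image questions were excluded


def select_text_only(questions):
    """Keep only the questions that do not contain an image."""
    return [q for q in questions if not q.has_image]


def percent_score(answers, key):
    """Percentage of questions in `key` that `answers` gets right."""
    if not key:
        raise ValueError("answer key is empty")
    correct = sum(1 for qid, right in key.items() if answers.get(qid) == right)
    return 100.0 * correct / len(key)


# Placeholder pool that matches the reported counts, not the real exam items.
pool = [Question(qid=i, has_image=(i < 54)) for i in range(170)]
analyzed = select_text_only(pool)
print(len(analyzed))  # 116
```

Scoring the models and the residents on the same filtered set, as in `percent_score`, is what makes their percentages directly comparable.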
Results
The average scores across all 4 years for GPT-3.5 and GPT-4 were 38.79 (standard deviation [SD], 5.65) and 79.31 (SD, 3.67), respectively. For first-, second-, and third-year residents, the cumulative annual average scores were 79.12 (SD, 9.00), 80.95 (SD, 5.86), and 83.60 (SD, 6.82), respectively. No statistically significant differences were observed between the scores of GPT-4 and those of the residents. On questions specific to obstetrics, the average scores for GPT-3.5 and GPT-4 were 33.44 (SD, 10.18) and 90.22 (SD, 7.68), respectively.
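As a rough descriptive aid, the reported means can be put on a common scale by expressing the gap between the GPT-4 average and each resident group's average in units of that group's SD. The sketch below uses only the figures quoted in the Results; the `standardized_gap` helper is a hypothetical name, and this is not the statistical test used in the study, which the abstract does not specify.

```python
def standardized_gap(score, group_mean, group_sd):
    """Difference between a score and a group mean, in units of the group's SD."""
    return (score - group_mean) / group_sd


gpt4_mean = 79.31  # GPT-4 average across the 4 years, from the abstract

# (mean, SD) per resident group, from the abstract
residents = {
    "first-year": (79.12, 9.00),
    "second-year": (80.95, 5.86),
    "third-year": (83.60, 6.82),
}

gaps = {
    group: round(standardized_gap(gpt4_mean, mean, sd), 2)
    for group, (mean, sd) in residents.items()
}
print(gaps)  # {'first-year': 0.02, 'second-year': -0.28, 'third-year': -0.63}
```

Each gap is smaller than one SD of the corresponding group, which is in line with the reported absence of a statistically significant difference, although it does not replace a formal test.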
Conclusion
GPT-4 performed exceptionally well in obstetrics, in interpreting different types of data, and in problem solving, which points to the potential utility of LLMs in these areas. However, the limitations of LLMs must be acknowledged, and their use should augment human expertise and discernment.

Keywords

Artificial intelligence; Obstetrics; Gynecology; Medical education

Figures

  • Fig. 1 Dataset preparation for model evaluation. GPT, generative pre-trained transformer.

  • Fig. 2 Comparison of the performance of GPT-3.5, GPT-4, and obstetrics and gynecology residents. GPT, generative pre-trained transformer; R, resident.

  • Fig. 3 Comparison of the performance of GPT-4 with overall accuracies according to its subspecialties. GPT, generative pre-trained transformer; SD, standard deviation.

