Pediatr Emerg Med J. 2025 Apr;12(2):62-72. doi: 10.22470/pemj.2024.01074.

Accuracy, appropriateness, and readability of ChatGPT-4 and ChatGPT-3.5 in answering pediatric emergency medicine post-discharge questions

Affiliations
  • 1Department of Diagnostic Medicine, Dell Medical School, The University of Texas at Austin, Austin, TX, USA
  • 2Undergraduate Program, The University of Texas at Austin, Austin, TX, USA
  • 3Department of Pediatric Emergency Medicine, Dell Medical School, The University of Texas at Austin, Austin, TX, USA

Abstract

Purpose
Large language models (LLMs) such as ChatGPT (OpenAI) are increasingly used in healthcare, raising questions about their accuracy and reliability as sources of medical information. This study compared 2 versions of ChatGPT in answering post-discharge follow-up questions in pediatric emergency medicine (PEM).
Methods
Twenty-three common post-discharge questions were posed to ChatGPT-4 and -3.5, with responses generated before and after a request to simplify the language. Two blinded PEM physicians evaluated appropriateness and accuracy as the primary endpoints. Secondary endpoints included word count and readability. Six established readability measures were averaged: the Automated Readability Index, Gunning Fog Index, Flesch-Kincaid Grade Level, Coleman-Liau Index, Simple Measure of Gobbledygook Grade Level, and Flesch Reading Ease. The t-test and Cohen's kappa were used to assess differences and inter-rater agreement, respectively.
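For reference, two of the readability measures named above and Cohen's kappa are defined by standard published formulas. The Python sketch below is illustrative only; it is not the authors' instrument (the abstract does not specify their tooling), and the syllable counter is a rough heuristic.

```python
# Illustrative sketch: Flesch-Kincaid Grade Level, Flesch Reading Ease,
# and Cohen's kappa, using their standard published formulas.
import re

def count_syllables(word: str) -> int:
    # Crude vowel-group heuristic; dedicated readability tools use better rules.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def _text_stats(text: str):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return sentences, max(1, len(words)), syllables

def flesch_kincaid_grade(text: str) -> float:
    # FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    s, w, syl = _text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59

def flesch_reading_ease(text: str) -> float:
    # FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    s, w, syl = _text_stats(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    # kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in labels)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

if __name__ == "__main__":
    sample = ("Give your child plenty of fluids. Return to the emergency "
              "department if the fever lasts more than three days.")
    print(round(flesch_kincaid_grade(sample), 1))
    print(round(flesch_reading_ease(sample), 1))
    print(round(cohens_kappa(["appropriate", "appropriate", "inappropriate"],
                             ["appropriate", "inappropriate", "inappropriate"]), 2))
```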
Results
The physician evaluations showed high appropriateness for both default responses (ChatGPT-4, 91.3%-100% vs. ChatGPT-3.5, 91.3%) and simplified responses (both 87.0%-91.3%). Accuracy was also high for default (87.0%-95.7% vs. 87.0%-91.3%) and simplified responses (both 82.6%-91.3%). Inter-rater agreement was fair overall (κ = 0.37; P < 0.001). For default responses, ChatGPT-4 produced longer outputs than ChatGPT-3.5 (233.0 ± 97.1 vs. 199.6 ± 94.7 words; P = 0.043), with similar readability (13.3 ± 1.9 vs. 13.5 ± 1.8; P = 0.404). After simplification, both LLMs improved in word count and readability (P < 0.001), with ChatGPT-4 achieving readability suitable for eighth-grade students in the United States (7.7 ± 1.3 vs. 8.2 ± 1.5; P = 0.027).
Conclusion
The responses of ChatGPT-4 and -3.5 to post-discharge questions were deemed appropriate and accurate by the PEM physicians. While ChatGPT-4 showed an edge in simplifying language, neither LLM consistently met the recommended sixth-grade reading level. These findings suggest that LLMs have potential for communicating post-discharge information to guardians.

Keywords

Artificial Intelligence; Patient Discharge; Patient Education as Topic; Pediatric Emergency Medicine; Language
