Ultrasonography.  2025 May;44(3):220-231. 10.14366/usg.25012.

Diagnostic performance of multimodal large language models in radiological quiz cases: the effects of prompt engineering and input conditions

Affiliations
  • 1Department of Radiology and Center for Imaging Sciences, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Korea

Abstract

Purpose
This study aimed to evaluate the diagnostic accuracy of three multimodal large language models (LLMs) in radiological image interpretation and to assess the impact of prompt engineering strategies and input conditions.
Methods
This study analyzed 67 radiological quiz cases from the Korean Society of Ultrasound in Medicine. Three multimodal LLMs (Claude 3.5 Sonnet, GPT-4o, and Gemini-1.5-Pro-002) were evaluated using six types of prompts (basic [without system prompt], original [specific instructions], chain-of-thought, reflection, multiagent, and artificial intelligence [AI]–generated). Performance was assessed across various factors, including tumor versus non-tumor status, case rarity, difficulty, and knowledge cutoff dates. A subgroup analysis compared diagnostic accuracy between imaging-only inputs and combined imaging-descriptive text inputs.
Results
With imaging-only inputs, Claude 3.5 Sonnet achieved the highest overall accuracy (46.3%, 186/402), followed by GPT-4o (43.5%, 175/402) and Gemini-1.5-Pro-002 (39.8%, 160/402). AI-generated prompts yielded superior combined accuracy across all three models, with significant improvements over the basic (7.96%, P=0.009), chain-of-thought (6.47%, P=0.029), and multiagent prompts (5.97%, P=0.043). The integration of descriptive text significantly enhanced diagnostic accuracy for Claude 3.5 Sonnet (46.3% to 66.2%, P<0.001), GPT-4o (43.5% to 57.5%, P<0.001), and Gemini-1.5-Pro-002 (39.8% to 60.4%, P<0.001). Model performance was significantly influenced by case rarity for GPT-4o (rare: 6.7% vs. nonrare: 53.9%, P=0.001) and by knowledge cutoff dates for Claude 3.5 Sonnet (post-cutoff: 23.5% vs. pre-cutoff: 64.0%, P=0.005).
Conclusion
Claude 3.5 Sonnet achieved the highest diagnostic accuracy in radiological quiz cases, followed by GPT-4o and Gemini-1.5-Pro-002. The use of AI-generated prompts and the integration of descriptive text inputs enhanced model performance.

Keyword

Artificial intelligence; Natural language processing; Diagnostic imaging; Medical informatics
Full Text Links
  • USG
Actions
Cited
CITED
export Copy
Close
Share
  • Twitter
  • Facebook
Similar articles
Copyright © 2025 by Korean Association of Medical Journal Editors. All rights reserved.     E-mail: koreamed@kamje.or.kr