Korean J Radiol. 2023 Feb;24(2):155-165. doi: 10.3348/kjr.2022.0548.

Effects of Expert-Determined Reference Standards in Evaluating the Diagnostic Performance of a Deep Learning Model: A Malignant Lung Nodule Detection Task on Chest Radiographs

Affiliations
  • 1Institute of Medical and Biological Engineering, Medical Research Center, Seoul National University, Seoul, Korea
  • 2Mathematical Institute, University of Oxford, United Kingdom
  • 3Department of Radiology, Seoul National University Hospital, Seoul, Korea
  • 4Department of Radiology, Seoul National University College of Medicine, Seoul, Korea
  • 5Institute of Radiation Medicine, Medical Research Center, Seoul National University, Seoul, Korea

Abstract


Objective
Little is known about the effects of using different expert-determined reference standards when evaluating the performance of deep learning-based automatic detection (DLAD) models and their added value to radiologists. We assessed the concordance of expert-determined standards with a clinical gold standard (herein, pathological confirmation) and the effects of different expert-determined reference standards on estimates of radiologists’ diagnostic performance in detecting malignant pulmonary nodules on chest radiographs with and without the assistance of a DLAD model.
Materials and Methods
This study included chest radiographs from 50 patients with pathologically proven lung cancer and 50 controls. Five expert-determined standards were constructed from the interpretations of 10 experts: the individual judgment of the most experienced expert, a majority vote, consensus judgments of two and three experts, and a latent class analysis (LCA) model. In separate reader tests, an additional 10 radiologists independently interpreted the radiographs and then re-interpreted them with the assistance of the DLAD model. Their diagnostic performance was estimated using the clinical gold standard and each expert-determined standard as the reference standard, and the results were compared using t tests with Bonferroni correction. A minimal sketch of how two such expert-determined standards can be operationalized is shown below.
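The following Python sketch is purely illustrative and is not the authors' pipeline: it simulates a matrix of 10 expert reads over 100 radiographs, builds majority-vote and two-expert-consensus reference standards, and measures their agreement with a pathological gold standard. The simulated read probabilities, array shapes, and the helper `sens_spec` are all assumptions for demonstration.

```python
import numpy as np

# Assumed, illustrative data: 10 expert reads (1 = malignant nodule present) on
# 100 radiographs, 50 cancers followed by 50 controls.
rng = np.random.default_rng(seed=0)
n_experts, n_cases = 10, 100
gold = np.r_[np.ones(50, dtype=int), np.zeros(50, dtype=int)]  # pathological gold standard

# Assumption for the simulation: each expert detects ~70% of cancers and
# rarely calls a control positive.
p_positive = np.where(gold == 1, 0.7, 0.05)
expert_reads = (rng.random((n_experts, n_cases)) < p_positive).astype(int)

# Majority-vote standard: positive when more than half of the experts agree.
majority_ref = (expert_reads.sum(axis=0) > n_experts // 2).astype(int)

# Two-expert consensus standard: positive only when both designated experts agree.
consensus2_ref = expert_reads[0] & expert_reads[1]

def sens_spec(ref, gold):
    """Sensitivity and specificity of a candidate reference against the gold standard."""
    tp = np.sum((ref == 1) & (gold == 1))
    fn = np.sum((ref == 0) & (gold == 1))
    tn = np.sum((ref == 0) & (gold == 0))
    fp = np.sum((ref == 1) & (gold == 0))
    return tp / (tp + fn), tn / (tn + fp)

print("majority vote     :", sens_spec(majority_ref, gold))
print("2-expert consensus:", sens_spec(consensus2_ref, gold))
```

The LCA-based standard used in the study would additionally require fitting a latent class model (e.g., by expectation-maximization over the expert reads), which is omitted here for brevity.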
Results
The LCA model (sensitivity, 72.6%; specificity, 100%) was most similar to the clinical gold standard. When expert-determined standards were used, the sensitivities of the radiologists and the standalone DLAD model were overestimated and their specificities were underestimated (all p-values < 0.05). DLAD assistance diminished the overestimation of sensitivity but exaggerated the underestimation of specificity (all p-values < 0.001). The DLAD model improved sensitivity and specificity to a greater extent when the clinical gold standard was used than when the expert-determined standards were used (all p-values < 0.001), except for sensitivity with the LCA model (p = 0.094).
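As an illustration of the statistical comparison described above, the sketch below applies a paired t test with Bonferroni correction to per-reader sensitivity estimates obtained under two different reference standards. The sensitivity values and the number of comparisons are invented for demonstration and do not reproduce the study's results.

```python
import numpy as np
from scipy import stats

# Assumed, illustrative per-reader sensitivities for 10 readers, estimated
# against the clinical gold standard vs. an expert-determined standard.
sens_vs_gold   = np.array([0.62, 0.58, 0.66, 0.60, 0.64, 0.59, 0.63, 0.61, 0.65, 0.57])
sens_vs_expert = np.array([0.74, 0.70, 0.78, 0.73, 0.76, 0.71, 0.75, 0.72, 0.77, 0.69])

# Paired t test across the same 10 readers.
t_stat, p_raw = stats.ttest_rel(sens_vs_expert, sens_vs_gold)

# Bonferroni correction: multiply the raw p-value by the number of comparisons
# (here assumed to be one test per expert-determined standard).
n_comparisons = 4
p_bonferroni = min(p_raw * n_comparisons, 1.0)
print(f"t = {t_stat:.2f}, raw p = {p_raw:.4g}, Bonferroni-adjusted p = {p_bonferroni:.4g}")
```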
Conclusion
The LCA model was most similar to the clinical gold standard for malignant pulmonary nodule detection on chest radiographs. Expert-determined standards caused bias in measuring the diagnostic performance of the artificial intelligence model.

Keywords

Deep-learning; Reference standard; Expert-determined standard; Decision-support tool; Chest radiographs