Diagnostic accuracy of deep learning vs. human raters for detecting osteoporotic vertebral compression fractures in routine CT scans

By
Evamaria O. Riedel
David Schinz
Matthias Keicher
Sebastian Rühling
Malek El Husseini
Chantal Pellegrini
Thomas Baum
Michael Dieckmeyer
Luca Malagutti
Isabel Seeger
Anna S. Walburga
Benedikt Wiestler
Nico Sollmann
Maximilian T. Löffler
Arthur Wagner
Jan S. Kirschke
February 24, 2026

European Radiology

Overview

This study compared the diagnostic accuracy of four in-house deep learning (DL) models, one commercial DL algorithm, and eight human raters in identifying osteoporotic vertebral compression fractures on routine CT scans. On a dataset of 3548 vertebrae from 331 patients, the DL models performed comparably to or better than human evaluators across fracture severity levels and spinal regions.

Background

Osteoporosis leads to fragile bones and an increased risk of vertebral compression fractures, which contribute substantially to morbidity and mortality. Early and accurate detection of these fractures is critical for timely treatment initiation. While CT imaging is valuable for assessing bone quality, diagnosing mild osteoporotic fractures remains challenging because degenerative changes and anatomical variations can mimic or mask them. Deep learning algorithms have emerged as promising tools to enhance fracture detection by analyzing subtle imaging patterns that human readers can overlook.

Data Highlights

Evaluator Type	Number of Evaluators/Models	Dataset Vertebrae	Fracture Prevalence (%)	CT Acquisition Details
Deep Learning Models	4 in-house + 1 commercial	3548 vertebrae (331 patients)	10.6% any fracture; 9.1% moderate/severe (Genant 2 or 3)	120 kVp, slice thickness 0.9-1.5 mm, bone kernel reconstruction
Human Raters	8 (students, residents, attendings)	Same as above	Same as above	Same as above
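
For orientation, the percentages in the table can be turned into approximate counts. The sketch below is only arithmetic on the figures reported above. It assumes the prevalence values are per vertebra (the table does not say so explicitly), and because the percentages are rounded, the resulting counts are estimates and not figures from the study. All variable and function names are illustrative.

```python
# Figures taken from the table above.
TOTAL_VERTEBRAE = 3548
TOTAL_PATIENTS = 331
ANY_FRACTURE_RATE = 0.106        # Genant 1-3
MODERATE_SEVERE_RATE = 0.091     # Genant 2 or 3


def approx_count(total: int, rate: float) -> int:
    """Estimate a count from a rounded prevalence (assumed per vertebra)."""
    return round(total * rate)


any_fracture = approx_count(TOTAL_VERTEBRAE, ANY_FRACTURE_RATE)
moderate_severe = approx_count(TOTAL_VERTEBRAE, MODERATE_SEVERE_RATE)
# Mild (Genant 1) fractures are what remains after removing grades 2 and 3.
mild = any_fracture - moderate_severe
vertebrae_per_patient = TOTAL_VERTEBRAE / TOTAL_PATIENTS

print(f"any fracture:      ~{any_fracture} vertebrae")
print(f"moderate/severe:   ~{moderate_severe} vertebrae")
print(f"mild (Genant 1):   ~{mild} vertebrae")
print(f"vertebrae/patient: ~{vertebrae_per_patient:.1f}")
```

Under that per-vertebra assumption, mild fractures make up only a small share of the test set, which is worth keeping in mind when reading results for the "any fracture" threshold.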

Key Findings

Deep learning models were trained on large, heterogeneous CT datasets from multi-scanner environments with diverse acquisition parameters, which is intended to make them robust to scanner and protocol differences.
Evaluation used the independent VerSe 19 & 20 datasets, containing routine clinical CT scans with a broad spectrum of spinal pathologies and anatomical variations.
DL algorithms and human raters were assessed on fracture detection at patient level, single vertebra level, and by spinal region (upper thoracic, lower thoracic, lumbar).
DL models showed comparable or superior accuracy to human evaluators in detecting any fracture (Genant 1–3) and clinically relevant moderate/severe fractures (Genant 2 or 3).
Region-specific analysis accounted for varying fracture prevalence and demonstrated consistent DL performance across spinal regions.
Inclusion of degenerative changes and other osseous alterations in the test set challenged both DL and human raters, highlighting the clinical relevance of the evaluation.
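
The three evaluation levels and two severity thresholds described above can be sketched in a few lines. This is a minimal illustration, not the study's evaluation code. The `Vertebra` record, the function names, and the rule that a patient counts as positive when any of their vertebrae is positive are assumptions made for the example. Region labels are passed in as given, since the summary does not define the region boundaries.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertebra:
    """Hypothetical record for one vertebra in the test set."""
    patient_id: str
    region: str               # e.g. "upper thoracic", "lower thoracic", "lumbar"
    genant_true: int          # reference Genant grade: 0 (none) to 3 (severe)
    predicted_positive: bool  # call made by a model or a human rater


def sens_spec(pairs):
    """Return (sensitivity, specificity) from (truth, prediction) booleans."""
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


def vertebra_level(vertebrae, min_grade=1):
    """min_grade=1: any fracture; min_grade=2: moderate/severe only."""
    return sens_spec(
        [(v.genant_true >= min_grade, v.predicted_positive) for v in vertebrae]
    )


def patient_level(vertebrae, min_grade=1):
    """Assumed rule: a patient is positive if any vertebra is positive."""
    truth = defaultdict(bool)
    pred = defaultdict(bool)
    for v in vertebrae:
        truth[v.patient_id] |= v.genant_true >= min_grade
        pred[v.patient_id] |= v.predicted_positive
    return sens_spec([(truth[pid], pred[pid]) for pid in truth])


def by_region(vertebrae, min_grade=1):
    """Vertebra-level metrics computed separately for each region label."""
    groups = defaultdict(list)
    for v in vertebrae:
        groups[v.region].append(v)
    return {
        region: vertebra_level(items, min_grade)
        for region, items in groups.items()
    }
```

Calling `vertebra_level(records, min_grade=2)` restricts the positive class to Genant 2 or 3. A region with no fractured vertebrae returns `nan` for sensitivity, which is one reason region-specific results need to be read alongside regional prevalence.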

Clinical Implications

Deep learning algorithms can serve as effective adjuncts to human readers in routine CT imaging for osteoporotic vertebral fracture detection, potentially improving diagnostic consistency and early identification. Their ability to analyze subtle imaging features may reduce missed fractures, especially mild ones, facilitating timely intervention. Integration of DL tools into clinical workflows could enhance reporting accuracy and patient management.

Conclusion

The study demonstrates that deep learning models achieve diagnostic performance comparable to that of human evaluators, who ranged from students to attendings, in identifying osteoporotic vertebral compression fractures on routine CT scans. These findings support the clinical utility of DL algorithms as complementary tools to improve fracture detection and, potentially, patient outcomes.

Original Source(s)

Diagnostic accuracy of deep learning vs. human raters for detecting osteoporotic vertebral compression fractures in routine CT scans