Impact of test set composition on AI performance in pediatric wrist fracture detection in X-rays

By
Tristan Till
Mario Scherkl
Nikolaus Stranger
Georg Singer
Saskia Hankel
Christina Flucher
Franko Hržić
Ivan Štajduhar
Sebastian Tschauner
May 16, 2025

European Radiology

Overview

This study shows that test set composition, particularly the balance of case difficulty, significantly influences the performance metrics of AI models that detect pediatric wrist fractures on X-rays. Two test sets, one balanced for difficulty and fracture presence and one randomly sampled, were used to evaluate deep-learning models, revealing notable differences in accuracy and detection capability.

Background

Artificial intelligence applications in musculoskeletal imaging, such as fracture detection in pediatric wrist radiographs, depend on the quality and characteristics of the test datasets used to validate model performance. Ideally, test sets should be independent of the training data and representative of clinical variability so that results generalize. Previous studies often used randomly selected or consecutive case series without accounting for case difficulty, which may inflate performance metrics because such samples are dominated by easily recognizable fractures. This study investigates how sampling strategies based on case complexity affect AI model accuracy.
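
The inflation effect can be illustrated with simple arithmetic: overall accuracy is a weighted average of the accuracy on easy and on difficult cases, so the share of difficult cases shifts the headline number even when the model is unchanged. The following is a minimal sketch, not the study's analysis. The function name and the per-stratum accuracies (0.95 and 0.75) are hypothetical values chosen for illustration; only the difficult-case shares (6% and 25%) come from this summary.

```python
def pooled_accuracy(frac_difficult, acc_easy, acc_difficult):
    """Overall accuracy as a weighted average over two difficulty strata."""
    if not 0.0 <= frac_difficult <= 1.0:
        raise ValueError("frac_difficult must be between 0 and 1")
    return frac_difficult * acc_difficult + (1.0 - frac_difficult) * acc_easy


# Hypothetical per-stratum accuracies (NOT results from the study).
ACC_EASY = 0.95
ACC_DIFFICULT = 0.75

# Difficult-case shares reported for the random and balanced test sets.
random_set = pooled_accuracy(0.06, ACC_EASY, ACC_DIFFICULT)
balanced_set = pooled_accuracy(0.25, ACC_EASY, ACC_DIFFICULT)

print(f"random set:   {random_set:.3f}")
print(f"balanced set: {balanced_set:.3f}")
```

Under these assumed values the same model would score about 0.94 on the random set and 0.90 on the balanced set, purely because of the case mix.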

Data Highlights

Test Set	Total Images	Fracture Cases (%)	Difficult Cases (%)
Balanced	4588	50%	25%
Random	4588	Similar to full dataset	6%
Overlap	1022	Not specified	Not specified

Key Findings

The balanced test set was constructed so that 50% of its images showed fractures, half of which were difficult cases (25% of the set), while the random set reflected the natural distribution with only 6% difficult cases.
Patient demographics such as gender, laterality, and initial examination rates were comparable between the balanced and random test sets.
Two standard computer vision algorithms (EfficientNet variants B0 to B7) were trained identically and tested on both sets to evaluate performance differences.
Performance metrics varied significantly depending on the test set composition, with models showing reduced accuracy on the balanced set containing more difficult cases.
The study highlights that random sampling may overestimate AI performance by underrepresenting challenging fracture cases.
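
The two sampling strategies described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' procedure: the case pool, its counts, the field names (`id`, `fracture`, `difficult`), and all function names are hypothetical, and the difficulty flag is assumed to apply to fracture cases only, following the wording of this summary.

```python
import random


def make_pool(n_none, n_easy, n_hard):
    """Build a toy case pool; all counts are hypothetical, not study data."""
    spec = [(n_none, False, False), (n_easy, True, False), (n_hard, True, True)]
    pool = []
    for count, fracture, difficult in spec:
        for _ in range(count):
            pool.append({"id": len(pool), "fracture": fracture, "difficult": difficult})
    return pool


def random_test_set(cases, n, seed):
    """Draw n cases uniformly at random, keeping the natural distribution."""
    return random.Random(seed).sample(cases, n)


def balanced_test_set(cases, n, seed):
    """Draw 50% non-fracture, 25% easy-fracture and 25% difficult-fracture cases."""
    rng = random.Random(seed)
    no_fracture = [c for c in cases if not c["fracture"]]
    easy = [c for c in cases if c["fracture"] and not c["difficult"]]
    hard = [c for c in cases if c["fracture"] and c["difficult"]]
    n_hard = n // 4
    n_easy = n // 2 - n_hard
    n_none = n - n_hard - n_easy
    # random.sample raises ValueError if a stratum holds too few cases.
    picked = (rng.sample(hard, n_hard) + rng.sample(easy, n_easy)
              + rng.sample(no_fracture, n_none))
    rng.shuffle(picked)
    return picked


def share(cases, key):
    """Fraction of cases for which the boolean field `key` is true."""
    return sum(c[key] for c in cases) / len(cases)


pool = make_pool(n_none=4000, n_easy=5400, n_hard=600)  # 6% difficult overall
balanced = balanced_test_set(pool, n=400, seed=1)
randomly_drawn = random_test_set(pool, n=400, seed=2)
overlap = {c["id"] for c in balanced} & {c["id"] for c in randomly_drawn}

print("balanced:", share(balanced, "fracture"), share(balanced, "difficult"))
print("random:  ", share(randomly_drawn, "fracture"), share(randomly_drawn, "difficult"))
print("images in both sets:", len(overlap))
```

Because both sets are drawn from the same pool, some images can land in both, which corresponds to the overlap row in the table above. The balanced sampler fixes the composition by construction, whereas the random sampler only approximates the pool's natural proportions.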

Clinical Implications

Clinicians and developers should recognize that AI model performance metrics are highly dependent on the test set characteristics, especially case difficulty distribution. For meaningful evaluation and regulatory approval, test sets should be carefully curated to include a representative spectrum of fracture complexities. This approach ensures AI tools are robust and clinically useful in detecting subtle or difficult fractures, not just obvious cases.
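
One way to make the dependence on case mix visible is to report metrics per difficulty stratum alongside the pooled figure. The sketch below is a minimal example of such a report, assuming binary labels (1 = fracture) and three hypothetical stratum names; the toy records are invented for illustration and are not data from the study.

```python
from collections import defaultdict


def stratified_metrics(records):
    """Accuracy, sensitivity and specificity overall and per stratum.

    Each record is a tuple (y_true, y_pred, stratum) with binary labels.
    """
    counts = defaultdict(lambda: {"tp": 0, "tn": 0, "fp": 0, "fn": 0})
    for y_true, y_pred, stratum in records:
        if y_true == 1:
            cell = "tp" if y_pred == 1 else "fn"
        else:
            cell = "tn" if y_pred == 0 else "fp"
        counts[stratum][cell] += 1
        counts["overall"][cell] += 1

    def ratio(num, den):
        # None marks a metric that is undefined for the stratum.
        return num / den if den else None

    report = {}
    for stratum, c in counts.items():
        total = c["tp"] + c["tn"] + c["fp"] + c["fn"]
        report[stratum] = {
            "n": total,
            "accuracy": ratio(c["tp"] + c["tn"], total),
            "sensitivity": ratio(c["tp"], c["tp"] + c["fn"]),
            "specificity": ratio(c["tn"], c["tn"] + c["fp"]),
        }
    return report


# Toy predictions (hypothetical): (true label, predicted label, stratum).
records = (
    [(1, 1, "easy_fracture")] * 9 + [(1, 0, "easy_fracture")] * 1
    + [(1, 1, "difficult_fracture")] * 6 + [(1, 0, "difficult_fracture")] * 4
    + [(0, 0, "no_fracture")] * 18 + [(0, 1, "no_fracture")] * 2
)

report = stratified_metrics(records)
for stratum, metrics in report.items():
    print(stratum, metrics)
```

In this toy data the pooled sensitivity of 0.75 hides a gap between easy fractures (0.9) and difficult ones (0.6), which is the kind of difference a single headline metric on a randomly sampled test set can mask.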

Conclusion

The study confirms that test set sampling strategies significantly impact AI fracture detection performance, underscoring the need for balanced and representative datasets to accurately assess and compare AI models in pediatric wrist radiography.
