Impact of Test Set Sampling on AI Accuracy for Pediatric Wrist Fracture Detection
Overview
This study shows that test set composition, particularly the balance of case difficulty, significantly influences the measured performance of AI models that detect pediatric wrist fractures on X-rays. Two test sets, one balanced for difficulty and fracture presence and one randomly sampled, were used to evaluate deep-learning models, and accuracy and detection performance differed notably between them.
Background
Artificial intelligence applications in musculoskeletal imaging, such as fracture detection in pediatric wrist radiographs, rely heavily on the quality and characteristics of test datasets to validate model performance. Ideally, test sets should be independent and representative of clinical variability to ensure generalizability. Previous studies often used randomly selected or consecutive case series without accounting for case difficulty, which may inflate performance metrics by including easily recognizable fractures. This study investigates how sampling strategies based on case complexity affect AI model accuracy.
Data Highlights
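| Test Set | Total Images | Fracture Cases (%) | Difficult Cases (%) |
|----------|--------------|--------------------|---------------------|
| Balanced | 4588 | 50% | 25% |
| Random | 4588 | Similar to full dataset | 6% |
| Overlap | 1022 | Not specified | Not specified |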
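To make the percentages concrete, the short sketch below converts them into approximate image counts. It assumes that each percentage is a share of the 4588 images in the respective test set; the source does not state the denominators explicitly, and the helper name `implied_count` is introduced here purely for illustration.

```python
# Approximate image counts implied by the reported percentages.
# Assumption: each percentage is a share of the 4588 images in that test set.
TOTAL_IMAGES = 4588


def implied_count(total: int, share: float) -> int:
    """Return the rounded number of images that a share of the total represents."""
    return round(total * share)


balanced_fractures = implied_count(TOTAL_IMAGES, 0.50)  # fracture cases, balanced set
balanced_difficult = implied_count(TOTAL_IMAGES, 0.25)  # difficult cases, balanced set
random_difficult = implied_count(TOTAL_IMAGES, 0.06)    # difficult cases, random set

print(balanced_fractures, balanced_difficult, random_difficult)  # 2294 1147 275
```

Under this assumption, the balanced set contains roughly four times as many difficult cases as the random set, although both sets are the same size.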
Key Findings
The balanced test set was constructed so that 50% of its cases were fractures, half of which were difficult (25% of the set overall), while the random set reflected the natural distribution, with only 6% difficult cases.
Patient demographics such as gender, laterality, and initial examination rates were comparable between the balanced and random test sets.
Two standard computer vision algorithms (EfficientNet variants B0 to B7) were trained identically and tested on both sets to evaluate performance differences.
Performance metrics varied significantly depending on the test set composition, with models showing reduced accuracy on the balanced set containing more difficult cases.
The study highlights that random sampling may overestimate AI performance by underrepresenting challenging fracture cases.
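The two sampling strategies described above can be sketched as follows. This is a minimal illustration rather than the study's actual procedure: the case pool is synthetic, the stratum names and the functions `stratum`, `random_test_set`, and `balanced_test_set` are hypothetical, and the pool's 30% fracture rate is an arbitrary assumption. Only the 50%/25% quotas and the 6% share of difficult cases come from the text.

```python
import random


def stratum(case: dict) -> str:
    """Assign a case to one of three strata used for quota sampling."""
    if not case["fracture"]:
        return "no_fracture"
    return "fracture_difficult" if case["difficult"] else "fracture_easy"


def random_test_set(pool: list, size: int, seed: int = 0) -> list:
    """Draw a simple random sample; it mirrors the pool's natural case mix."""
    return random.Random(seed).sample(pool, size)


def balanced_test_set(pool: list, size: int, quotas: dict, seed: int = 0) -> list:
    """Draw a fixed share of the test set from each stratum."""
    rng = random.Random(seed)
    by_stratum = {}
    for case in pool:
        by_stratum.setdefault(stratum(case), []).append(case)
    selected = []
    for name, share in quotas.items():
        # rng.sample raises ValueError if a stratum has too few cases.
        selected.extend(rng.sample(by_stratum[name], round(size * share)))
    return selected


def share_difficult(cases: list) -> float:
    """Fraction of cases flagged as difficult."""
    return sum(case["difficult"] for case in cases) / len(cases)


# Synthetic pool (hypothetical): 30% fractures, 6% of all cases are difficult fractures.
pool = [
    {"id": i, "fracture": i % 100 < 30, "difficult": i % 100 < 6}
    for i in range(10_000)
]

quotas = {"fracture_difficult": 0.25, "fracture_easy": 0.25, "no_fracture": 0.50}
balanced = balanced_test_set(pool, 1000, quotas)
randomly_drawn = random_test_set(pool, 1000)

print(f"balanced: {share_difficult(balanced):.2%} difficult")
print(f"random:   {share_difficult(randomly_drawn):.2%} difficult")
```

The balanced sampler fixes the share of each stratum by design, whereas the random sampler inherits whatever mix the pool happens to have, so rare difficult cases remain rare in the resulting test set.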
Clinical Implications
Clinicians and developers should recognize that AI model performance metrics are highly dependent on the test set characteristics, especially case difficulty distribution. For meaningful evaluation and regulatory approval, test sets should be carefully curated to include a representative spectrum of fracture complexities. This approach ensures AI tools are robust and clinically useful in detecting subtle or difficult fractures, not just obvious cases.
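The dependence on case mix follows from simple arithmetic: overall accuracy is the average of per-stratum accuracies weighted by each stratum's share of the test set. The sketch below illustrates this with the 6% and 25% shares of difficult cases from the text. The per-stratum accuracies (0.70 and 0.95) are hypothetical values chosen only for illustration and are not results from the study; the function name `overall_accuracy` is likewise an assumption.

```python
def overall_accuracy(acc_by_stratum: dict, share_by_stratum: dict) -> float:
    """Weighted average of per-stratum accuracies; shares must sum to 1."""
    if abs(sum(share_by_stratum.values()) - 1.0) > 1e-9:
        raise ValueError("stratum shares must sum to 1")
    return sum(acc_by_stratum[name] * share for name, share in share_by_stratum.items())


# Hypothetical per-stratum accuracies (illustrative only, NOT study results).
acc = {"difficult": 0.70, "other": 0.95}

# Shares of difficult cases as reported for the two test sets.
random_mix = {"difficult": 0.06, "other": 0.94}
balanced_mix = {"difficult": 0.25, "other": 0.75}

random_acc = overall_accuracy(acc, random_mix)
balanced_acc = overall_accuracy(acc, balanced_mix)

print(f"random:   {random_acc:.4f}")    # 0.9350
print(f"balanced: {balanced_acc:.4f}")  # 0.8875
```

With these assumed values, the same model scores several points lower on the balanced mix even though nothing about the model has changed. Reporting accuracy per difficulty stratum alongside the overall figure makes results easier to compare across test sets.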
Conclusion
The study confirms that test set sampling strategies significantly impact AI fracture detection performance, underscoring the need for balanced and representative datasets to accurately assess and compare AI models in pediatric wrist radiography.