Impact of test set composition on AI performance in pediatric wrist fracture detection in X-rays - Report - MDSpire

By Tristan Till, Mario Scherkl, Nikolaus Stranger, Georg Singer, Saskia Hankel, Christina Flucher, Franko Hržić, Ivan Štajduhar, Sebastian Tschauner

May 16, 2025

Impact of Test Set Sampling on AI Accuracy for Pediatric Wrist Fracture Detection

Overview

This study shows that test set composition, particularly the balance of case difficulty, significantly influences the performance metrics of AI models that detect pediatric wrist fractures on X-rays. Deep-learning models were evaluated on two test sets, one balanced for difficulty and fracture presence and one randomly sampled, and the two sets yielded notably different accuracy and detection results.

Background

Validating artificial intelligence applications in musculoskeletal imaging, such as fracture detection in pediatric wrist radiographs, depends heavily on the quality and characteristics of the test datasets. Ideally, test sets should be independent of the training data and representative of clinical variability so that results generalize. Previous studies often used randomly selected or consecutive case series without accounting for case difficulty, which may inflate performance metrics by over-representing easily recognizable fractures. This study investigates how sampling strategies based on case complexity affect AI model accuracy.

Data Highlights

Test Set | Total Images | Fracture Cases (%)      | Difficult Cases (%)
Balanced | 4588         | 50%                     | 25%
Random   | 4588         | Similar to full dataset | 6%
Overlap  | 1022         | Not specified           | Not specified

Key Findings

  • The balanced test set was constructed so that 50% of its cases are fractures and half of those fractures are difficult (25% of the set), while the random set reflects the natural distribution, with only 6% difficult cases.
  • Patient demographics such as gender, laterality, and initial examination rates were comparable between the balanced and random test sets.
  • Two standard computer vision algorithm families, including EfficientNet (variants B0 to B7), were trained identically and tested on both sets to evaluate performance differences.
  • Performance metrics varied significantly depending on the test set composition, with models showing reduced accuracy on the balanced set containing more difficult cases.
  • The study highlights that random sampling may overestimate AI performance by underrepresenting challenging fracture cases.
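
The two sampling strategies described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' actual procedure: the record fields (`fracture`, `difficult`), the function names, and the synthetic pool are all hypothetical, and the quotas are simply read off the summary (50% fractures, half of them difficult).

```python
import random

# Assumed quotas, taken from the summary: half of the set are fractures,
# and half of those fractures are difficult (25% of the set).
QUOTAS = {"fracture_difficult": 0.25, "fracture_easy": 0.25, "no_fracture": 0.50}


def stratum(case):
    """Map a case record to one of the three quota strata."""
    if not case["fracture"]:
        return "no_fracture"
    return "fracture_difficult" if case["difficult"] else "fracture_easy"


def sample_random(cases, n, seed=0):
    """Draw n cases uniformly, so the set mirrors the pool's natural mix."""
    return random.Random(seed).sample(cases, n)


def sample_balanced(cases, n, seed=0):
    """Draw n cases with a fixed quota per stratum (n should be divisible by 4)."""
    rng = random.Random(seed)
    by_stratum = {name: [] for name in QUOTAS}
    for case in cases:
        by_stratum[stratum(case)].append(case)
    picked = []
    for name, share in QUOTAS.items():
        # random.sample raises ValueError if a stratum has too few cases.
        picked.extend(rng.sample(by_stratum[name], round(n * share)))
    return picked


def difficult_share(cases):
    """Fraction of the set made up of difficult fracture cases."""
    return sum(stratum(c) == "fracture_difficult" for c in cases) / len(cases)


# Hypothetical pool: 60% fractures, 6% difficult fractures (illustrative only).
pool = [{"id": i, "fracture": i % 10 < 6, "difficult": i % 100 < 6} for i in range(20000)]
balanced_set = sample_balanced(pool, 400)
random_set = sample_random(pool, 400)
```

Comparing `difficult_share(balanced_set)` with `difficult_share(random_set)` shows how far a uniform draw can fall short of the quota-based set when difficult cases are rare in the pool.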

Clinical Implications

Clinicians and developers should recognize that AI model performance metrics are highly dependent on the test set characteristics, especially case difficulty distribution. For meaningful evaluation and regulatory approval, test sets should be carefully curated to include a representative spectrum of fracture complexities. This approach ensures AI tools are robust and clinically useful in detecting subtle or difficult fractures, not just obvious cases.
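
The dependence on case difficulty can be made concrete with a small calculation: overall accuracy is the share-weighted average of the accuracy on each difficulty stratum. The sketch below uses the 6% and 25% difficult-case shares from the summary, but the per-stratum accuracies (0.95 and 0.70) are hypothetical placeholders and not results from the study.

```python
def overall_accuracy(difficult_share, acc_other, acc_difficult):
    """Share-weighted accuracy over difficult cases and all other cases."""
    if not 0.0 <= difficult_share <= 1.0:
        raise ValueError("difficult_share must be between 0 and 1")
    return (1.0 - difficult_share) * acc_other + difficult_share * acc_difficult


# Hypothetical per-stratum accuracies of one fixed model (not study results).
ACC_OTHER = 0.95
ACC_DIFFICULT = 0.70

# Difficult-case shares as reported for the two test sets.
random_acc = overall_accuracy(0.06, ACC_OTHER, ACC_DIFFICULT)
balanced_acc = overall_accuracy(0.25, ACC_OTHER, ACC_DIFFICULT)
```

Under these assumed numbers the same model scores roughly 0.94 on the random set and 0.89 on the balanced set, although nothing about the model has changed; only the mix of cases differs.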

Conclusion

The study confirms that test set sampling strategies significantly impact AI fracture detection performance, underscoring the need for balanced and representative datasets to accurately assess and compare AI models in pediatric wrist radiography.
