Objective:
To establish a rigorous, leakage-free benchmarking framework for binary classification of breast cancer histopathology images using deep learning models.
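In this setting, "leakage-free" usually means that images from one patient never appear on both the training and the test side of a split. The summary does not describe the exact splitting procedure, so the following is only a minimal sketch of one way to build patient-level folds. The function name `patient_level_folds` and the patient IDs are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Return (train_idx, test_idx) pairs in which no patient is on both sides.

    patient_ids[i] is the patient that image i belongs to.
    This is an illustrative assumption, not the study's documented procedure.
    """
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    patients = sorted(by_patient)
    if n_folds > len(patients):
        raise ValueError("n_folds cannot exceed the number of patients")
    random.Random(seed).shuffle(patients)

    folds = []
    for k in range(n_folds):
        # Every n_folds-th patient (starting at k) forms the held-out group.
        test_patients = set(patients[k::n_folds])
        test_idx = sorted(i for p in test_patients for i in by_patient[p])
        train_idx = sorted(
            i for p in patients if p not in test_patients for i in by_patient[p]
        )
        folds.append((train_idx, test_idx))
    return folds


# Hypothetical mapping from image index to patient ID.
image_patients = ["P01", "P01", "P02", "P03", "P03", "P03", "P04", "P05", "P05", "P06"]
folds = patient_level_folds(image_patients, n_folds=3)

for train_idx, test_idx in folds:
    train_patients = {image_patients[i] for i in train_idx}
    test_patients = {image_patients[i] for i in test_idx}
    # The split is leakage-free at the patient level if these sets are disjoint.
    assert train_patients.isdisjoint(test_patients)
```

Splitting by patient rather than by image matters because several images of the same patient, often at different magnifications, share tissue characteristics. A random image-level split can therefore inflate test accuracy.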
Key Findings:
All architectures achieved comparable performance, with mean accuracies between 0.91 and 0.93.
ResNet50 had the highest mean accuracy (0.9267 ± 0.0435) and F1-score (0.9472).
No statistically significant differences were found among models (p > 0.05 after correction).
Lower and intermediate magnifications (40× and 200×) provided more discriminative features than the highest magnification (400×).
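The findings above report "no statistically significant differences after correction" without naming the test or the correction method. As a hedged illustration only, the sketch below compares models pairwise on per-fold accuracies with an exact paired sign-flip permutation test and then applies a Holm step-down correction. Both choices are assumptions. The model names and accuracy values are made-up placeholders, not results from the study.

```python
from itertools import combinations, product


def paired_sign_flip_p(a, b):
    """Exact two-sided sign-flip permutation p-value for paired scores."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            count += 1
    return count / total


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


# Made-up per-fold accuracies for three placeholder models (not study data).
fold_accuracy = {
    "model_a": [0.93, 0.90, 0.95, 0.91, 0.94],
    "model_b": [0.92, 0.91, 0.93, 0.90, 0.95],
    "model_c": [0.91, 0.89, 0.94, 0.92, 0.93],
}

pairs = list(combinations(fold_accuracy, 2))
raw_p = [paired_sign_flip_p(fold_accuracy[m1], fold_accuracy[m2]) for m1, m2 in pairs]
adj_p = holm_adjust(raw_p)

for (m1, m2), p_raw, p_adj in zip(pairs, raw_p, adj_p):
    print(f"{m1} vs {m2}: raw p = {p_raw:.3f}, Holm-adjusted p = {p_adj:.3f}")
```

With five paired folds, the smallest two-sided p-value this exact test can produce is 2/32 = 0.0625. A small number of folds therefore limits the power to detect differences, whichever correction is applied.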
Interpretation:
Architectural differences among modern deep learning models did not lead to significant performance differences in this benchmark; a sound evaluation design is crucial for reliable outcomes.
Limitations:
The study is limited to the BreaKHis dataset, so its findings may not generalize to other histopathological contexts.
Only binary classification was addressed, which limits applicability to multi-class scenarios.
Conclusion:
The proposed patient-aware benchmarking framework enhances reproducibility and supports the development of clinically translatable AI systems for breast cancer diagnosis.