Evaluation of AI Model Version Updates for Mammogram Analysis in BreastScreen Norway
Overview
This study compared two versions of a commercial AI model for mammography interpretation in a large national screening program. The updated version assigned malignancy risk scores differently from its predecessor, which changed screening performance metrics, including screen-detected and interval cancer rates. The study also analyzed tumor characteristics and mammographic features in relation to the AI risk scores.
Background
Breast cancer screening with mammography is evolving with the integration of artificial intelligence (AI) to improve interpretation accuracy and outcomes. Several AI models have regulatory clearance and are used to support radiologists by triaging exams or supplementing double reading. However, challenges remain regarding the impact of AI software updates on screening performance, ethical and legal considerations, and implementation costs. Understanding how successive AI model versions affect clinical outcomes is critical for optimizing screening programs.
Data Highlights
Parameter | Version 1.7 | Version 2.1
Number of Screening Exams | ~117,709 | ~117,709
AI Risk Score Categories | 1–7 low, 8–9 intermediate, 10 high | Same categorization
Screen-Detected Cancer Definition | Invasive cancer or DCIS after recall | Same
Interval Cancer Definition | Invasive cancer or DCIS within 24 months post-negative screen | Same
AI Model Updates | Original algorithm | Architectural changes, expanded training data, updated sampling
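The risk score categories listed above can be written as a small lookup. This is a minimal Python sketch, not code from the study or the AI vendor: the function name risk_category is hypothetical, and it assumes scores are integers from 1 to 10.

```python
def risk_category(score: int) -> str:
    """Map an AI risk score (1-10) to low, intermediate, or high.

    Thresholds follow the categories listed above:
    1-7 low, 8-9 intermediate, 10 high.
    The function name is hypothetical, for illustration only.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score <= 7:
        return "low"
    if score <= 9:
        return "intermediate"
    return "high"


# Example: categorize a few scores
categories = [risk_category(s) for s in (1, 7, 8, 9, 10)]
```

Because both versions use the same categorization, the same function would apply to scores from either version.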
Key Findings
Version 2.1 of the AI model incorporated changes to the algorithm's architecture, an updated sampling approach, and expanded training data from multiple vendors and centers.
AI malignancy risk scores from version 2.1 differed from version 1.7, potentially altering thresholds for suspicious findings.
Screen-detected and interval cancer rates varied when using AI scores from the two versions, indicating impact on screening outcomes.
Histopathological tumor characteristics and mammographic features were distributed differently across AI risk scores in the two versions.
Version 2.1 demonstrated improved sensitivity in detecting malignancies compared to human readers and prior AI versions.
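One way to see how a version update moves exams between risk categories is to cross-tabulate the category each exam receives under the old and new versions. This is a minimal Python sketch under stated assumptions: the function names and the example scores are made up for illustration, and they are not the study's data or method.

```python
from collections import Counter


def risk_category(score):
    """1-7 low, 8-9 intermediate, 10 high (hypothetical helper)."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score <= 7:
        return "low"
    if score <= 9:
        return "intermediate"
    return "high"


def category_shifts(scores_old, scores_new):
    """Count exams by (old-version category, new-version category).

    Both lists must hold scores for the same exams in the same order.
    """
    if len(scores_old) != len(scores_new):
        raise ValueError("score lists must have the same length")
    return Counter(
        (risk_category(old), risk_category(new))
        for old, new in zip(scores_old, scores_new)
    )


# Made-up scores for five exams, for illustration only
old_scores = [3, 8, 10, 7, 9]
new_scores = [2, 10, 10, 8, 5]
shifts = category_shifts(old_scores, new_scores)

# Exams whose category changed between versions
changed = sum(n for (old, new), n in shifts.items() if old != new)
```

Pairs with different old and new categories mark exams that a fixed threshold would treat differently after the update.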
Clinical Implications
Clinicians should be aware that updates to AI mammography models can change malignancy risk scoring and influence screening performance metrics. Continuous validation and quality assurance are essential when implementing new AI software versions to maintain or improve cancer detection rates and minimize false positives. Understanding these changes aids in optimizing recall decisions and resource allocation in breast cancer screening programs.
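As an illustration of the kind of check a quality assurance process might run after an update, the sketch below computes cancers per 1,000 exams within each risk category. It is a hedged example only: the function name, the record layout, and the sample records are assumptions, not part of the study or of any screening program's tooling.

```python
from collections import defaultdict


def rates_by_category(records):
    """Return cancers per 1,000 exams for each risk category.

    records: iterable of (category, has_cancer) pairs, one per exam.
    This layout is hypothetical and chosen for illustration.
    """
    exams = defaultdict(int)
    cancers = defaultdict(int)
    for category, has_cancer in records:
        exams[category] += 1
        if has_cancer:
            cancers[category] += 1
    return {cat: 1000 * cancers[cat] / exams[cat] for cat in exams}


# Made-up records: eight low-risk exams without cancer,
# two high-risk exams of which one had cancer
records = [("low", False)] * 8 + [("high", True), ("high", False)]
rates = rates_by_category(records)
```

Running the same calculation on scores from each software version, for the same set of exams, would show whether cancers concentrate in different categories after the update.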
Conclusion
This study highlights that version updates in commercial AI models for mammography can significantly impact interpretive performance and screening outcomes. Ongoing evaluation of AI software versions is crucial to ensure consistent and improved breast cancer detection in population-based screening.
References
BreastScreen Norway Program Data and Ethics Approvals
ScreenPoint Medical BV - Transpara AI Model Versions 1.7 and 2.1
by Marthe Larsen, Christoph I. Lee, Marie B. Bergan, Åsne S. Holen, Håkon Lund-Hanssen, Solveig R. Hoff, Steinar Auensen, Jan F. Nygård, Kristina Lång, Yan Chen, Giske Ursin, Solveig Hofvind