ESR Essentials: common performance metrics in AI—practice recommendations by the European Society of Medical Imaging Informatics

By
Michail E. Klontzas
Kevin B. W. Groot Lipman
Tugba Akinci D’Antonoli
Anna Andreychenko
Renato Cuocolo
Matthias Dietzel
Salvatore Gitto
Henkjan Huisman
João Santinha
Federica Vernuccio
Jacob J. Visser
Merel Huisman
August 3, 2025

European Radiology

Overview

The European Society of Medical Imaging Informatics provides consensus guidelines emphasizing the importance of locally validating AI tools using task-specific performance metrics. These guidelines recommend assessing AI performance within the intended clinical context, employing both test-based and outcome-based metrics to ensure reliable diagnostic utility.

Background

Artificial intelligence is increasingly integrated into radiology, necessitating comprehensive evaluation of algorithm performance to ensure safe clinical use. Traditional metrics like accuracy and sensitivity are commonly used but may not fully capture real-world AI performance, especially in complex or low-prevalence scenarios. Misuse or overreliance on single metrics can mislead users and obscure algorithm limitations, potentially causing overdiagnosis and increased healthcare costs. Radiologists often face challenges interpreting AI performance metrics without standardized guidance, underscoring the need for clear practice recommendations.
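The low-prevalence problem described above can be made concrete with a small worked example. The following is only an illustrative sketch: the function name `confusion_metrics` and the counts (1000 scans, 10 with disease, a model that calls every scan negative) are hypothetical and do not come from the guideline.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against empty classes so the sketch never divides by zero.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Hypothetical cohort: 1000 scans, 1% prevalence, model predicts "negative" for all.
always_negative = confusion_metrics(tp=0, fp=0, tn=990, fn=10)
print(always_negative)
```

In this made-up cohort the accuracy is 0.99 although the sensitivity is 0.0: the model misses every diseased patient. This is why a single metric can hide an algorithm's limitations.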

Data Highlights

The recommended key performance metrics fall into three groups: segmentation metrics (Dice similarity coefficient, normalized surface distance), test-based metrics (sensitivity, specificity, area under the receiver operating characteristic (ROC) curve), and outcome-based metrics (precision, negative predictive value, F1-score, Matthews correlation coefficient, area under the precision-recall (PR) curve). Calibration metrics such as the Brier score and uncertainty quantification methods such as conformal prediction are also important.
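For readers who want to see how some of these metrics are calculated, here is a minimal sketch using their standard textbook definitions. The function names and example values are illustrative assumptions, not taken from the guideline. Normalized surface distance and the area-under-curve metrics are left out because they need more machinery than fits a short example.

```python
import math

def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for flattened binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    size_sum = sum(pred_mask) + sum(true_mask)
    # Convention assumed here: two empty masks count as perfect agreement.
    return 2 * intersection / size_sum if size_sum else 1.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient; uses all four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

# Hypothetical toy inputs, chosen only to exercise the functions.
dice = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
f1 = f1_score(tp=8, fp=2, fn=2)
mcc = matthews_corrcoef(tp=8, fp=2, tn=88, fn=2)
brier = brier_score([0.9, 0.1, 0.8, 0.3], [1, 0, 1, 0])
```

A lower Brier score indicates better-calibrated probabilities, whereas Dice, F1 and MCC are better when higher. MCC ranges from -1 to 1 and the others from 0 to 1.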

Key Findings

Local validation of AI tools using independent datasets reflecting institutional protocols and patient demographics is essential to confirm claimed performance (moderate evidence).
Task-specific metrics must be used: segmentation metrics for pixel-level accuracy, test-based metrics for distinguishing conditions, and outcome-based metrics for real-world diagnostic reliability.
Performance assessment should consider the clinical deployment context, involving radiologists and clinicians to define relevant metrics and account for local disease prevalence and vulnerable subgroups.
Calibration and uncertainty quantification metrics are critical to understanding model behavior and avoiding overconfident predictions, though their real-world implementation remains limited.
AI evaluation must be tailored to the level of input data (pixel, region, image, patient) and clinical problem, often requiring combinations of metrics for comprehensive assessment.
Misuse of metrics, such as overemphasis on single measures or ignoring class imbalance, can mislead users and hinder safe AI integration.
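The point about local disease prevalence can be illustrated with Bayes' rule: with sensitivity and specificity held fixed, the positive and negative predictive values change with prevalence. The sketch below is a hypothetical illustration. The function name `predictive_values` and the figures of 90% sensitivity and 90% specificity are assumptions, not results from the guideline.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    # Expected fraction of the population in each confusion-matrix cell.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv

# Same hypothetical tool, two hypothetical deployment settings.
ppv_high, npv_high = predictive_values(0.9, 0.9, prevalence=0.50)
ppv_low, npv_low = predictive_values(0.9, 0.9, prevalence=0.01)
print(f"50% prevalence: PPV={ppv_high:.3f}, NPV={npv_high:.3f}")
print(f" 1% prevalence: PPV={ppv_low:.3f}, NPV={npv_low:.3f}")
```

Under these assumed numbers the PPV falls from 0.9 at 50% prevalence to about 0.083 at 1% prevalence, so a positive call that is usually right in one setting is usually wrong in another. This is one reason vendor-reported performance may not transfer to a local population.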

Clinical Implications

Clinicians should ensure AI tools are validated locally with datasets representative of their patient populations and workflows. Employing a combination of task-specific metrics and involving multidisciplinary teams in metric selection enhances the reliability and clinical relevance of AI performance assessments. Awareness of calibration and uncertainty metrics can help prevent overreliance on potentially misleading AI outputs, promoting safer adoption in practice.
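As an illustration of what an uncertainty quantification method can look like in practice, the following is a minimal sketch of split conformal prediction for a classifier. It is one common textbook variant, not necessarily the one the guideline authors have in mind. The function names, calibration probabilities and class labels are all hypothetical. The method assumes a held-out calibration set that is exchangeable with future cases.

```python
import math

def conformal_threshold(cal_true_class_probs, alpha):
    """Threshold on the score (1 - probability of the true class) for target coverage 1 - alpha."""
    scores = sorted(1.0 - p for p in cal_true_class_probs)
    n = len(scores)
    # Finite-sample corrected rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration cases: every class is kept
    return scores[k - 1]

def prediction_set(class_probs, threshold):
    """Keep every class whose nonconformity score does not exceed the threshold."""
    return [label for label, p in class_probs.items() if 1.0 - p <= threshold]

# Hypothetical calibration data: model probability assigned to the correct class.
calibration = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.30]
threshold = conformal_threshold(calibration, alpha=0.25)

confident_case = prediction_set({"normal": 0.90, "abnormal": 0.10}, threshold)
uncertain_case = prediction_set({"normal": 0.52, "abnormal": 0.48}, threshold)
print(confident_case, uncertain_case)
```

In this toy example the confident case yields a single-label set, while the uncertain case yields both labels. A larger set signals a case that may need closer human review rather than reliance on the top prediction.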

Conclusion

Robust, context-aware evaluation of AI performance using a comprehensive set of validated metrics is critical for the safe and effective integration of AI in medical imaging. These guidelines equip radiologists with the knowledge to interpret AI metrics appropriately and make informed decisions about diagnostic AI tools.

References

European Society of Medical Imaging Informatics -- Key Performance Indicators for AI in Medical Imaging: Practice Guidelines

Original Source(s)

ESR Essentials: common performance metrics in AI—practice recommendations by the European Society of Medical Imaging Informatics