Clinical Scorecard: Key Performance Indicators for AI in Medical Imaging: Practice Guidelines from the European Society of Medical Imaging Informatics
At a Glance
Category | Detail
Condition | Artificial intelligence (AI) tools in medical imaging
Key Mechanisms | Evaluation of AI performance using task-specific metrics for segmentation, detection, classification, calibration, uncertainty quantification, and explainability
Target Population | Patients undergoing medical imaging across diverse clinical settings and demographics
Care Setting | Radiology departments and clinical workflows integrating AI diagnostic tools
Key Highlights
Locally validate AI tools beyond CE-marking using independent datasets reflecting institutional protocols and patient demographics.
Use a combination of segmentation, test-based, and outcome-based performance metrics to comprehensively assess AI diagnostic accuracy.
Consider deployment context by engaging clinicians to define relevant metrics and assess performance across clinically meaningful subgroups.
Guideline-Based Recommendations
Diagnosis
Apply task-specific metrics such as Dice similarity coefficient for segmentation and sensitivity/specificity for classification tasks.
Assess AI performance at multiple levels: pixel, region, scan, and patient to capture clinical relevance.
Incorporate calibration metrics (e.g., Brier score) and uncertainty quantification (e.g., conformal prediction) to understand model trustworthiness.
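As an illustration of the metrics named above, the following is a minimal Python sketch, not a reference implementation from the guideline. It computes the Dice similarity coefficient on flattened binary masks, sensitivity and specificity on case-level labels, and the Brier score on predicted probabilities. The function names and toy inputs are hypothetical, and the code assumes binary labels with both classes present.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    intersection = sum(p and t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def sensitivity_specificity(pred_labels, true_labels):
    """Return (sensitivity, specificity) for binary labels (1 = disease)."""
    pairs = list(zip(pred_labels, true_labels))
    tp = sum(p == 1 and t == 1 for p, t in pairs)
    fn = sum(p == 0 and t == 1 for p, t in pairs)
    tn = sum(p == 0 and t == 0 for p, t in pairs)
    fp = sum(p == 1 and t == 0 for p, t in pairs)
    # Assumes at least one positive and one negative case.
    return tp / (tp + fn), tn / (tn + fp)


def brier_score(probs, true_labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for p, t in zip(probs, true_labels)) / len(probs)


# Toy inputs, invented for illustration only.
dice = dice_coefficient([1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0])
sens, spec = sensitivity_specificity(
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
)
brier = brier_score([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
print(f"Dice={dice:.3f} sensitivity={sens:.3f} specificity={spec:.3f} Brier={brier:.3f}")
```

A lower Brier score indicates probabilities that sit closer to the observed outcomes, whereas Dice, sensitivity, and specificity are better when higher.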
Management
Engage radiologists and clinicians in metric selection and interpretation to align AI evaluation with clinical goals and workflows.
Avoid reliance on single metrics vulnerable to class imbalance; report both test-based and outcome-based metrics.
Use independent, institution-specific datasets for local validation to verify that AI performance matches claimed results.
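The class-imbalance point can be made concrete with a small sketch. It assumes a hypothetical screening set of 1,000 scans at 1% prevalence and a degenerate model that labels every scan negative. Reading "test-based" as sensitivity and specificity, and "outcome-based" as positive and negative predictive values, is an interpretation made here for illustration. The function name and data are invented.

```python
def screening_summary(pred_labels, true_labels):
    """Report accuracy alongside test-based and outcome-based metrics."""
    pairs = list(zip(pred_labels, true_labels))
    tp = sum(p == 1 and t == 1 for p, t in pairs)
    fp = sum(p == 1 and t == 0 for p, t in pairs)
    tn = sum(p == 0 and t == 0 for p, t in pairs)
    fn = sum(p == 0 and t == 1 for p, t in pairs)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),   # test-based
        "specificity": ratio(tn, tn + fp),   # test-based
        "ppv": ratio(tp, tp + fp),           # outcome-based
        "npv": ratio(tn, tn + fn),           # outcome-based
    }


# Hypothetical cohort: 10 diseased scans among 1,000 (1% prevalence).
true_labels = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # a model that never flags disease

summary = screening_summary(always_negative, true_labels)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy is 0.99 even though sensitivity is 0, which is why a single headline metric can hide a clinically useless model.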
Monitoring & Follow-up
Continuously assess AI performance across vulnerable or clinically meaningful subgroups to detect variability.
Monitor calibration and uncertainty metrics to prevent overconfidence in AI predictions.
Standardize reporting of performance metrics to facilitate transparent and reproducible AI evaluation.
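Subgroup monitoring can be sketched as a stratified version of the same metric. The example below computes sensitivity per subgroup from (subgroup, true label, predicted label) records. The subgroup names "scanner_A" and "scanner_B" and all records are hypothetical; in practice the subgroups would be chosen with clinicians, and small subgroups would need confidence intervals, which this sketch omits.

```python
from collections import defaultdict


def sensitivity_by_subgroup(records):
    """records: iterable of (subgroup, true_label, pred_label) tuples.

    Returns {subgroup: sensitivity} for subgroups with at least one
    positive case; subgroups without positives are left out.
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for group, truth, pred in records:
        if truth == 1:
            if pred == 1:
                true_pos[group] += 1
            else:
                false_neg[group] += 1
    groups = set(true_pos) | set(false_neg)
    return {g: true_pos[g] / (true_pos[g] + false_neg[g]) for g in groups}


# Hypothetical audit records, invented for illustration.
records = [
    ("scanner_A", 1, 1), ("scanner_A", 1, 1), ("scanner_A", 1, 1),
    ("scanner_A", 1, 1), ("scanner_A", 0, 0),
    ("scanner_B", 1, 1), ("scanner_B", 1, 0), ("scanner_B", 1, 1),
    ("scanner_B", 1, 0), ("scanner_B", 0, 0),
]

per_group = sensitivity_by_subgroup(records)
for group in sorted(per_group):
    print(f"{group}: sensitivity={per_group[group]:.2f}")
```

A pooled sensitivity of 0.75 across these records would mask the gap between the two hypothetical scanners (1.00 versus 0.50).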
Risks
Inappropriate use or interpretation of metrics can mislead users, leading to overdiagnosis and increased healthcare costs.
Lack of standardized metric reporting places the burden of assessment on end-users, risking flawed assessments and unsafe AI integration.
Limited real-world implementation of calibration and uncertainty methods may result in overestimation of AI performance.
Patient & Prescribing Data
Patients undergoing diagnostic imaging across various institutions with diverse demographics and disease prevalence
AI tools must be locally validated and evaluated using comprehensive, clinically relevant metrics to ensure reliable diagnostic support and safe integration into patient care pathways.
Clinical Best Practices
Validate AI tools locally with datasets that are independent of the development data and reflect local imaging protocols and patient demographics.
Use a combination of segmentation, test-based, and outcome-based metrics to capture different aspects of AI performance.
Engage multidisciplinary clinical teams to define relevant performance metrics and interpret results within the deployment context.
Incorporate calibration and uncertainty quantification metrics to assess prediction reliability and avoid overconfidence.
Report performance metrics transparently and standardize metric usage to support informed clinical decision-making.
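For uncertainty quantification, one commonly used approach is split conformal prediction. The sketch below is a minimal binary-classification version under stated assumptions: calibration and new cases are exchangeable, the nonconformity score is one minus the probability assigned to the true class, and all probabilities are invented. It is not presented as the guideline's method. Function names are hypothetical.

```python
import math


def conformal_threshold(cal_probs_true_class, alpha):
    """Score threshold from a held-out calibration set.

    cal_probs_true_class: model probability assigned to the TRUE class
    of each calibration case. alpha: tolerated miscoverage (e.g. 0.2).
    """
    scores = sorted(1.0 - p for p in cal_probs_true_class)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed conformal rank
    if rank > n:
        # Too few calibration cases: every label must be kept.
        return float("inf")
    return scores[rank - 1]


def prediction_set(prob_positive, threshold):
    """Labels whose nonconformity score does not exceed the threshold."""
    labels = []
    if 1.0 - (1.0 - prob_positive) <= threshold:  # score for label 0
        labels.append(0)
    if 1.0 - prob_positive <= threshold:          # score for label 1
        labels.append(1)
    return labels


# Hypothetical calibration probabilities for the true class.
calibration = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
threshold = conformal_threshold(calibration, alpha=0.2)

for prob in (0.95, 0.5, 0.1):
    print(f"P(disease)={prob}: prediction set {prediction_set(prob, threshold)}")
```

A set containing both labels signals that the model cannot separate the classes for that case at the chosen coverage level, which is a natural trigger for closer human review.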
by Michail E. Klontzas, Kevin B. W. Groot Lipman, Tugba Akinci D’Antonoli, Anna Andreychenko, Renato Cuocolo, Matthias Dietzel, Salvatore Gitto, Henkjan Huisman, João Santinha, Federica Vernuccio, Jacob J. Visser, Merel Huisman