Clinical Report: Modality Contribution Metric in Multimodal Deep Learning for Medicine
Overview
A novel, model- and performance-agnostic metric quantifies the contribution of individual modalities in multimodal medical datasets processed by deep learning models. This metric enables detection of unimodal collapse and comparison of architectures across diverse medical imaging and tabular data combinations.
Background
Multimodal datasets combining imaging, text, and tabular patient data are increasingly prevalent in medical AI applications such as cancer prognosis, cardiovascular risk assessment, and diabetic retinopathy classification. Existing interpretability methods often cannot quantify modality importance in a model-agnostic manner, which limits trust and clinical integration. Prior approaches depend on the model architecture or on model performance, and attention-based methods do not adequately measure modality contributions. Robust, interpretable metrics are therefore needed to explain multimodal deep learning models in clinical contexts.
Data Highlights
The metric was evaluated on three multimodal medical datasets: (1) 2D chest X-rays combined with clinical reports (Open I), (2) 2D color ophthalmological images plus patient information (BRSET), and (3) 3D head and neck CT scans plus patient information (Hecktor 22). It assigns each modality a contribution between 0 and 1, and the contributions sum to 1 across all modalities. A value close to 1 for a single modality therefore signals unimodal collapse, and the shared scale allows direct comparison between architectures.
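One way to write down a score with these properties is sketched below. The symbols are assumptions introduced here for illustration and are not taken from the source: $d_i \ge 0$ stands for the aggregate output change attributed to modality $i$, $M$ for the number of modalities, and $m_i$ for the resulting contribution. The sketch assumes at least one $d_j$ is nonzero.

```latex
m_i = \frac{d_i}{\sum_{j=1}^{M} d_j},
\qquad 0 \le m_i \le 1,
\qquad \sum_{i=1}^{M} m_i = 1
```

Under this reading, a model that uses its inputs evenly would give values near $1/M$, while $m_i$ close to 1 for one modality would correspond to unimodal collapse.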
Key Findings
The proposed modality contribution metric is both model-agnostic (black-box) and performance-agnostic, overcoming limitations of prior methods.
It quantifies the importance of each modality by measuring output changes after systematic occlusion of modality-specific input features.
The metric can detect unimodal collapse, where a model relies excessively on a single modality, which may indicate suboptimal multimodal integration.
Applying the metric to diverse medical datasets that combine imaging with tabular or text data demonstrated its versatility and interpretability.
Masking strategies are modality-specific, e.g., pixel or voxel patches for imaging and individual features for tabular data, balancing interpretability and computational efficiency.
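The occlusion procedure described in the findings above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes a black-box `model(image, row)` callable, a 2D image, a 1D tabular vector, a constant fill value for occluded inputs, and summation of absolute output changes over all masks. All function names, the patch size, and the two toy models are hypothetical.

```python
import numpy as np

def occlude_image_patches(image, patch, fill=0.0):
    """Yield copies of a 2D image with one square patch set to `fill`."""
    height, width = image.shape[:2]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            masked = image.copy()
            masked[top:top + patch, left:left + patch] = fill
            yield masked

def occlude_tabular_features(row, fill=0.0):
    """Yield copies of a 1D feature vector with one feature set to `fill`."""
    for j in range(row.shape[0]):
        masked = row.copy()
        masked[j] = fill
        yield masked

def modality_contributions(model, image, row, patch=4):
    """Return normalized (image, tabular) contributions for one sample.

    Only model outputs are used, so the model is treated as a black box
    and no labels or performance scores are needed.
    """
    baseline = model(image, row)
    # Assumption: absolute output changes are summed over all masks.
    image_change = sum(np.abs(baseline - model(m, row)).sum()
                       for m in occlude_image_patches(image, patch))
    tabular_change = sum(np.abs(baseline - model(image, m)).sum()
                         for m in occlude_tabular_features(row))
    raw = np.array([image_change, tabular_change], dtype=float)
    total = raw.sum()
    if total == 0:
        # Assumption: split evenly when no occlusion changes the output.
        return np.full(2, 0.5)
    return raw / total

# Hypothetical toy models standing in for a trained network.
def tabular_only_model(image, row):
    return np.array([row.sum()])          # ignores the image entirely

def mixed_model(image, row):
    return np.array([image.mean() + row.sum()])
```

Calling `modality_contributions(tabular_only_model, image, row)` on any image returns a contribution of 0 for the image, which is the pattern the report calls unimodal collapse. Coarser patches reduce the number of forward passes at the cost of spatial detail, which is the efficiency trade-off mentioned above.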
Clinical Implications
This modality contribution metric enhances interpretability of multimodal AI models in clinical practice, fostering trust by clarifying how different data sources influence predictions. It supports model selection and validation by identifying overreliance on single modalities, potentially improving diagnostic accuracy and robustness. Ultimately, it may accelerate safe integration of multimodal AI tools into patient care workflows.
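During model selection, the overreliance check could be as simple as a threshold on the largest contribution. The sketch below is an assumption for illustration only: the function name and the 0.9 cutoff are hypothetical, and the source does not specify a threshold.

```python
def flag_unimodal_collapse(contributions, threshold=0.9):
    """Return the dominant modality name if its share exceeds `threshold`.

    `contributions` maps modality names to non-negative scores.
    The default threshold is an illustrative assumption, not a value
    taken from the report.
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contributions must contain a positive value")
    name, score = max(contributions.items(), key=lambda item: item[1])
    return name if score / total > threshold else None
```

For example, with made-up scores, `flag_unimodal_collapse({"image": 0.97, "tabular": 0.03})` flags `"image"`, while a more balanced split such as `{"image": 0.6, "tabular": 0.4}` returns `None`.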
Conclusion
The development of a performance- and model-agnostic modality contribution metric addresses a critical gap in interpreting multimodal deep learning models for medical applications. This approach facilitates transparent evaluation and comparison of multimodal architectures, promoting trust and clinical adoption.
References
Open I Dataset [1]
BRSET Dataset [23]
Hecktor 22 MICCAI Challenge [22]
GradCAM Method [27]
Occlusion Sensitivity [30]
Breiman 2001 -- Permutation Importance [5]
Fisher et al. -- Extension of Permutation Importance [8]
SHAP Method [20]
LIME Method [26]
Ngiam et al. 2011 -- Multimodal Deep Learning [24]