What are you looking at? Modality contribution in multimodal medical deep learning

By Christian Gapp, Elias Tappeiner, Martin Welk, Karl Fritscher, Elke R. Gizewski, and Rainer Schubert

October 2, 2025

Clinical Report: Modality Contribution Metric in Multimodal Deep Learning for Medicine

Overview

This report summarizes a metric that quantifies how much each modality in a multimodal medical dataset contributes to the output of a deep learning model. The metric is both model-agnostic and performance-agnostic, so it can be used to detect unimodal collapse and to compare architectures across different combinations of medical imaging, text, and tabular data.

Background

Multimodal datasets combining imaging, text, and tabular patient data are increasingly prevalent in medical AI applications such as cancer prognosis, cardiovascular risk assessment, and diabetic retinopathy classification. Existing interpretability methods often cannot quantify modality importance in a model-agnostic manner, which limits trust and clinical integration. Prior approaches depend on the model architecture or on model performance, and attention-based methods do not adequately measure modality contributions. Robust, interpretable metrics are therefore needed to explain multimodal deep learning models in clinical contexts.

Data Highlights

The metric was evaluated on three multimodal medical datasets: (1) 2D chest X-rays combined with clinical reports (Open I), (2) 2D color ophthalmological images combined with patient information (BRSET), and (3) 3D head and neck CT scans combined with patient information (Hecktor 22). Each modality receives a contribution score between 0 and 1, and the scores sum to 1 across all modalities. A score close to 1 for a single modality therefore signals unimodal collapse, and scores from different architectures can be compared directly.
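
The normalization described above can be written compactly. The notation here is an assumption introduced for illustration and is not taken from the report: $d^{(i)}_{k,l} \ge 0$ denotes the change in model output for sample $k$ when the $l$-th mask of modality $i$ is applied, and $m_i$ denotes the resulting contribution of modality $i$ among $M$ modalities.

```latex
m_i \;=\; \frac{\sum_{k}\sum_{l} d^{(i)}_{k,l}}
               {\sum_{j=1}^{M}\sum_{k}\sum_{l} d^{(j)}_{k,l}},
\qquad 0 \le m_i \le 1,
\qquad \sum_{i=1}^{M} m_i = 1 .
```

Under this assumed form, the bounds and the sum-to-one property follow directly from the non-negativity of the $d^{(i)}_{k,l}$, provided at least one occlusion changes the output so that the denominator is non-zero.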

Key Findings

  • The proposed modality contribution metric is both model-agnostic (black-box) and performance-agnostic, overcoming limitations of prior methods.
  • It quantifies the importance of each modality by measuring output changes after systematic occlusion of modality-specific input features.
  • The metric can detect unimodal collapse, where a model relies excessively on a single modality, which may indicate suboptimal multimodal integration.
  • Application to diverse medical datasets combining imaging with tabular or text data demonstrated the metric's versatility and interpretability.
  • Masking strategies are modality-specific, e.g., pixel or voxel patches for imaging and individual features for tabular data, balancing interpretability and computational efficiency.
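
The occlusion procedure summarized in the findings above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `modality_contributions`, the dictionary-based sample format, the zero fill value, and the use of summed absolute output differences are all hypothetical choices made for this example.

```python
import numpy as np


def modality_contributions(model, samples, masks_per_modality, fill_value=0.0):
    """Estimate per-modality contribution scores that sum to 1.

    model: black-box callable mapping {modality_name: array} to an output array.
    samples: list of dicts {modality_name: np.ndarray}.
    masks_per_modality: {modality_name: list of boolean arrays}; True marks
        the entries to occlude (e.g. an image patch or one tabular feature).
    fill_value: value written into occluded entries (an assumption; other
        fill strategies are possible).
    """
    totals = {name: 0.0 for name in masks_per_modality}
    for sample in samples:
        baseline = np.asarray(model(sample), dtype=float)
        for name, masks in masks_per_modality.items():
            for mask in masks:
                occluded = dict(sample)  # shallow copy; other modalities stay intact
                occluded[name] = np.where(mask, fill_value, sample[name])
                output = np.asarray(model(occluded), dtype=float)
                # Accumulate the absolute change in output caused by this mask.
                totals[name] += float(np.abs(baseline - output).sum())
    grand_total = sum(totals.values())
    if grand_total == 0.0:
        # No occlusion changed the output; fall back to an even split.
        return {name: 1.0 / len(totals) for name in totals}
    return {name: value / grand_total for name, value in totals.items()}


# Toy black-box model (hypothetical): weights the image three times as
# strongly as the tabular features.
def toy_model(sample):
    return np.array([3.0 * sample["image"].mean() + 1.0 * sample["tabular"].mean()])


# One single-pixel mask per image pixel, one mask per tabular feature.
image_masks = []
for row in range(2):
    for col in range(2):
        mask = np.zeros((2, 2), dtype=bool)
        mask[row, col] = True
        image_masks.append(mask)
tabular_masks = [np.eye(2, dtype=bool)[i] for i in range(2)]

samples = [{"image": np.ones((2, 2)), "tabular": np.ones(2)}]
scores = modality_contributions(
    toy_model, samples, {"image": image_masks, "tabular": tabular_masks}
)
print(scores)  # {'image': 0.75, 'tabular': 0.25}
```

Because the function only calls the model and compares outputs, it needs neither access to the model's internals nor ground-truth labels, which mirrors the model-agnostic and performance-agnostic properties described above. Summing over masks means that the number and size of masks per modality affect the scores, so the masking granularity should be chosen deliberately.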

Clinical Implications

This modality contribution metric enhances interpretability of multimodal AI models in clinical practice, fostering trust by clarifying how different data sources influence predictions. It supports model selection and validation by identifying overreliance on single modalities, potentially improving diagnostic accuracy and robustness. Ultimately, it may accelerate safe integration of multimodal AI tools into patient care workflows.
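
As a hedged illustration of how contribution scores might support model selection, the sketch below flags architectures whose scores are dominated by one modality. The function name `flag_unimodal_collapse`, the 0.9 threshold, and the example scores are all hypothetical values chosen for this example; the report does not specify a cutoff, and the numbers are not results from the study.

```python
def flag_unimodal_collapse(contributions, threshold=0.9):
    """Return the dominant modality if its share reaches the threshold, else None.

    contributions: {modality_name: non-negative score}; scores are renormalized
        here, so they do not have to sum exactly to 1.
    threshold: hypothetical cutoff for calling a model collapsed.
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contribution scores must have a positive sum")
    name, value = max(contributions.items(), key=lambda item: item[1])
    return name if value / total >= threshold else None


# Made-up scores for two candidate architectures, for illustration only.
candidates = {
    "architecture_a": {"image": 0.97, "tabular": 0.03},
    "architecture_b": {"image": 0.62, "tabular": 0.38},
}
report = {name: flag_unimodal_collapse(scores) for name, scores in candidates.items()}
print(report)  # {'architecture_a': 'image', 'architecture_b': None}
```

A flagged architecture is not necessarily worse, since one modality may genuinely carry most of the signal for a given task. The flag is best read as a prompt to inspect the fusion strategy before clinical validation.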

Conclusion

The development of a performance- and model-agnostic modality contribution metric addresses a critical gap in interpreting multimodal deep learning models for medical applications. This approach facilitates transparent evaluation and comparison of multimodal architectures, promoting trust and clinical adoption.
