Anatomy-guided visual prompt tuning for cross-modal breast cancer understanding

By
Shaorong Zhao
Qingxiang Meng
Yang He
Xiaotong Xu
Jiayao Zhu
Jiawen Qiu
Chao Wu
Yamei Han
Jinhai Deng
Teng Pan
Jingjing Liu
February 13, 2026

npj Digital Medicine

Overview

This study introduces A-VPT, an anatomy-guided visual prompt tuning framework that integrates explicit anatomical priors into Vision Transformer models for breast cancer imaging. A-VPT reports state-of-the-art performance in lesion classification and segmentation across mammography, ultrasound, and MRI datasets while tuning fewer than 2% of the parameters that full fine-tuning would update.

Background

Breast cancer detection across imaging modalities is challenging because lesions are heterogeneous and models trained on one modality often behave inconsistently on another. Vision Transformers (ViTs) with parameter-efficient fine-tuning have made model adaptation cheaper, but they often do not incorporate domain-specific anatomical knowledge. Embedding anatomical priors into deep learning models may improve interpretability and generalization. This work proposes a method that integrates glandular, fatty, and ductal tissue information directly into the prompt space of ViTs to enhance cross-modal breast cancer analysis.
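
To make the idea of placing tissue information in the prompt space more concrete, here is a minimal, dependency-free sketch. It is not the authors' implementation. It assumes that each tissue type has a small learnable embedding, that an anatomical mask supplies the fraction of the image covered by each tissue, and that each prompt token is the tissue embedding scaled by that fraction and prepended to the frozen backbone's patch tokens. The names TISSUE_EMBED, make_prompts, and prepend_prompts, and the toy width of 4, are hypothetical.

```python
# Illustrative sketch only: names, shapes, and the weighting rule are
# assumptions, not details taken from the A-VPT paper.

TISSUES = ("glandular", "fatty", "ductal")
DIM = 4  # toy embedding width, chosen only to keep the example readable

# Hypothetical learnable tissue embeddings (the only tunable part here).
TISSUE_EMBED = {
    "glandular": [1.0, 0.0, 0.0, 0.5],
    "fatty":     [0.0, 1.0, 0.0, 0.5],
    "ductal":    [0.0, 0.0, 1.0, 0.5],
}

def make_prompts(tissue_fractions):
    """Build one prompt token per tissue type.

    tissue_fractions maps a tissue name to the fraction of the image it
    covers (assumed to come from an anatomical mask). Each prompt is the
    tissue embedding scaled by that fraction.
    """
    prompts = []
    for name in TISSUES:
        weight = tissue_fractions.get(name, 0.0)
        prompts.append([weight * v for v in TISSUE_EMBED[name]])
    return prompts

def prepend_prompts(prompts, patch_tokens):
    """Prepend prompt tokens to the (frozen) patch tokens."""
    return prompts + patch_tokens

fractions = {"glandular": 0.5, "fatty": 0.25, "ductal": 0.25}
patches = [[0.1] * DIM for _ in range(6)]  # stand-in for frozen patch embeddings
tokens = prepend_prompts(make_prompts(fractions), patches)

print(len(tokens))  # 9: three tissue prompts followed by six patch tokens
print(tokens[0])    # [0.5, 0.0, 0.0, 0.25]
```

In this sketch only the tissue embeddings would be trained; the patch tokens stand in for the output of a backbone whose weights stay fixed.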

Data Highlights

Dataset	Modality	Task	Performance	Tunable Parameters (% of full fine-tuning)
INbreast	Mammography	Lesion Classification & Segmentation	State-of-the-art	<2
BUSI	Ultrasound	Lesion Classification & Segmentation	State-of-the-art	<2
Duke-Breast-MRI	MRI	Lesion Classification & Segmentation	State-of-the-art	<2
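
As a rough worked illustration of what a budget of under 2% tunable parameters means in absolute terms, the arithmetic below assumes a backbone of about 86 million parameters (roughly the size of a ViT-B/16) and a hypothetical deep-prompt configuration. The source states neither the backbone size nor the prompt configuration, so both numbers are assumptions.

```python
# Illustrative arithmetic only. The backbone size and the prompt
# configuration are assumptions; the source does not state either.

BACKBONE_PARAMS = 86_000_000  # assumed: roughly the size of a ViT-B/16
BUDGET_PERCENT = 2            # from the source: fewer than 2% tunable

def tunable_percent(tunable, total):
    """Share of parameters that receive gradient updates, in percent."""
    return 100.0 * tunable / total

# Upper bound on tunable parameters implied by the 2% budget.
max_tunable = BACKBONE_PARAMS * BUDGET_PERCENT // 100

# Hypothetical setup: 50 prompt tokens of width 768 in each of 12 layers.
prompt_params = 12 * 50 * 768

print(max_tunable)    # 1720000
print(prompt_params)  # 460800
print(round(tunable_percent(prompt_params, BACKBONE_PARAMS), 2))  # 0.54
```

Under these assumptions the prompts alone would use about a quarter of the budget, leaving room for task heads or other small tunable modules.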

Key Findings

A-VPT dynamically generates tissue-aware prompts guided by glandular, fatty, and ductal region embeddings within a frozen Vision Transformer backbone.
Hierarchical prompt-token interactions across transformer layers enhance anatomical semantic integration.
Cross-modal contrastive alignment harmonizes anatomical semantics among mammography, ultrasound, and MRI modalities.
A-VPT achieves state-of-the-art lesion classification and segmentation performance on three benchmark datasets while tuning fewer than 2% of the parameters updated in full fine-tuning.
Qualitative analyses reveal interpretable attention patterns consistent with radiological anatomical structures.
Embedding anatomical priors improves model efficiency, generalization, and interpretability, bridging deep learning with human anatomical reasoning.
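
The cross-modal contrastive alignment listed above can be illustrated with a small sketch. The source does not specify the loss, so this example swaps in a standard InfoNCE objective as an assumption: the embedding of a tissue type from one modality should be most similar to the embedding of the same tissue type from another modality. The function names, toy vectors, and temperature of 0.1 are hypothetical.

```python
import math

# Illustrative sketch only: InfoNCE is an assumed stand-in for the
# unspecified contrastive objective; all values below are toy data.

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i] and no other."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)

# Toy tissue embeddings (glandular, fatty, ductal) from two modalities.
mammo = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
ultra_aligned = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
ultra_shuffled = [ultra_aligned[1], ultra_aligned[2], ultra_aligned[0]]

aligned_loss = info_nce(mammo, ultra_aligned)
shuffled_loss = info_nce(mammo, ultra_shuffled)
print(aligned_loss < shuffled_loss)  # True
```

The loss is low when matching tissue types from the two modalities point in similar directions and high when the pairing is scrambled, which is the behavior a cross-modal alignment term is meant to reward.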

Clinical Implications

Incorporating explicit anatomical knowledge into AI models can enhance breast cancer detection accuracy across multiple imaging modalities while reducing computational cost. The interpretable attention maps, which align with anatomical structures, may increase clinician trust and ease integration into diagnostic workflows. This approach supports robust multi-domain generalization, potentially enabling earlier and more reliable breast cancer diagnosis.

Conclusion

Anatomy-guided visual prompt tuning represents a promising strategy to improve cross-modal breast cancer imaging analysis by embedding domain-specific anatomical priors. This method advances both performance and interpretability while maintaining parameter efficiency.

References

Moreira et al. 2012 -- INbreast: toward a full-field digital mammographic database
Al-Dhabyani et al. 2020 -- Dataset of breast ultrasound images
Saha et al. 2021 -- Dynamic contrast-enhanced magnetic resonance images of breast cancer patients
Dosovitskiy et al. 2021 -- An image is worth 16x16 words: transformers for image recognition at scale
Jia et al. 2022 -- Visual prompt tuning