A generalizable 3D framework and model for self-supervised learning in medical imaging

By
Tony Xu
Sepehr Hosseini
Chris Anderson
Anthony Rinaldi
Rahul G. Krishnan
Anne L. Martel
Maged Goubran
November 7, 2025

npj Digital Medicine

Overview

3DINO-ViT is a general-purpose 3D Vision Transformer pretrained with the self-supervised 3DINO framework on nearly 100,000 multimodal 3D medical scans covering more than 10 organs. It outperforms state-of-the-art pretrained models on diverse downstream segmentation and classification tasks, demonstrating strong generalizability and scalability.

Background

Deep learning has shown promise in medical imaging tasks such as detection, diagnosis, and risk profiling, but it requires large labeled datasets that are costly and time-consuming to obtain, especially for 3D modalities. Self-supervised learning (SSL) reduces this dependence by learning from unlabeled data. Existing SSL methods for 3D medical imaging, however, often rely on simple pretext tasks and on organ- or modality-specific datasets, which limits their generalizability. To overcome these limitations, the 3DINO framework adapts the DINOv2 SSL pipeline to 3D inputs and is used to pretrain a general-purpose Vision Transformer (ViT) on a large, diverse dataset.
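
One practical consequence of adapting a 2D pipeline to 3D inputs is that the number of patch tokens grows with the cube of the crop size rather than the square. The sketch below only illustrates that arithmetic; the helper name `num_patch_tokens` and the crop and patch sizes are assumptions chosen for the example, not values taken from the paper.

```python
from math import prod


def num_patch_tokens(volume_shape, patch_size):
    """Count the non-overlapping patches a ViT would embed as tokens."""
    if any(dim % patch_size != 0 for dim in volume_shape):
        raise ValueError("each dimension must be divisible by patch_size")
    return prod(dim // patch_size for dim in volume_shape)


# Hypothetical sizes, used only for illustration.
tokens_3d = num_patch_tokens((96, 96, 96), 16)  # 6 * 6 * 6
tokens_2d = num_patch_tokens((96, 96), 16)      # 6 * 6
print(tokens_3d, tokens_2d)  # 216 36
```

With these illustrative sizes, a 3D crop produces six times as many tokens as a 2D image with the same side length, which is one reason 3D pretraining is more demanding in memory and compute.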

Data Highlights

Imaging Modality	Number of Volumes
MRI	70,434
CT	27,815
Brain PET	566
Total 3D Scans	~100,000
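
As a quick check on the table above, the listed counts can be summed and expressed as shares of the pretraining set. This sketch uses only the three counts shown in the table; the variable names are illustrative.

```python
# Volume counts copied from the table above.
volumes = {"MRI": 70_434, "CT": 27_815, "Brain PET": 566}

total = sum(volumes.values())
shares = {name: round(100 * count / total, 1) for name, count in volumes.items()}

print(total)  # 98815
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The three listed rows sum to 98,815 volumes, consistent with the "nearly 100,000" figure, with MRI contributing roughly 71% and CT roughly 28% of the scans.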

Key Findings

3DINO-ViT, pretrained on a large multimodal dataset, outperforms six other initialization methods, including random initialization, Swin ViT pretrained models, and masked image modeling approaches.
The framework combines image-level and patch-level objectives with multiple augmentations per scan to extract salient features for both segmentation and classification tasks.
3DINO-ViT demonstrates strong generalizability to out-of-distribution organs and modalities, such as left atrium segmentation on MRI and tumor segmentation on 3D breast ultrasound.
Performance was validated on multiple benchmarks including BraTS (brain tumor MRI segmentation), BTCV (CT abdominal organ segmentation), brain age classification, and COVID-CT-MD lung CT classification.
The model incorporates a 3D ViT-Adapter module to inject spatial inductive biases, enhancing downstream segmentation accuracy.
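
To make the combination of image-level and patch-level objectives more concrete, the following is a minimal sketch of a DINO-style self-distillation loss, written in plain Python on toy numbers. It is an illustration under assumptions, not the authors' implementation: the function names, temperatures, and example scores are hypothetical, and teacher centering and the moving-average teacher update are omitted for brevity.

```python
import math


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax, shifted by the max for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def distill_loss(student_logits, teacher_logits,
                 student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from a sharpened teacher distribution to the student."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    student_logp = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_logp))


def image_level_loss(student_views, teacher_views):
    """Average over (teacher crop, student crop) pairs from different crops.

    Assumes the first len(teacher_views) student entries are the same
    crops the teacher sees, in the same order.
    """
    losses = [
        distill_loss(s, t)
        for ti, t in enumerate(teacher_views)
        for si, s in enumerate(student_views)
        if si != ti  # skip pairs where both networks see the same crop
    ]
    return sum(losses) / len(losses)


def patch_level_loss(student_patches, teacher_patches, masked):
    """Average over patch tokens that were masked in the student input.

    Assumes at least one entry of `masked` is True.
    """
    losses = [
        distill_loss(s, t)
        for s, t, m in zip(student_patches, teacher_patches, masked)
        if m
    ]
    return sum(losses) / len(losses)


# Toy scores over 3 prototypes: two global crops plus one local crop.
teacher_views = [[2.0, 0.5, 0.1], [1.8, 0.7, 0.2]]
student_views = [[1.5, 0.6, 0.3], [1.7, 0.4, 0.2], [0.9, 1.0, 0.8]]

# Toy patch tokens for one crop; only the masked position contributes.
teacher_patches = [[1.0, 0.2, 0.1], [0.1, 1.2, 0.3]]
student_patches = [[0.8, 0.3, 0.2], [0.2, 0.9, 0.4]]
masked = [True, False]

total = image_level_loss(student_views, teacher_views) + patch_level_loss(
    student_patches, teacher_patches, masked
)
print(f"combined loss: {total:.4f}")
```

In this sketch the image-level term compares whole-crop summaries across different augmentations of the same scan, while the patch-level term compares individual tokens at masked positions. This is one way to read the claim that the learned features serve both classification and segmentation.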

Clinical Implications

A general-purpose pretrained 3D model such as 3DINO-ViT can significantly reduce the need for large labeled datasets in medical imaging, facilitating broader adoption of deep learning in clinical workflows. Its strong performance across multiple organs, modalities, and tasks suggests it can serve as a foundation model to accelerate the development of diagnostic and prognostic tools in diverse clinical scenarios.

Conclusion

3DINO-ViT represents a scalable and versatile self-supervised learning approach that effectively leverages large unlabeled 3D medical imaging datasets to produce robust, generalizable representations. This advancement holds promise for improving accuracy and efficiency in a wide range of medical imaging applications.

References

AICONSlab/3DINO -- 3DINO GitHub Repository
Tang et al. 2023 -- Sliding Window Swin ViT for 3D Medical Imaging
BraTS Challenge 2021 -- Brain Tumor Segmentation Benchmark
BTCV Challenge -- Beyond the Cranial Vault Abdominal Organ Segmentation
DINOv2 SSL Pipeline -- Self-Supervised Learning Method

