3DINO-ViT: A Versatile Self-Supervised 3D Model for Medical Imaging
Overview
3DINO-ViT is a general-purpose 3D Vision Transformer pretrained with the self-supervised 3DINO framework on nearly 100,000 multimodal 3D medical scans covering more than 10 organs. It outperforms state-of-the-art pretrained models on diverse downstream segmentation and classification tasks, demonstrating strong generalizability and scalability.
Background
Deep learning has shown promise in medical imaging tasks such as detection, diagnosis, and risk profiling, but it requires large labeled datasets, which are costly and time-consuming to obtain, especially for 3D modalities. Self-supervised learning (SSL) reduces dependence on labeled data by leveraging unlabeled datasets. However, existing SSL methods for 3D medical imaging often rely on simple pretext tasks and on organ- or modality-specific datasets, which limits their generalizability. To overcome these limitations, the 3DINO framework adapts the DINOv2 SSL pipeline to 3D inputs and introduces a general-purpose Vision Transformer (ViT) pretrained on a large, diverse dataset.
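To make "adapting to 3D inputs" concrete, the sketch below shows how a ViT-style tokenizer might split a volume into non-overlapping cubic patches rather than square 2D ones. The volume size (96 x 96 x 96), the patch size (16), and the function names are illustrative assumptions, not values or code from the 3DINO work.

```python
from math import prod


def count_patch_tokens(volume_shape, patch_size):
    """Return how many non-overlapping patches tile the given shape."""
    if any(dim % patch_size != 0 for dim in volume_shape):
        raise ValueError("each dimension must be divisible by patch_size")
    return prod(dim // patch_size for dim in volume_shape)


def patch_origins(volume_shape, patch_size):
    """Yield the (z, y, x) corner index of every cubic patch in raster order."""
    depth, height, width = volume_shape
    for z in range(0, depth, patch_size):
        for y in range(0, height, patch_size):
            for x in range(0, width, patch_size):
                yield (z, y, x)


# Hypothetical sizes, chosen only to illustrate the 2D vs. 3D difference.
tokens_2d = count_patch_tokens((96, 96), 16)      # 6 * 6 patches
tokens_3d = count_patch_tokens((96, 96, 96), 16)  # 6 * 6 * 6 patches
```

With these assumed sizes, the 3D volume yields six times as many tokens as a single 2D slice of the same in-plane size, which is one reason 3D pretraining is more memory-intensive than its 2D counterpart.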
Data Highlights
Dataset Type | Number of Volumes
MRI | 70,434
CT | 27,815
Brain PET | 566
Total 3D Scans | ~100,000
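As a quick arithmetic check, the three modality counts above sum to 98,815, which is consistent with the rounded total of about 100,000. The short Python sketch below reproduces that sum and each modality's share of the dataset; the variable names are illustrative only.

```python
# Volume counts per modality, copied from the figures listed above.
volumes = {"MRI": 70_434, "CT": 27_815, "Brain PET": 566}

# Exact total behind the rounded "~100,000" figure.
total = sum(volumes.values())

# Fraction of the pretraining set contributed by each modality.
shares = {name: count / total for name, count in volumes.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Total: {total:,}")
```

MRI accounts for roughly 71% of the volumes, CT for about 28%, and brain PET for under 1%.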
Key Findings
3DINO-ViT, pretrained on a large multimodal dataset, outperforms six other initialization methods, including random initialization, pretrained Swin ViT models, and masked image modeling approaches.
The framework combines image-level and patch-level objectives with multiple augmentations per scan to extract salient features for both segmentation and classification tasks.
3DINO-ViT demonstrates strong generalizability to out-of-distribution organs and modalities, such as left atrium MRI and 3D breast ultrasound tumor segmentation.
Performance was validated on multiple benchmarks including BraTS (brain tumor MRI segmentation), BTCV (CT abdominal organ segmentation), brain age classification, and COVID-CT-MD lung CT classification.
The model incorporates a 3D ViT-Adapter module to inject spatial inductive biases, enhancing downstream segmentation accuracy.
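The combination of image-level and patch-level objectives mentioned above can be illustrated with a minimal sketch of a DINO-style loss. It assumes the usual teacher-student setup of the DINO family: a teacher and a student network each score every view against a set of prototypes, and the student is trained to match the teacher's sharpened distribution. The temperatures, the toy scores, the `patch_weight` parameter, and all function names are hypothetical; teacher centering and other regularizers are omitted, so this is not the authors' implementation.

```python
import math


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def log_softmax(scores, temperature):
    """Numerically stable log-softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def cross_entropy(teacher_scores, student_scores,
                  teacher_temp=0.04, student_temp=0.1):
    """Cross-entropy between the teacher and student distributions."""
    target = softmax(teacher_scores, teacher_temp)
    log_pred = log_softmax(student_scores, student_temp)
    return -sum(t * lp for t, lp in zip(target, log_pred))


def image_level_loss(teacher_views, student_views):
    """Average loss over pairs of *different* augmented views of one scan."""
    losses = [cross_entropy(t, s)
              for i, t in enumerate(teacher_views)
              for j, s in enumerate(student_views) if i != j]
    return sum(losses) / len(losses)


def patch_level_loss(teacher_patches, student_patches, masked_indices):
    """Average loss over patches that were masked in the student's input."""
    if not masked_indices:
        return 0.0
    losses = [cross_entropy(teacher_patches[k], student_patches[k])
              for k in masked_indices]
    return sum(losses) / len(losses)


def combined_loss(teacher_views, student_views,
                  teacher_patches, student_patches, masked_indices,
                  patch_weight=1.0):
    """Weighted sum of the image-level and patch-level terms."""
    return (image_level_loss(teacher_views, student_views)
            + patch_weight * patch_level_loss(
                teacher_patches, student_patches, masked_indices))


# Toy prototype scores for two augmented views (three prototypes each).
teacher_cls = [[2.0, 0.0, -1.0], [1.5, 0.2, -0.5]]
student_aligned = [[2.0, 0.0, -1.0], [1.5, 0.2, -0.5]]
student_misaligned = [[-1.0, 0.0, 2.0], [-0.5, 0.2, 1.5]]

# Toy prototype scores for three patches of one view; patches 0 and 2 masked.
patch_scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
masked = [0, 2]

loss_aligned = combined_loss(teacher_cls, student_aligned,
                             patch_scores, patch_scores, masked)
loss_misaligned = combined_loss(teacher_cls, student_misaligned,
                                patch_scores, patch_scores, masked)
```

In this toy setup the student whose scores agree with the teacher's receives a much lower loss than the misaligned one. The image-level term encourages scan-wide features that are consistent across augmentations, which suits classification, while the patch-level term encourages locally discriminative features, which suits segmentation.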
Clinical Implications
The availability of a general-purpose pretrained 3D model like 3DINO-ViT can significantly reduce the need for large labeled datasets in medical imaging, facilitating broader adoption of deep learning in clinical workflows. Its strong performance across multiple organs, modalities, and tasks suggests it can serve as a foundational model to accelerate development of diagnostic and prognostic tools in diverse clinical scenarios.
Conclusion
3DINO-ViT represents a scalable and versatile self-supervised learning approach that effectively leverages large unlabeled 3D medical imaging datasets to produce robust, generalizable representations. This advancement holds promise for improving accuracy and efficiency in a wide range of medical imaging applications.