A generalizable 3D framework and model for self-supervised learning in medical imaging

  • By Tony Xu, Sepehr Hosseini, Chris Anderson, Anthony Rinaldi, Rahul G. Krishnan, Anne L. Martel, Maged Goubran
  • November 7, 2025

3DINO-ViT: A Versatile Self-Supervised 3D Model for Medical Imaging

Overview

3DINO-ViT is a general-purpose 3D Vision Transformer pretrained with the 3DINO self-supervised learning framework on nearly 100,000 multimodal 3D medical scans covering more than 10 organs. It outperforms state-of-the-art pretrained models on diverse downstream segmentation and classification tasks, demonstrating strong generalizability and scalability.

Background

Deep learning has shown promise in medical imaging tasks such as detection, diagnosis, and risk profiling, but it requires large labeled datasets, which are costly and time-consuming to obtain, especially for 3D modalities. Self-supervised learning (SSL) reduces this dependence by learning from unlabeled data. However, existing SSL methods for 3D medical imaging often rely on simple pretext tasks and on organ- or modality-specific datasets, which limits their generalizability. To overcome these limitations, the 3DINO framework adapts the DINOv2 SSL pipeline to 3D inputs and is used to pretrain a general-purpose Vision Transformer (ViT) on a large, diverse dataset.
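One concrete consequence of moving a ViT pipeline from 2D to 3D is that the input is split into cubic patches rather than square ones. The sketch below illustrates that tokenization step only. The crop size (96 voxels per side), the patch size (16), and the helper name `patch_origins` are illustrative assumptions, not values or code taken from this summary or from the 3DINO repository.

```python
from itertools import product


def patch_origins(volume_shape, patch_size):
    """Return the (z, y, x) corner of every non-overlapping cubic patch.

    Hypothetical helper for illustration; not part of 3DINO.
    """
    if any(dim % patch_size for dim in volume_shape):
        raise ValueError("each volume dimension must be divisible by patch_size")
    axes = [range(0, dim, patch_size) for dim in volume_shape]
    return list(product(*axes))


# Assumed example sizes: a 96^3 crop split into 16^3 patches.
origins = patch_origins((96, 96, 96), 16)
num_tokens = len(origins)      # 6 * 6 * 6 patches
voxels_per_token = 16 ** 3     # voxels flattened into each token

print(num_tokens, voxels_per_token)  # 216 4096
```

For comparison, a 2D patch of the same side length holds 16 × 16 = 256 pixels, so each 3D token carries 16 times more raw values under these assumed sizes.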

Data Highlights

Dataset type      Number of volumes
MRI               70,434
CT                27,815
Brain PET         566
Total 3D scans    ~100,000
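As a quick consistency check, the per-modality counts listed above can be summed and compared against the rounded total. This is a small sketch using only the figures in the table; the variable names are arbitrary.

```python
# Volume counts per modality, copied from the table above.
volumes = {"MRI": 70_434, "CT": 27_815, "Brain PET": 566}

total = sum(volumes.values())
shares = {name: count / total for name, count in volumes.items()}

print(total)  # 98815
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The exact sum is 98,815, which matches the "nearly 100,000" figure quoted in the overview. MRI accounts for roughly 71% of the volumes.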

Key Findings

  • 3DINO-ViT, pretrained on a large multimodal dataset, outperforms six other initialization methods, including random initialization, pretrained Swin ViT models, and masked image modeling approaches.
  • The framework combines image-level and patch-level objectives with multiple augmentations per scan to extract salient features for both segmentation and classification tasks.
  • 3DINO-ViT demonstrates strong generalizability to out-of-distribution organs and modalities, such as left atrium MRI and 3D breast ultrasound tumor segmentation.
  • Performance was validated on multiple benchmarks including BraTS (brain tumor MRI segmentation), BTCV (CT abdominal organ segmentation), brain age classification, and COVID-CT-MD lung CT classification.
  • The model incorporates a 3D ViT-Adapter module to inject spatial inductive biases, enhancing downstream segmentation accuracy.
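The combination of image-level and patch-level objectives mentioned in the findings can be sketched as a student-teacher distillation loss in the style of DINO (class tokens) and iBOT (masked patch tokens), the two objectives that DINOv2 builds on. This is a simplified illustration, not the 3DINO implementation: the function names are hypothetical, the temperatures (0.1 for the student, 0.04 for the teacher) are commonly used defaults assumed here, and teacher centering is omitted for brevity.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of one logit vector."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Mean cross-entropy H(teacher, student) over a list of token logits."""
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        teacher_probs = [math.exp(v) for v in log_softmax(t, teacher_temp)]
        student_log_probs = log_softmax(s, student_temp)
        total -= sum(p * q for p, q in zip(teacher_probs, student_log_probs))
    return total / len(student_logits)


def combined_loss(student_cls, teacher_cls,
                  student_patches, teacher_patches, mask):
    """Image-level loss on class tokens plus patch-level loss on masked patches."""
    image_loss = distillation_loss(student_cls, teacher_cls)
    masked = [(s, t) for s, t, m in zip(student_patches, teacher_patches, mask) if m]
    if not masked:
        return image_loss
    patch_loss = distillation_loss([s for s, _ in masked], [t for _, t in masked])
    return image_loss + patch_loss


# Toy logits over 3 prototypes: the loss is low when student and teacher agree.
agree = distillation_loss([[2.0, 0.0, 0.0]], [[2.0, 0.0, 0.0]])
disagree = distillation_loss([[0.0, 2.0, 0.0]], [[2.0, 0.0, 0.0]])
print(agree < disagree)  # True
```

In this sketch the image-level term compares class-token outputs for different augmented views of the same scan, while the patch-level term compares only the patch tokens that were masked in the student's input. Equal weighting of the two terms is an assumption made for simplicity.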

Clinical Implications

The availability of a general-purpose pretrained 3D model like 3DINO-ViT can significantly reduce the need for large labeled datasets in medical imaging, facilitating broader adoption of deep learning in clinical workflows. Its strong performance across multiple organs, modalities, and tasks suggests it can serve as a foundational model to accelerate development of diagnostic and prognostic tools in diverse clinical scenarios.

Conclusion

3DINO-ViT represents a scalable and versatile self-supervised learning approach that effectively leverages large unlabeled 3D medical imaging datasets to produce robust, generalizable representations. This advancement holds promise for improving accuracy and efficiency in a wide range of medical imaging applications.

References

  1. AICONSlab/3DINO -- 3DINO GitHub Repository
  2. Tang et al. 2023 -- Sliding Window Swin ViT for 3D Medical Imaging
  3. BraTS Challenge 2021 -- Brain Tumor Segmentation Benchmark
  4. BTCV Challenge -- Beyond the Cranial Vault Abdominal Organ Segmentation
  5. DINOv2 SSL Pipeline -- Self-Supervised Learning Method
