An early evaluation of MedSigLIP in thyroid cytology: a comparative frozen-encoder benchmark against ImageNet-pretrained encoders - Report - MDSpire

By Mehmet Poyrazer, Rıdvan Erten

April 10, 2026

Benchmarking MedSigLIP vs ImageNet Models in Thyroid Cytology Classification

Overview

This study compares the domain-specific MedSigLIP encoder with general-purpose ImageNet-pretrained encoders, all used frozen, for classifying thyroid FNAB cytology. EfficientNet-B0 achieved the highest macro-F1, but MedSigLIP was better calibrated and had the highest sensitivity for the challenging Bethesda V category, which suggests a benefit for clinical triage.

Background

Thyroid nodules are common, with fine-needle aspiration biopsy (FNAB) cytology serving as the primary diagnostic tool to stratify malignancy risk. The Bethesda System classifies cytology into categories ranging from benign to malignant, but indeterminate categories, especially Bethesda V (Suspicious for Malignancy), pose diagnostic challenges due to subjective interpretation and interobserver variability. Deep learning models pretrained on natural images are commonly used for classification, but domain-specific medical encoders like MedSigLIP may offer improved performance and reliability in this specialized context.

Data Highlights

Model | Macro-F1 (mean ± SD) | Recall, Bethesda V | Expected Calibration Error (ECE)
EfficientNet-B0 | 0.845 ± 0.021 | Not specified | 0.044–0.082 (range for general encoders)
MedSigLIP | 0.836 ± 0.019 | 0.808 | 0.025
ResNet50 | 0.829 ± 0.015 | Not specified | 0.044–0.082 (range for general encoders)
ViT-Base | 0.817 ± 0.020 | Not specified | 0.044–0.082 (range for general encoders)
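For readers less familiar with the two classification metrics reported here, the following is a minimal sketch of how macro-F1 and per-class recall are conventionally computed. It is not the study's evaluation code: the function names, the three-class label set, and the toy predictions are all hypothetical and chosen only for illustration.

```python
def per_class_stats(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)}, computed one-vs-rest per class."""
    stats = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        stats[c] = (precision, recall, f1)
    return stats


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    stats = per_class_stats(y_true, y_pred, labels)
    return sum(f1 for _, _, f1 in stats.values()) / len(labels)


# Hypothetical labels and predictions (not data from the study).
labels = ["II", "V", "VI"]
y_true = ["II", "II", "II", "V", "V", "VI", "VI", "VI"]
y_pred = ["II", "II", "V", "V", "VI", "VI", "VI", "VI"]

stats = per_class_stats(y_true, y_pred, labels)
recall_v = stats["V"][1]          # sensitivity for the "V" class
score = macro_f1(y_true, y_pred, labels)
print(f"recall(V) = {recall_v:.3f}, macro-F1 = {score:.3f}")
```

Because macro-F1 averages classes without weighting by frequency, a model can score well on it while still missing many cases in one category, which is why the per-class recall for Bethesda V is reported separately.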

Key Findings

  • EfficientNet-B0 achieved the highest macro-F1 score (0.845), significantly outperforming ViT-Base but not MedSigLIP.
  • MedSigLIP showed the highest recall (0.808) for the challenging Bethesda V (Suspicious) category.
  • MedSigLIP had the best calibration with the lowest Expected Calibration Error (ECE = 0.025) compared to general-purpose encoders (ECE range 0.044–0.082).
  • After correction for multiple comparisons, the difference in overall classification performance (macro-F1) between MedSigLIP and EfficientNet-B0 was not statistically significant.
  • Model calibration and sensitivity for borderline cases are critical metrics beyond aggregate accuracy for clinical utility.
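Expected Calibration Error measures how far a model's stated confidence is from its observed accuracy. The sketch below shows the common equal-width binning formulation; the bin count of 10 and the toy inputs are assumptions, since the summary does not state which ECE variant or binning the study used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: top-class probability for each prediction, in [0, 1].
    correct:     True where the top-class prediction matched the label.
    n_bins:      number of bins (10 is a common default, assumed here).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bins are half-open [k/n_bins, (k+1)/n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical predictions: four at 0.95 confidence (three correct),
# two at 0.65 confidence (one correct).
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [True, True, True, False, True, False]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

In this toy case the model is overconfident in both bins, so the ECE is well above zero. A lower value, such as the 0.025 reported for MedSigLIP, means stated confidence tracks observed accuracy more closely.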

Clinical Implications

In thyroid cytology workflows, AI model selection should weigh not only accuracy but also calibration and sensitivity for indeterminate Bethesda V cases, so that overconfident misclassifications are less likely. Well-calibrated models such as MedSigLIP may improve triage decisions and make selective expert review practical, potentially improving patient management. Prospective validation in real-world clinical settings is needed to confirm these benefits.
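One way to picture selective expert review is a confidence threshold: predictions below it go to a cytopathologist, and the rest stay in the routine queue. The sketch below illustrates that idea only. The 0.80 threshold, the route names, the label set, and the toy cases are hypothetical and are not taken from the study.

```python
def triage(probs, labels, threshold=0.80):
    """Pick the top class and route low-confidence cases to expert review."""
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    route = "routine queue" if confidence >= threshold else "expert review"
    return labels[best], confidence, route


def selective_accuracy(cases, threshold):
    """Coverage and accuracy over cases kept at or above the threshold.

    cases: iterable of (confidence, predicted_label, true_label).
    """
    cases = list(cases)
    kept = [pred == truth for conf, pred, truth in cases if conf >= threshold]
    coverage = len(kept) / len(cases)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Hypothetical class probabilities for one slide, ordered as in `labels`.
labels = ["II", "V", "VI"]
decision = triage([0.10, 0.62, 0.28], labels)
print(decision)

# Hypothetical retrospective cases: (confidence, prediction, ground truth).
cases = [
    (0.97, "II", "II"),
    (0.91, "VI", "VI"),
    (0.62, "V", "VI"),
    (0.55, "II", "V"),
    (0.88, "V", "V"),
]
coverage, accuracy = selective_accuracy(cases, threshold=0.80)
print(f"coverage = {coverage:.2f}, accuracy on kept cases = {accuracy:.2f}")
```

A threshold like this is only meaningful when confidence is well calibrated, which is why calibration matters for triage: with an overconfident model, wrong predictions would clear the threshold and bypass review.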

Conclusion

MedSigLIP, an encoder pretrained on medical data, offers better calibration and higher sensitivity for suspicious (Bethesda V) thyroid cytology cases, with no statistically significant loss in overall performance relative to the best ImageNet-pretrained model. These attributes support its potential role in clinical decision support for thyroid nodule evaluation.

References

  1. Google Health AI Developer Foundations/2024 -- MedSigLIP: Medical Image-Text Pretrained Encoder
  2. ThyroidEffi 1.0 Dataset/2024 -- Benchmark Dataset for Thyroid Cytology Classification
