Benchmarking MedSigLIP vs ImageNet Models in Thyroid Cytology Classification
Overview
This study compares the domain-specific MedSigLIP encoder with general ImageNet-pretrained models for thyroid FNAB cytology classification. EfficientNet-B0 achieved the highest macro-F1 (0.845), but MedSigLIP showed better calibration (ECE = 0.025) and the highest recall (0.808) for the challenging Bethesda V category, suggesting benefits in clinical triage.
Background
Thyroid nodules are common, with fine-needle aspiration biopsy (FNAB) cytology serving as the primary diagnostic tool to stratify malignancy risk. The Bethesda System classifies cytology into categories ranging from benign to malignant, but indeterminate categories, especially Bethesda V (Suspicious for Malignancy), pose diagnostic challenges due to subjective interpretation and interobserver variability. Deep learning models pretrained on natural images are commonly used for classification, but domain-specific medical encoders like MedSigLIP may offer improved performance and reliability in this specialized context.
Data Highlights
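| Model           | Macro-F1 (mean ± SD) | Recall, Bethesda V | Expected Calibration Error (ECE) |
|-----------------|----------------------|--------------------|----------------------------------|
| EfficientNet-B0 | 0.845 ± 0.021        | Not specified      | 0.044–0.082 (range)              |
| MedSigLIP       | 0.836 ± 0.019        | 0.808              | 0.025                            |
| ResNet50        | 0.829 ± 0.015        | Not specified      | 0.044–0.082 (range)              |
| ViT-Base        | 0.817 ± 0.020        | Not specified      | 0.044–0.082 (range)              |

Note: ECE for the three ImageNet-pretrained models is reported only as a range across the general-purpose encoders, not per model.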
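For readers unfamiliar with the two classification metrics reported above, the following is a minimal sketch of how macro-F1 and per-class recall are computed from true and predicted labels. The label names and the toy predictions are hypothetical and are not taken from the study; the function names are illustrative only.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    return tp, fp, fn


def recall(y_true, y_pred, label):
    """Recall (sensitivity) for one class: TP / (TP + FN)."""
    tp, _, fn = per_class_counts(y_true, y_pred)
    denom = tp[label] + fn[label]
    return tp[label] / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: three Bethesda categories, six smears.
labels = ["II", "V", "VI"]
y_true = ["II", "II", "V", "V", "VI", "VI"]
y_pred = ["II", "V", "V", "V", "VI", "V"]

print(round(macro_f1(y_true, y_pred, labels), 3))
print(recall(y_true, y_pred, "V"))
```

Because macro-F1 averages over classes without weighting by class size, a model can rank first on macro-F1 while still missing more cases of one specific category, which is why recall for Bethesda V is reported separately.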
Key Findings
EfficientNet-B0 achieved the highest macro-F1 score (0.845); the difference was statistically significant against ViT-Base but not against MedSigLIP.
MedSigLIP showed the highest recall (0.808) for the challenging Bethesda V (Suspicious) category.
MedSigLIP had the best calibration with the lowest Expected Calibration Error (ECE = 0.025) compared to general-purpose encoders (ECE range 0.044–0.082).
No statistically significant difference in overall classification accuracy was found between MedSigLIP and EfficientNet after correction for multiple comparisons.
Model calibration and sensitivity for borderline cases are critical metrics beyond aggregate accuracy for clinical utility.
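The calibration figures above can be made concrete with a minimal sketch of Expected Calibration Error. It assumes the common formulation: group predictions into equal-width confidence bins, then take the sample-weighted average of the gap between accuracy and mean confidence in each bin. The source does not state the number of bins or the binning scheme used, so the 10 equal-width bins here are an assumption, and the toy inputs are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (n_bin / N) * |accuracy_bin - mean_confidence_bin|.

    confidences: top-class probability for each prediction, in [0, 1].
    correct:     1 if the prediction matched the true label, else 0.
    n_bins:      number of equal-width bins (assumed; not given in the source).
    """
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one prediction is required")

    # Each bin tracks [sample count, summed confidence, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += ok

    ece = 0.0
    for count, conf_sum, hits in bins:
        if count:
            ece += (count / total) * abs(hits / count - conf_sum / count)
    return ece


# Hypothetical toy data: four predictions at 0.9 confidence (three correct)
# and two at 0.6 confidence (one correct).
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [1, 1, 1, 0, 1, 0]
print(round(expected_calibration_error(confidences, correct), 4))
```

In this toy case the model is overconfident in both bins (90% confidence with 75% accuracy, 60% confidence with 50% accuracy). A lower ECE, such as the 0.025 reported for MedSigLIP, means the stated confidence tracks the observed accuracy more closely.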
Clinical Implications
In thyroid cytology workflows, AI model selection should weigh not only accuracy but also calibration and sensitivity for indeterminate Bethesda V cases, to reduce overconfident misclassification. Well-calibrated models like MedSigLIP may improve triage decisions and enable selective expert review, in which low-confidence predictions are routed to a cytopathologist, potentially improving patient management. Prospective validation in real-world clinical settings is needed to confirm these benefits.
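One way selective expert review could work in practice is sketched below: predictions whose confidence falls under a threshold are deferred to a human reader, and the rest are accepted automatically. This is an illustration only, not a procedure described in the study; the threshold of 0.8, the case identifiers, and the function name are all hypothetical.

```python
def triage(predictions, threshold=0.8):
    """Split cases into auto-accepted and expert-review lists.

    predictions: iterable of (case_id, predicted_label, confidence) tuples.
    threshold:   minimum confidence for automatic acceptance (hypothetical value).
    """
    auto_accept, expert_review = [], []
    for case_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_accept.append(case_id)
        else:
            expert_review.append(case_id)
    return auto_accept, expert_review


# Hypothetical model outputs for three smears.
cases = [
    ("case-a", "II", 0.95),
    ("case-b", "V", 0.55),
    ("case-c", "VI", 0.80),
]
auto_accept, expert_review = triage(cases)
print(auto_accept, expert_review)
```

A rule like this is only as trustworthy as the confidence scores behind it. If a model is poorly calibrated, a confidence of 0.8 does not correspond to roughly 80% accuracy, and the threshold no longer separates reliable from unreliable predictions.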
Conclusion
MedSigLIP, a domain-specific pretrained medical encoder, offers better calibration and higher sensitivity for suspicious thyroid cytology cases, with no statistically significant loss in overall performance relative to the best ImageNet-pretrained model. These attributes support its potential role in clinical decision support for thyroid nodule evaluation.
References
Google Health AI Developer Foundations/2024 -- MedSigLIP: Medical Image-Text Pretrained Encoder
ThyroidEffi 1.0 Dataset/2024 -- Benchmark Dataset for Thyroid Cytology Classification