Grounded report generation for enhancing ophthalmic ultrasound interpretation using Vision-Language Segmentation models

By
Kai Jin
Qixuan Sun
Daohuan Kang
Ziyao Luo
Tao Yu
Wenzheng Han
Yi Zhang
Meng Wang
Danli Shi
Andrzej Grzybowski
January 3, 2026

npj Digital Medicine

Overview

This study presents a Vision-Language Segmentation (VLS) model that couples a vision-language model with the Segment Anything Model (SAM) to generate diagnostic reports and lesion annotations from ophthalmic ultrasound images. Developed and tested on large datasets from multiple hospitals, the model outperformed baseline vision-language models in report generation, and AI-assisted reporting improved diagnostic accuracy and reduced reporting time relative to traditional reporting.

Background

Ophthalmic ultrasound is essential for diagnosing and managing various eye conditions, including retinal diseases and ocular tumors. However, interpreting these images is time-consuming and requires specialized expertise, and growing data volumes add to that burden. Traditional AI models have improved image classification but cannot generate detailed, interpretable reports. Recent advances in Vision-Language Models (VLMs) and segmentation techniques offer promising avenues to enhance diagnostic precision and report generation in ophthalmology.

Data Highlights

Dataset	Patients	Images	Reports	Mean Age (years)	Male (%)
Training	5,497	37,917	12,649	~49.5	47.4
Validation	1,915	12,639	4,197	~49.7	47.4
Test	1,919	12,640	4,170	~49.6	47.4
External Test Set 1 (FAHWM)	269	742	269	50.8	40.1
External Test Set 2 (FAHZC)	70	160	70	57.4	45.7
Total	9,670	64,098	21,355
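
The split sizes in the table can be checked with a few lines of arithmetic. The sketch below re-adds the per-split counts and computes each internal split's share of images; the dictionary layout and variable names are illustrative assumptions and are not taken from the study.

```python
# Counts copied from the table: (patients, images, reports) per split.
splits = {
    "Training": (5497, 37917, 12649),
    "Validation": (1915, 12639, 4197),
    "Test": (1919, 12640, 4170),
    "External Test Set 1 (FAHWM)": (269, 742, 269),
    "External Test Set 2 (FAHZC)": (70, 160, 70),
}

# Column-wise sums should reproduce the "Total" row.
totals = tuple(sum(column) for column in zip(*splits.values()))

# Share of internal images held by each internal split
# (the "internal" grouping is an assumption made for this illustration).
internal = ["Training", "Validation", "Test"]
internal_images = sum(splits[name][1] for name in internal)
image_share = {name: splits[name][1] / internal_images for name in internal}

print("Totals (patients, images, reports):", totals)
for name, share in image_share.items():
    print(f"{name}: {share:.1%} of internal images")
```

The sums match the Total row (9,670 patients, 64,098 images, 21,355 reports), and by image count the three internal splits come out at roughly 60/20/20.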

Key Findings

The integrated VLS model combining Vision-Language Models and SAM achieved superior performance in generating detailed ophthalmic ultrasound reports compared to baseline VL models.
AI-assisted reporting significantly improved diagnostic accuracy and reduced the time required for report generation.
The model effectively annotated lesions on images, enhancing interpretability and clinical utility.
Clinical evaluation by senior and junior ophthalmologists confirmed the model's effectiveness in real-world diagnostic settings.
The approach demonstrated scalability and potential applicability beyond ophthalmology to other medical imaging domains.
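
One way to picture a "grounded" report is as a list of findings, each tied to the image region that supports it. The sketch below is a minimal illustration under that assumption only: the class `GroundedFinding`, the helper `mask_bounding_box`, and the toy mask and placeholder sentence are all hypothetical and do not reflect the study's actual data format, model, or outputs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Mask = List[List[int]]  # binary mask: 1 = lesion pixel, 0 = background


def mask_bounding_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (row_min, col_min, row_max, col_max) of the 1-pixels, or None if empty."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)


@dataclass
class GroundedFinding:
    """Hypothetical pairing of one report sentence with its supporting mask."""
    sentence: str
    mask: Mask

    @property
    def box(self) -> Optional[Tuple[int, int, int, int]]:
        return mask_bounding_box(self.mask)

    @property
    def area_fraction(self) -> float:
        # Fraction of image pixels covered by the mask.
        total = sum(len(row) for row in self.mask)
        return sum(map(sum, self.mask)) / total if total else 0.0


# Toy 4x4 mask with a 2x2 "lesion" in the centre (made-up data).
toy_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
finding = GroundedFinding("Placeholder finding text.", toy_mask)
print(finding.box, finding.area_fraction)
```

Keeping the sentence and the mask in one record is what lets a reader trace each statement in the report back to a location on the image.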

Clinical Implications

The VLS model, which pairs a vision-language model with SAM, offers a practical tool to augment clinician workflow in ophthalmic ultrasound analysis by providing accurate, interpretable reports and lesion annotations. Such assistance can reduce diagnostic workload and improve decision-making efficiency. Adoption of AI-assisted systems of this kind may enhance patient care by enabling timely and precise diagnosis across diverse clinical settings.

Conclusion

This study demonstrates that combining advanced vision-language segmentation models with precise lesion annotation significantly advances ophthalmic ultrasound interpretation. The approach holds promise for broader application in medical imaging diagnostics, facilitating improved clinical outcomes through AI-augmented reporting.

