Benchmark evaluation of multi-modal large language models for ophthalmic diagnosis in real world

By
Shoujun Huang
Junhong Chen
Jiaoman Wang
Ping Zhang
Wending Du
Yuan Hong
Dexing Kong
Wei Lou
Mingying Lai
Weihua Yang
June 22, 2026

Frontiers In Medicine

Overview

This study evaluates the diagnostic capabilities of nine multimodal large language models (MLLMs) using a benchmark dataset of 295 ophthalmic cases.

Background

Integrating multimodal large language models into ophthalmology has the potential to enhance diagnostic accuracy by combining image analysis with clinical narratives. Despite recent advances, rigorous evaluation of these models in real-world clinical settings remains limited, even though such evaluation is critical for their adoption in practice. This study aims to assess the performance of leading MLLMs on clinically relevant ophthalmic tasks.

Data Highlights

Model	Diagnostic Accuracy	Consistency
HAIBU-ReMUD	Strong	High
ChatGPT-4o	Strong	High

Key Findings

Nine MLLMs were evaluated on a dataset of 295 ophthalmic cases.
Models achieved diagnostic accuracy comparable to that of human experts.
HAIBU-ReMUD and ChatGPT-4o showed particularly strong performance.
The study used pathologically confirmed cases from leading ophthalmology journals.
Performance differences among models were systematically assessed.

Clinical Implications

Further investigation is warranted to explore the practical applications of MLLMs in real-world settings.

Conclusion

Continued research is essential to understand the integration of MLLMs into clinical practice.

Related Resources & Content

Original Source(s)

Benchmark evaluation of multi-modal large language models for ophthalmic diagnosis in real world
