Benchmark evaluation of multi-modal large language models for ophthalmic diagnosis in real world
By Shoujun Huang, Junhong Chen, Jiaoman Wang, Ping Zhang, Wending Du, Yuan Hong, Dexing Kong, Wei Lou, Mingying Lai, Weihua Yang
June 22, 2026
Clinical Scorecard: Assessment of Multi-Modal Large Language Models for Ophthalmic Diagnosis in Real-World Settings
At a Glance
| Category | Detail |
| --- | --- |
| Condition | Ophthalmic Diagnosis |
| Key Mechanisms | Integration of image-based pattern recognition with textual clinical context. |
| Target Population | Patients with ophthalmic conditions requiring diagnostic evaluation. |
| Care Setting | Real-world clinical settings. |
Key Highlights
- Evaluation of nine leading multimodal large language models (MLLMs) on a benchmark dataset of 295 ophthalmic cases.
- Models such as HAIBU-ReMUD and ChatGPT-4o showed strong diagnostic accuracy.
- Focus on multimodal information integration and natural language reasoning.
- Dataset includes cases from peer-reviewed ophthalmology journals.
- Study addresses the gap in real-world performance evaluation of MLLMs.
Guideline-Based Recommendations
Diagnosis
- Utilize multimodal large language models for diagnostic classification in ophthalmology.
Management
- Incorporate MLLMs into clinical workflows for enhanced diagnostic support.
Monitoring & Follow-up
- Assess the performance of MLLMs in ongoing clinical applications.
Risks
- Consider limitations of MLLMs in specialized medical contexts.
Patient & Prescribing Data
- Patients with diverse ophthalmic conditions.
- MLLMs may assist in improving diagnostic accuracy and consistency.
Clinical Best Practices
- Employ a standardized assessment protocol for evaluating MLLM performance.
- Ensure high-quality clinical images and narratives in diagnostic cases.
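One way a standardized assessment protocol might be scored is sketched below. This is a minimal illustration, not the study's method: the record layout, the `normalize` and `accuracy_by_model` helpers, the exact-match scoring rule, and the toy data are all assumptions introduced here. The study itself may have used different scoring criteria.

```python
from collections import defaultdict


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(label.lower().split())


def accuracy_by_model(records):
    """Return per-model diagnostic accuracy.

    records: iterable of (model_name, predicted_diagnosis, reference_diagnosis).
    A prediction counts as correct only on an exact match after normalization
    (an assumed, deliberately strict rule).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, predicted, reference in records:
        total[model] += 1
        if normalize(predicted) == normalize(reference):
            correct[model] += 1
    return {model: correct[model] / total[model] for model in total}


# Hypothetical toy data; model names and diagnoses are placeholders,
# not results from the benchmark.
records = [
    ("model_a", "Diabetic retinopathy", "diabetic  retinopathy"),
    ("model_a", "Glaucoma", "Cataract"),
    ("model_b", "Cataract", "cataract"),
    ("model_b", "Glaucoma", "glaucoma"),
]
scores = accuracy_by_model(records)
```

Applying the same normalization and scoring rule to every model keeps the comparison consistent across models. In practice, exact matching is likely too strict for free-text diagnoses, so a clinician-reviewed mapping of synonyms would probably be needed.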