Benchmark evaluation of multi-modal large language models for ophthalmic diagnosis in real world - Report - MDSpire

By Shoujun Huang, Junhong Chen, Jiaoman Wang, Ping Zhang, Wending Du, Yuan Hong, Dexing Kong, Wei Lou, Mingying Lai, Weihua Yang

June 22, 2026

Clinical Report: Assessment of Multi-Modal Large Language Models for Ophthalmic Diagnosis

Overview

This study evaluates the diagnostic capabilities of nine multimodal large language models (MLLMs) using a benchmark dataset of 295 ophthalmic cases.

Background

The integration of multimodal large language models in ophthalmology has the potential to improve diagnostic accuracy by combining image analysis with clinical narratives. Despite recent advances, rigorous evaluation of these models in real-world clinical settings remains limited, even though such evaluation is critical for their adoption in practice. This study assesses the performance of leading MLLMs on clinically relevant ophthalmic tasks.

Data Highlights

Model          Diagnostic Accuracy   Consistency
HAIBU-ReMUD    Strong                High
ChatGPT-4o     Strong                High

Key Findings

  • Evaluation of nine MLLMs on a dataset of 295 ophthalmic cases.
  • Models achieved diagnostic accuracy comparable to human experts.
  • HAIBU-ReMUD and ChatGPT-4o showed particularly strong performance.
  • The study used pathologically confirmed cases from leading ophthalmology journals.
  • Performance differences among models were systematically assessed.

Clinical Implications

Further investigation is warranted to explore the practical applications of MLLMs in real-world settings.

Conclusion

Continued research is essential to understand the integration of MLLMs into clinical practice.
