Benchmark evaluation of multi-modal large language models for ophthalmic diagnosis in real world - Summary - MDSpire

By Shoujun Huang, Junhong Chen, Jiaoman Wang, Ping Zhang, Wending Du, Yuan Hong, Dexing Kong, Wei Lou, Mingying Lai, and Weihua Yang

June 22, 2026

Objective:

To evaluate the diagnostic capabilities of multi-modal large language models (MLLMs) in ophthalmology using a curated benchmark dataset.

Approach:

Key Findings:

  • Models such as HAIBU-ReMUD and ChatGPT-4o achieved strong diagnostic accuracy and consistency.
  • Performance of some models approached that of human experts in specific settings.

Interpretation:

Limitations:

  • The evaluation focused on models not primarily optimized for ophthalmology.
  • Performance may vary across different clinical contexts and specific ophthalmic tasks.

Conclusion:

The study provides a foundation for further exploration of MLLMs in ophthalmic diagnosis.
