Benchmark evaluation of multi-modal large language models for ophthalmic diagnosis in real world - Takeaways - MDSpire

By Shoujun Huang, Junhong Chen, Jiaoman Wang, Ping Zhang, Wending Du, Yuan Hong, Dexing Kong, Wei Lou, Mingying Lai, Weihua Yang

June 22, 2026

  1. The study evaluated nine leading multimodal large language models (MLLMs) for their diagnostic capabilities in ophthalmology using a curated dataset.

  2. A benchmark dataset of 295 pathologically confirmed ophthalmic cases was created, integrating clinical narratives and medical images.

  3. Models such as HAIBU-ReMUD and ChatGPT-4o demonstrated strong diagnostic accuracy, with performance nearing that of human experts.

  4. The evaluation focused on open-ended clinical question answering, multimodal information integration, and natural language reasoning.

  5. The dataset covered diverse ophthalmic cases across a broad range of subspecialties; cases with insufficient data were excluded.
