Performance of large language models for ophthalmic literature retrieval - Scorecard - MDSpire

By Jai Paris, Oliver Kleinig, Ayushi Agarwal, Weng Onn Chan, Dinesh Selva

June 12, 2026

Clinical Scorecard: Evaluation of Large Language Models in Retrieving Ophthalmic Literature

At a Glance

Category: Detail
Condition:
Key Mechanisms: Evaluation of large language models (LLMs) for literature search and retrieval.
Target Population:
Care Setting:

Key Highlights

  • LLMs demonstrated low recall (0.16–0.41) but high precision (0.78–1.00).
  • ChatGPT Deep Research achieved the highest mean recall (0.41) and F1 score (0.56).
  • Performance varied by topic, with higher recall for rarer topics (0.29–0.76).
  • LLM searches did not identify additional studies beyond manual searches.
  • Hallucinated citations were essentially absent.
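
The recall, precision, and F1 figures above are standard retrieval metrics, with the manual search treated as the reference set. As a minimal sketch of how such scores could be computed per topic and then averaged, assuming each search result is reduced to a set of study identifiers: the function name `retrieval_metrics`, the topic names, and the identifiers below are all hypothetical, and the data are invented for illustration, not taken from the study.

```python
from statistics import mean


def retrieval_metrics(retrieved, relevant):
    """Return (precision, recall, f1) for one topic.

    retrieved: identifiers returned by the LLM search (hypothetical input).
    relevant:  identifiers found by the manual reference search.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Convention assumed here: an empty set gives a score of 0.0 rather than an error.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented example data: two topics, each with a manual reference set and an LLM result set.
topics = {
    "topic_a": {
        "manual": {"s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10"},
        "llm": {"s1", "s2", "s3", "x1"},  # "x1" is not in the manual set
    },
    "topic_b": {
        "manual": {"t1", "t2", "t3", "t4"},
        "llm": {"t1", "t2", "t3"},
    },
}

per_topic = {
    name: retrieval_metrics(data["llm"], data["manual"])
    for name, data in topics.items()
}

# Mean across topics (macro average): each topic counts equally.
mean_precision = mean(p for p, _, _ in per_topic.values())
mean_recall = mean(r for _, r, _ in per_topic.values())
mean_f1 = mean(f for _, _, f in per_topic.values())

for name, (p, r, f) in per_topic.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"mean: precision={mean_precision:.2f} recall={mean_recall:.2f} f1={mean_f1:.2f}")
```

Note that a mean F1 computed this way is the average of per-topic F1 scores, which generally differs from the F1 of the mean precision and mean recall. The summary above does not state which averaging the study used.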

Guideline-Based Recommendations

Diagnosis

Management

Monitoring & Follow-up

Risks

Patient & Prescribing Data

Not specified

LLMs are high-precision but low-recall tools, best suited to rapid literature scoping.

Clinical Best Practices

  • Use LLMs for quick orientation within unfamiliar evidence bases.
  • Do not rely on LLMs alone for comprehensive study identification; their recall is low.
