Performance of large language models for ophthalmic literature retrieval - Summary - MDSpire

Performance of large language models for ophthalmic literature retrieval

  • By Jai Paris, Oliver Kleinig, Ayushi Agarwal, Weng Onn Chan, Dinesh Selva

  • June 12, 2026

Objective:

To evaluate the reliability of large language models (LLMs) in retrieving ophthalmic literature.

Key Findings:
  • LLMs demonstrated low recall (0.16–0.41) but high precision (0.78–1.00).
  • ChatGPT Deep Research achieved the highest mean recall (0.41) and F1 score (0.56).
  • Performance varied by topic, with higher recall for rarer topics (0.29–0.76).
  • LLM searches did not identify additional studies beyond those found by manual search.
  • Hallucinated citations were essentially absent, with non-relevant articles better characterized as scope drift.
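
The summary reports recall, precision, and F1 without saying how they were computed. Assuming the standard set-based definitions, with the manual search treated as the reference set, a minimal sketch might look like the following. The function name and the article identifiers are illustrative assumptions, not taken from the study.

```python
def retrieval_metrics(retrieved, relevant):
    """Return (precision, recall, f1) for one search.

    Assumes standard set-based definitions; the study's exact
    procedure is not described in this summary.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)

    # Precision: share of retrieved articles that are relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant articles that were retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0

    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical article identifiers, not data from the study.
manual_search = {"A", "B", "C", "D", "E", "F", "G", "H"}
llm_search = {"A", "B", "C", "X"}  # "X" stands in for an out-of-scope article

precision, recall, f1 = retrieval_metrics(llm_search, manual_search)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy case most of what the search returns is relevant, but most of the reference set is missed, which is the same pattern of high precision and low recall that the findings describe.
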
Interpretation:

Limitations:
  • Limited by small sample sizes.
  • Restricted access to paywalled content may contribute to low recall.
Conclusion:
