Performance of large language models for ophthalmic literature retrieval
By Jai Paris, Oliver Kleinig, Ayushi Agarwal, Weng Onn Chan, Dinesh Selva
June 12, 2026
Objective:
To evaluate the reliability of large language models (LLMs) in retrieving ophthalmic literature.
Key Findings:
- LLMs demonstrated low recall (0.16–0.41) but high precision (0.78–1.00).
- ChatGPT Deep Research achieved the highest mean recall (0.41) and F1 score (0.56).
- Performance varied by topic, with higher recall for rarer topics (0.29–0.76).
- LLM searches did not identify additional studies beyond those found by manual search.
- Hallucinated citations were essentially absent, with non-relevant articles better characterized as scope drift.
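For readers less familiar with the metrics reported above, the following is a minimal sketch of how precision, recall, and F1 are conventionally computed for a single search, treating the manual search as the reference set. The function name, article identifiers, and counts are hypothetical illustrations and are not taken from the study.

```python
def retrieval_metrics(retrieved, relevant):
    """Return (precision, recall, f1) for one search.

    retrieved: article IDs returned by the LLM (hypothetical input)
    relevant:  article IDs in the reference set, e.g. a manual search
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)

    # Precision: share of returned articles that are in the reference set.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of reference articles that the search returned.
    recall = true_positives / len(relevant) if relevant else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: 10 reference articles, 5 returned, 4 of them relevant.
reference_set = [f"a{i}" for i in range(1, 11)]
llm_results = ["a1", "a2", "a3", "a4", "x1"]  # "x1" is out of scope

precision, recall, f1 = retrieval_metrics(llm_results, reference_set)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this made-up example the search returns 5 articles, 4 of which are among the 10 in the reference set, giving a precision of 0.80, a recall of 0.40, and an F1 of about 0.53. This is the same high-precision, low-recall pattern described in the findings.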
Interpretation:
Limitations:
- Limited by small sample sizes.
- The LLMs' restricted access to paywalled content may have contributed to the low recall.
Conclusion: