Performance of large language models for ophthalmic literature retrieval
By Jai Paris, Oliver Kleinig, Ayushi Agarwal, Weng Onn Chan, and Dinesh Selva
June 12, 2026
Clinical Scorecard: Evaluation of Large Language Models in Retrieving Ophthalmic Literature
At a Glance
Category Detail
Condition
Key Mechanisms: Evaluation of large language models (LLMs) for literature search and retrieval.
Target Population
Care Setting
Key Highlights
LLMs demonstrated low recall (0.16–0.41) but high precision (0.78–1.00). ChatGPT Deep Research achieved the highest mean recall (0.41) and F1 score (0.56). Performance varied by topic, with higher recall for rarer topics (0.29–0.76). LLM searches did not identify additional studies beyond manual searches. Hallucinated citations were essentially absent.
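The scorecard reports recall, precision, and F1 without stating how they were computed. As a rough sketch, assuming the standard definitions and treating the manual search as the reference set, the metrics for a single topic could be calculated as below. The function name `retrieval_metrics` and the study identifiers are hypothetical and are not taken from the study.

```python
def retrieval_metrics(retrieved, reference):
    """Return (precision, recall, f1) for one search topic.

    retrieved: study IDs returned by the LLM (hypothetical input).
    reference: study IDs found by the manual search, assumed here
               to be the reference standard.
    """
    retrieved = set(retrieved)
    reference = set(reference)
    true_positives = len(retrieved & reference)

    # Precision: share of LLM-returned studies that are in the reference set.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of reference studies that the LLM returned.
    recall = true_positives / len(reference) if reference else 0.0
    # F1: harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative data only: 10 reference studies, of which the LLM returns 4.
manual_search = [f"study-{i}" for i in range(1, 11)]
llm_hits = ["study-1", "study-2", "study-3", "study-4"]

precision, recall, f1 = retrieval_metrics(llm_hits, manual_search)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

In this made-up case every returned study is relevant but most relevant studies are missed, which mirrors the high-precision, low-recall pattern described above. Note that a mean F1 averaged across topics need not equal the F1 computed from mean precision and mean recall.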
Guideline-Based Recommendations
Diagnosis
Management
Monitoring & Follow-up
Risks
Patient & Prescribing Data
Not specified
LLMs are high-precision, low-recall tools best suited to rapid literature scoping.
Clinical Best Practices
Use LLMs for quick orientation within an unfamiliar evidence base. Do not rely on them for comprehensive study identification, given their low recall (0.16–0.41).