Clinical Report: Evaluation of Large Language Models in Retrieving Ophthalmic Literature
Overview
This study evaluates the performance of large language models (LLMs) in retrieving ophthalmic literature, revealing low recall but high precision across various models. ChatGPT Deep Research achieved the highest mean recall and F1 score among the evaluated models.
Background
The use of LLMs in clinical literature searches is growing, yet their reliability, particularly in specialized fields such as ophthalmology, remains underexplored. Understanding how well LLMs retrieve relevant studies is important before clinicians rely on them for such searches.
Data Highlights
Model                   Recall      Precision   F1 Score
ChatGPT Deep Research   0.41        1.00        0.56
GPT Auto                0.41        1.00        0.56
Claude Sonnet 4.6       0.16–0.41   0.78–1.00   0.25–0.56
Gemini 3 Pro            0.16–0.41   0.78–1.00   0.25–0.56
Grok 4.1                0.16–0.41   0.78–1.00   0.25–0.56
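For readers less familiar with these metrics, the following is a minimal sketch of how recall, precision, and F1 score could be computed for a single topic. It assumes the manual search serves as the reference standard; the function name `retrieval_metrics` and the article identifiers are hypothetical and are not taken from the study.

```python
def retrieval_metrics(retrieved, relevant):
    """Return (precision, recall, f1) for one search topic.

    retrieved: articles returned by the LLM search
    relevant:  articles in the reference standard (assumed: manual search)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)

    # Precision: share of returned articles that are relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant articles that were returned.
    recall = true_positives / len(relevant) if relevant else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical identifiers, for illustration only (not data from the study).
manual_search = {"A", "B", "C", "D", "E"}
llm_search = {"A", "B"}

p, r, f1 = retrieval_metrics(llm_search, manual_search)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=1.00 recall=0.40 f1=0.57
```

In this illustration every returned article is relevant (precision 1.00), yet three of the five reference articles are missed (recall 0.40), which mirrors the high-precision, low-recall pattern reported above.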
Key Findings
LLMs demonstrated low recall (0.16–0.41) but high precision (0.78–1.00).
ChatGPT Deep Research achieved the highest mean recall (0.41) and F1 score (0.56).
Precision was high for every model and reached 1.00 at the top of the range, though it was not uniformly perfect (0.78–1.00).
Performance varied by topic, with higher recall for rarer topics (0.29–0.76).
LLM searches did not identify additional studies beyond those found by manual search.
Hallucinated citations were essentially absent, with non-relevant articles primarily reflecting scope drift.
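A note on reading the mean scores: the F1 score computed directly from a mean precision of 1.00 and a mean recall of 0.41 is about 0.58, slightly above the reported 0.56. One possible explanation, offered here only as an assumption because the report does not state its averaging method, is that F1 was calculated per topic and then averaged. The sketch below illustrates that effect with hypothetical per-topic values chosen only so that their mean recall is 0.41; they are not data from the study.

```python
from statistics import mean


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# F1 computed once, from the reported mean precision and mean recall.
f1_from_means = f1_score(1.00, 0.41)
print(round(f1_from_means, 2))  # 0.58

# Hypothetical per-topic values (illustration only, not study data).
topic_precision = [1.00, 1.00, 1.00]
topic_recall = [0.20, 0.40, 0.63]  # mean recall is 0.41

# F1 computed per topic, then averaged across topics.
per_topic_f1 = [f1_score(p, r) for p, r in zip(topic_precision, topic_recall)]
mean_of_f1 = mean(per_topic_f1)
print(round(mean(topic_recall), 2))  # 0.41
print(round(mean_of_f1, 2))          # 0.56
```

Because the harmonic mean penalizes low values, averaging per-topic F1 scores gives a result no higher than the F1 of the averaged precision and recall when precision is fixed, so a small gap of this kind is expected when recall varies by topic.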
Clinical Implications
Clinicians can use LLMs for rapid literature scoping with confidence in the relevance of the generated lists, though many relevant studies may be missed. Caution is advised when relying on LLMs for comprehensive literature searches, such as those required for systematic reviews or guideline development.
Conclusion
The findings suggest that while LLMs can assist in quickly orienting clinicians within the literature, they are not sufficient for exhaustive study identification. Continued evaluation of LLM performance is necessary as these tools become more integrated into clinical practice.