Performance of large language models for ophthalmic literature retrieval

By
Jai Paris
Oliver Kleinig
Ayushi Agarwal
Weng Onn Chan
Dinesh Selva
June 12, 2026

Overview

This study evaluates how well large language models (LLMs) retrieve ophthalmic literature and finds low recall but high precision across the models tested. ChatGPT Deep Research achieved the highest mean recall and F1 score among the evaluated models.

Background

The use of LLMs in clinical literature searches is growing, yet their reliability, particularly in specialized fields like ophthalmology, remains underexplored. Clinicians who rely on these tools need to know how completely and how accurately they retrieve relevant studies.

Data Highlights

Model	Recall	Precision	F1 Score
ChatGPT Deep Research	0.41	1.00	0.56
GPT Auto	0.41	1.00	0.56
Claude Sonnet 4.6	0.16–0.41	0.78–1.00	0.25–0.56
Gemini 3 Pro	0.16–0.41	0.78–1.00	0.25–0.56
Grok 4.1	0.16–0.41	0.78–1.00	0.25–0.56
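
The summary does not spell out how the three metrics were calculated. As a minimal sketch, assuming the standard set-based definitions with the manual search serving as the reference standard, they could be computed for a single topic as follows. The function name and the article identifiers are hypothetical and are not taken from the study.

```python
def retrieval_metrics(retrieved, reference):
    """Return (precision, recall, f1) for one topic.

    retrieved: article identifiers returned by the LLM (hypothetical input)
    reference: article identifiers found by the manual search (assumed reference standard)
    """
    retrieved, reference = set(retrieved), set(reference)
    true_positives = len(retrieved & reference)

    # Precision: share of LLM-returned articles that are relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant articles that the LLM returned.
    recall = true_positives / len(reference) if reference else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical identifiers, for illustration only.
manual_search = {"A", "B", "C", "D", "E"}
llm_output = {"A", "B"}

precision, recall, f1 = retrieval_metrics(llm_output, manual_search)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case every returned article is relevant, so precision is perfect, yet three of the five relevant articles are missed. That is the same pattern the table reports: high precision alongside low recall.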

Key Findings

LLMs demonstrated low recall (0.16–0.41) but high precision (0.78–1.00).
ChatGPT Deep Research achieved the highest mean recall (0.41) and F1 score (0.56).
ChatGPT Deep Research and GPT Auto showed perfect precision (1.00) across all topics.
Performance varied by topic, with higher recall for rarer topics (0.29–0.76).
LLM searches did not identify additional studies beyond those found by manual search.
Hallucinated citations were essentially absent, with non-relevant articles primarily reflecting scope drift.
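
A note on the arithmetic: a precision of 1.00 combined with a recall of 0.41 gives a harmonic mean of about 0.58, not the reported F1 of 0.56. One possible reading, which the summary does not confirm, is that F1 was computed per topic and then averaged. The sketch below uses hypothetical per-topic recall values (not study data) to show that a mean of per-topic F1 scores falls below the F1 computed from mean precision and mean recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# F1 implied directly by the reported means for ChatGPT Deep Research.
f1_from_reported_means = f1_score(1.00, 0.41)

# Hypothetical per-topic recall values with perfect precision on every topic.
# These numbers are illustrative assumptions, not results from the study.
topic_recalls = [0.2, 0.4, 0.6]
mean_recall = sum(topic_recalls) / len(topic_recalls)

# Option 1: average the per-topic F1 scores.
mean_of_topic_f1 = sum(f1_score(1.0, r) for r in topic_recalls) / len(topic_recalls)
# Option 2: compute one F1 from mean precision and mean recall.
f1_of_means = f1_score(1.0, mean_recall)

print(f"F1 from reported means: {f1_from_reported_means:.2f}")
print(f"mean of per-topic F1:   {mean_of_topic_f1:.3f}")
print(f"F1 of mean recall:      {f1_of_means:.3f}")
```

When recall varies between topics, the two averaging orders give different results, so a reported mean F1 should be read together with a statement of how it was aggregated.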

Clinical Implications

Clinicians can use LLMs for rapid literature scoping with confidence that the articles returned are relevant, but should expect many relevant studies to be missed. Caution is advised when relying on LLMs for comprehensive literature searches in systematic reviews or guideline development.

Conclusion

The findings suggest that while LLMs can assist in quickly orienting clinicians within the literature, they are not sufficient for exhaustive study identification. Continued evaluation of LLM performance is necessary as these tools become more integrated into clinical practice.