Performance of large language models for ophthalmic literature retrieval

By Jai Paris, Oliver Kleinig, Ayushi Agarwal, Weng Onn Chan, Dinesh Selva
June 12, 2026

Eye

Objective:

To evaluate the reliability of large language models (LLMs) in retrieving ophthalmic literature.

Key Findings:

LLMs demonstrated low recall (0.16–0.41) but high precision (0.78–1.00).
ChatGPT Deep Research achieved the highest mean recall (0.41) and F1 score (0.56).
Performance varied by topic, with higher recall for rarer topics (0.29–0.76).
LLM searches did not identify additional studies beyond those found by manual search.
Hallucinated citations were essentially absent, with non-relevant articles better characterized as scope drift.
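
The recall, precision, and F1 figures above follow the standard retrieval definitions. The sketch below shows how they could be computed for a single topic. It assumes the manual search serves as the reference set, which the summary implies but does not state outright. The article identifiers and the function name `retrieval_metrics` are hypothetical and are not data from the study.

```python
def retrieval_metrics(retrieved, relevant):
    """Return (precision, recall, f1) for one search against a reference set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)

    # Precision: share of retrieved articles that are actually relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant articles that the search managed to retrieve.
    recall = true_positives / len(relevant) if relevant else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical identifiers, for illustration only.
manual_search = {"A", "B", "C", "D", "E", "F", "G", "H", "I", "J"}  # assumed reference set
llm_search = {"A", "B", "C", "D", "X"}  # "X" stands in for an out-of-scope article

precision, recall, f1 = retrieval_metrics(llm_search, manual_search)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this illustration the search returns few articles, most of them relevant, so precision is high while recall stays low, the same pattern the summary reports. A mean F1 averaged across topics will generally differ from the F1 computed from mean precision and mean recall.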

Interpretation:

Limitations:

The findings are limited by small sample sizes.
Restricted access to paywalled content may have contributed to the low recall.

Conclusion:
