Performance of large language models for ophthalmic literature retrieval - Report - MDSpire

By Jai Paris, Oliver Kleinig, Ayushi Agarwal, Weng Onn Chan, Dinesh Selva

June 12, 2026

Clinical Report: Evaluation of Large Language Models in Retrieving Ophthalmic Literature

Overview

This study evaluates the performance of large language models (LLMs) in retrieving ophthalmic literature, revealing low recall but high precision across various models. ChatGPT Deep Research achieved the highest mean recall and F1 score among the evaluated models.

Background

LLMs are increasingly used for clinical literature searches, yet their reliability in specialized fields such as ophthalmology remains underexplored. Understanding how completely and how accurately LLMs retrieve relevant studies is important before clinicians rely on them for evidence searches.

Data Highlights

Model                    Recall       Precision    F1 Score
ChatGPT Deep Research    0.41         1.00         0.56
GPT Auto                 0.41         1.00         0.56
Claude Sonnet 4.6        0.16–0.41    0.78–1.00    0.25–0.56
Gemini 3 Pro             0.16–0.41    0.78–1.00    0.25–0.56
Grok 4.1                 0.16–0.41    0.78–1.00    0.25–0.56
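
For readers less familiar with the metrics in the table, the following is a minimal sketch of how recall, precision, and F1 are conventionally computed when an LLM's retrieved list is compared against a manually curated reference set. The function name `retrieval_metrics` and the article identifiers are hypothetical; they illustrate the standard definitions and are not taken from the study.

```python
def retrieval_metrics(retrieved, reference):
    """Return (recall, precision, f1) for a retrieved list against a reference set."""
    retrieved, reference = set(retrieved), set(reference)
    # Articles the LLM returned that also appear in the manual reference set.
    true_positives = len(retrieved & reference)
    # Recall: share of the relevant studies that the LLM found.
    recall = true_positives / len(reference) if reference else 0.0
    # Precision: share of the LLM's results that are relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


# Hypothetical example: manual search finds 10 relevant studies;
# the LLM returns 4 articles, all of them relevant.
reference = {f"study-{i}" for i in range(1, 11)}
retrieved = ["study-1", "study-2", "study-3", "study-4"]

recall, precision, f1 = retrieval_metrics(retrieved, reference)
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

In this made-up case every returned article is relevant (precision 1.00), yet most of the relevant literature is missed (recall 0.40), which is the same low-recall, high-precision pattern the table reports.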

Key Findings

  • Across models, recall was low (0.16–0.41) while precision was high (0.78–1.00).
  • ChatGPT Deep Research achieved the highest mean recall (0.41) and F1 score (0.56).
  • Precision reached 1.00 for the best-performing models but was not perfect for every model and topic (range 0.78–1.00).
  • Performance varied by topic, with higher recall for rarer topics (0.29–0.76).
  • LLM searches did not identify additional studies beyond those found by manual search.
  • Hallucinated citations were essentially absent, with non-relevant articles primarily reflecting scope drift.
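
One detail worth noting about the reported means: when metrics are averaged across topics, the mean F1 need not equal the F1 computed from the mean recall and mean precision. The sketch below illustrates this with hypothetical per-topic values. It assumes per-topic (macro) averaging, which this summary does not confirm, and the numbers are not data from the study.

```python
from statistics import mean


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical per-topic (precision, recall) pairs; NOT data from the study.
per_topic = [(1.0, 0.20), (1.0, 0.35), (1.0, 0.70)]

mean_precision = mean(p for p, _ in per_topic)
mean_recall = mean(r for _, r in per_topic)

# Macro-averaged F1: compute F1 per topic, then average.
mean_f1 = mean(f1_score(p, r) for p, r in per_topic)

# F1 computed once from the averaged precision and recall.
f1_of_means = f1_score(mean_precision, mean_recall)

print(f"mean recall={mean_recall:.2f}")
print(f"mean of per-topic F1={mean_f1:.2f}")
print(f"F1 of mean precision/recall={f1_of_means:.2f}")
```

When precision is fixed at 1.00, F1 is a concave function of recall, so the mean of per-topic F1 scores cannot exceed the F1 of the mean recall. That may explain why a mean recall of 0.41 with precision 1.00 appears alongside a mean F1 of 0.56 rather than the roughly 0.58 obtained by plugging the two means into the F1 formula.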

Clinical Implications

Clinicians can use LLMs to scope the literature rapidly, with reasonable confidence that the returned articles are relevant, but should expect many relevant studies to be missed. Caution is advised when relying on LLMs for comprehensive searches, such as those supporting systematic reviews or guideline development.

Conclusion

The findings suggest that while LLMs can assist in quickly orienting clinicians within the literature, they are not sufficient for exhaustive study identification. Continued evaluation of LLM performance is necessary as these tools become more integrated into clinical practice.
