Evaluating the Efficacy of Large Language Models and Prompt Optimization Techniques for Data Extraction in Systematic Reviews - Summary - MDSpire

By Oami, Takehiko; Okada, Yohei; Maeda, Kenjiro; Nakada, Taka-aki

March 27, 2026

Objective:

To evaluate the performance of large language models (LLMs) for data extraction in systematic reviews and assess the impact of different prompt strategies.

Key Findings:
  • Mean no-error proportions for background data extraction ranged from 81.6% to 92.4%.
  • Mean no-error proportions for outcome data extraction ranged from 27.8% to 80.7%.
  • Most errors were due to missing or incorrect values; fabricated outputs were rare.
  • Prompt engineering had modest effects on extraction accuracy.
  • Inter-session consistency ranged from 76.3% to 91.3% for background data and from 44.8% to 65.6% for outcome data.
  • Processing times for background extraction ranged from 29.2 to 39.7 seconds, and for outcome extraction from 19.3 to 46.3 seconds.
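
The summary does not define how the no-error proportion or inter-session consistency were calculated. As a rough illustration only, the sketch below assumes both are simple item-level fractions: the share of extracted items that exactly match a human reference, and the share of items on which two separate LLM sessions return the same value. The function names and the sample values are hypothetical and are not taken from the study.

```python
from typing import Any, Sequence


def _matching_fraction(a: Sequence[Any], b: Sequence[Any]) -> float:
    """Fraction of positions at which two equal-length sequences hold equal values."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def no_error_proportion(extracted: Sequence[Any], reference: Sequence[Any]) -> float:
    """Assumed definition: share of extracted items that match the human reference."""
    return _matching_fraction(extracted, reference)


def inter_session_consistency(session_a: Sequence[Any], session_b: Sequence[Any]) -> float:
    """Assumed definition: share of items on which two sessions agree with each other."""
    return _matching_fraction(session_a, session_b)


# Hypothetical items for one study: design, sample size, condition, outcome value.
reference = ["RCT", 120, "sepsis", 0.31]
session_1 = ["RCT", 120, "sepsis", None]  # outcome value missing
session_2 = ["RCT", 118, "sepsis", None]  # sample size wrong, outcome value missing

print(no_error_proportion(session_1, reference))        # 0.75
print(inter_session_consistency(session_1, session_2))  # 0.75
```

Under these assumed definitions, two sessions can agree with each other on a wrong or missing value, so consistency and accuracy are separate measurements; the study's actual scoring rules may differ.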
Interpretation:

LLMs can effectively assist in background data extraction for systematic reviews, but challenges remain in outcome data extraction, highlighting the need for human oversight.

Limitations:
  • Performance varied significantly across different LLMs.
  • Prompt engineering strategies showed only modest improvements in accuracy.
  • The study focused on a limited number of clinical questions.
Conclusion:

While LLMs demonstrate reliability in background data extraction, outcome data extraction requires further refinement and human involvement.
