Evaluating the Efficacy of Large Language Models and Prompt Optimization Techniques for Data Extraction in Systematic Reviews - Scorecard - MDSpire

By Oami, Takehiko; Okada, Yohei; Maeda, Kenjiro; Nakada, Taka-aki

March 27, 2026

Clinical Scorecard: Evaluating the Efficacy of Large Language Models and Prompt Optimization Techniques for Data Extraction in Systematic Reviews

At a Glance

Category | Detail
Condition | Systematic review data extraction
Key Mechanisms | Use of large language models (LLMs) with prompt engineering to automate extraction of background characteristics and clinical outcomes from trial reports
Target Population | Clinical trials addressing sepsis and septic shock management (Japanese Clinical Practice Guidelines 2024)
Care Setting | Systematic review and guideline development settings

Key Highlights

  • LLMs demonstrated high accuracy (81.6%-92.4%) for background data extraction but lower accuracy (27.8%-80.7%) for outcome data extraction.
  • Prompt engineering strategies (chain-of-thought, self-reflection) yielded only modest improvements in extraction accuracy.
  • Processing times increased substantially with self-reflection prompts, indicating a trade-off between accuracy and efficiency.
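
The accuracy figures above are proportions of correctly extracted items. A minimal sketch of how such a score could be computed is shown below, assuming each extraction is a flat field-to-value mapping compared against a human-curated reference by exact match after light normalization. The field names, the sample values, and the matching rule are illustrative assumptions, not the study's actual protocol.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def normalize(value: Optional[str]) -> Optional[str]:
    """Lowercase and collapse whitespace so formatting differences are not scored as errors."""
    if value is None:
        return None
    cleaned = " ".join(value.split()).lower()
    return cleaned or None  # treat empty strings as missing


def extraction_accuracy(extracted: Record, reference: Record) -> float:
    """Fraction of reference fields whose extracted value matches after normalization."""
    if not reference:
        raise ValueError("reference record must contain at least one field")
    correct = sum(
        1
        for field, truth in reference.items()
        if normalize(extracted.get(field)) == normalize(truth)
    )
    return correct / len(reference)


# Hypothetical example data (not taken from the study).
reference = {"sample_size": "120", "country": "Japan", "mortality_28d": "30/60"}
extracted = {"sample_size": "120", "country": " japan ", "mortality_28d": None}

print(f"{extraction_accuracy(extracted, reference):.3f}")  # prints 0.667
```

Exact matching is strict for outcome data such as event counts; a real evaluation might need field-specific comparison rules.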

Guideline-Based Recommendations

Diagnosis

  • Use LLMs to assist with background data extraction in systematic reviews to improve efficiency.

Management

  • Maintain human oversight for outcome data extraction due to lower LLM accuracy and potential errors.
  • Select the LLM for each extraction task based on its measured accuracy for that task, since performance varies across models.
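
As a rough illustration of task-specific model selection, the sketch below picks the highest-scoring model for each extraction task from a table of measured accuracies. The model names and accuracy values are placeholders invented for the example; they are not results from the study.

```python
from typing import Dict

# Placeholder accuracies for illustration only; NOT results from the study.
accuracy_by_model: Dict[str, Dict[str, float]] = {
    "model_a": {"background": 0.90, "outcome": 0.55},
    "model_b": {"background": 0.85, "outcome": 0.70},
}


def best_model_per_task(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return, for each task, the model with the highest measured accuracy."""
    tasks = {task for per_task in scores.values() for task in per_task}
    return {
        task: max(
            (model for model in scores if task in scores[model]),
            key=lambda model: scores[model][task],
        )
        for task in sorted(tasks)
    }


print(best_model_per_task(accuracy_by_model))
# prints {'background': 'model_a', 'outcome': 'model_b'}
```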

Monitoring & Follow-up

  • Assess inter-session consistency of LLM outputs to ensure reproducibility.
  • Review extracted data for missing or incorrect values, which are the most common error types.
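
Both monitoring steps can be approximated with simple bookkeeping. The sketch below tallies missing versus incorrect values against a reference and reports raw pairwise agreement between repeated sessions. Raw agreement is an assumption made here for simplicity (it is not chance-corrected, and the source does not state which consistency metric was used), and all field names and values are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def classify_field(extracted: Optional[str], truth: str) -> str:
    """Label one extracted value as 'correct', 'missing', or 'incorrect'."""
    if extracted is None or extracted.strip() == "":
        return "missing"
    if extracted.strip().lower() == truth.strip().lower():
        return "correct"
    return "incorrect"


def error_counts(extracted: Record, reference: Dict[str, str]) -> Dict[str, int]:
    """Count correct, missing, and incorrect fields for one extraction."""
    counts = {"correct": 0, "missing": 0, "incorrect": 0}
    for field, truth in reference.items():
        counts[classify_field(extracted.get(field), truth)] += 1
    return counts


def pairwise_agreement(sessions: List[Record]) -> float:
    """Mean fraction of fields with identical values, averaged over all session pairs."""
    if len(sessions) < 2:
        raise ValueError("need at least two sessions to measure consistency")
    fields = sorted(set().union(*(s.keys() for s in sessions)))
    if not fields:
        raise ValueError("sessions contain no fields")
    rates = [
        sum(1 for f in fields if a.get(f) == b.get(f)) / len(fields)
        for a, b in combinations(sessions, 2)
    ]
    return sum(rates) / len(rates)


# Hypothetical example data (not taken from the study).
reference = {"n": "120", "deaths": "30", "country": "Japan"}
print(error_counts({"n": "120", "deaths": "31"}, reference))
# prints {'correct': 1, 'missing': 1, 'incorrect': 1}

sessions = [
    {"n": "120", "deaths": "30"},
    {"n": "120", "deaths": "31"},
    {"n": "120", "deaths": "30"},
]
print(f"{pairwise_agreement(sessions):.3f}")  # prints 0.667
```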

Risks

  • Potential for missing or incorrect data extraction by LLMs, especially for clinical outcomes.
  • Increased processing time with advanced prompt strategies may impact workflow efficiency.

Patient & Prescribing Data

Not applicable (focus on data extraction from clinical trial reports)

LLMs can support systematic review processes but require human validation to ensure data accuracy.

Clinical Best Practices

  • Utilize LLMs primarily for background data extraction to enhance review efficiency.
  • Apply human review to outcome data extracted by LLMs to mitigate errors.
  • Consider model-specific performance differences when integrating LLMs into systematic review workflows.
  • Balance prompt complexity with processing time to optimize accuracy and efficiency.
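
One way to weigh prompt complexity against processing time is to time each prompting strategy on the same report. The harness below is a sketch under stated assumptions: the prompt wording, the `zero_shot` baseline, and the `call_model` callable are hypothetical, and `fake_model` is a stand-in that would be replaced by a real LLM client call.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical prompt templates; the study's actual prompts are not reproduced here.
PROMPTS: Dict[str, str] = {
    "zero_shot": "Extract the requested fields from the trial report.\n\n{report}",
    "chain_of_thought": (
        "Reason step by step about where each field appears, "
        "then extract the requested fields.\n\n{report}"
    ),
    "self_reflection": (
        "Extract the requested fields, then re-check each value against "
        "the text and correct any errors.\n\n{report}"
    ),
}


def time_strategies(
    report: str, call_model: Callable[[str], str]
) -> Dict[str, Tuple[str, float]]:
    """Run each prompt strategy once and record (output, elapsed seconds)."""
    results: Dict[str, Tuple[str, float]] = {}
    for name, template in PROMPTS.items():
        prompt = template.format(report=report)
        start = time.perf_counter()
        output = call_model(prompt)
        results[name] = (output, time.perf_counter() - start)
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call; it only reports the prompt length."""
    return f"received {len(prompt)} characters"


for name, (output, seconds) in time_strategies("Example report text.", fake_model).items():
    print(f"{name}: {seconds:.6f} s, {output}")
```

Pairing these timings with an accuracy score per strategy shows whether a slower prompt actually earns its extra processing time.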
