Evaluating the Efficacy of Large Language Models and Prompt Optimization Techniques for Data Extraction in Systematic Reviews

By Oami, Takehiko; Okada, Yohei; Maeda, Kenjiro; Nakada, Taka-aki
March 27, 2026

Frontiers In Digital Health

Clinical Scorecard: Evaluating the Efficacy of Large Language Models and Prompt Optimization Techniques for Data Extraction in Systematic Reviews

At a Glance

Category	Detail
Condition	Systematic review data extraction
Key Mechanisms	Use of large language models (LLMs) with prompt engineering to automate extraction of background characteristics and clinical outcomes from trial reports
Target Population	Clinical trials addressing sepsis and septic shock management (Japanese Clinical Practice Guidelines 2024)
Care Setting	Systematic review and guideline development settings

Key Highlights

LLMs extracted background data with high accuracy (81.6% to 92.4%) but were less accurate and far more variable on outcome data (27.8% to 80.7%).
Prompt engineering strategies (chain-of-thought, self-reflection) yielded only modest improvements in extraction accuracy.
Processing times increased substantially with self-reflection prompts, indicating a trade-off between accuracy and efficiency.

Guideline-Based Recommendations

Diagnosis

Use LLMs to assist with background data extraction in systematic reviews to improve efficiency.

Management

Maintain human oversight for outcome data extraction due to lower LLM accuracy and potential errors.
Select an LLM for each extraction task based on its measured performance on that task, since accuracy varies between models.

Monitoring & Follow-up

Assess inter-session consistency of LLM outputs to ensure reproducibility.
Review extracted data for missing or incorrect values, which are the most common error types.
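
The two monitoring steps above can be sketched as follows. The functions, field names, and session outputs are hypothetical, and the agreement measure (share of sessions matching the most common value) is an assumption chosen for simplicity, not the consistency metric used in the study.

```python
from collections import Counter


def session_agreement(sessions):
    """For each field, return the fraction of sessions that gave the most common value.

    A field missing from a session is recorded as None, so omissions lower agreement.
    """
    fields = set().union(*(s.keys() for s in sessions))
    agreement = {}
    for field in sorted(fields):
        values = [s.get(field) for s in sessions]
        top_count = Counter(values).most_common(1)[0][1]
        agreement[field] = top_count / len(sessions)
    return agreement


def classify_error(extracted_value, reference_value):
    """Label one extracted value as 'missing', 'incorrect', or 'correct'."""
    if extracted_value is None or str(extracted_value).strip() == "":
        return "missing"
    if str(extracted_value).strip().lower() != str(reference_value).strip().lower():
        return "incorrect"
    return "correct"


# Hypothetical outputs from three repeated runs on the same trial report.
sessions = [
    {"sample_size": "120", "mortality_28d": "30/60"},
    {"sample_size": "120", "mortality_28d": "31/60"},
    {"sample_size": "120"},  # outcome omitted in this run
]
agreement = session_agreement(sessions)
print(agreement)
print(classify_error(None, "30/60"), classify_error("31/60", "30/60"))
```

Fields with low agreement across sessions are natural candidates for manual checking, even before a reference value is available.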

Risks

Potential for missing or incorrect data extraction by LLMs, especially for clinical outcomes.
Longer processing times with advanced prompt strategies, particularly self-reflection, may slow the review workflow.

Patient & Prescribing Data

Not applicable (focus on data extraction from clinical trial reports)

LLMs can support systematic review processes but require human validation to ensure data accuracy.

Clinical Best Practices

Utilize LLMs primarily for background data extraction to enhance review efficiency.
Apply human review to outcome data extracted by LLMs to mitigate errors.
Consider model-specific performance differences when integrating LLMs into systematic review workflows.
Balance prompt complexity with processing time to optimize accuracy and efficiency.
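
One way to put these practices into a workflow is to route every LLM-extracted field to a human review queue by category. The sketch below is an assumption about how such routing might look, not a procedure described in the source: outcome fields and fields of unknown category always get full review, while background fields are sampled for spot checks at a hypothetical rate (`spot_check_every`).

```python
def build_review_queue(extracted, categories, spot_check_every=5):
    """Split extracted fields into full human review and background spot checks."""
    full_review, spot_check = [], []
    background_seen = 0
    for field in extracted:
        if categories.get(field) != "background":
            # Outcome fields, and any field without a known category,
            # are always verified by a human reviewer.
            full_review.append(field)
        else:
            background_seen += 1
            # Sample every Nth background field for a spot check.
            if background_seen % spot_check_every == 0:
                spot_check.append(field)
    return {"full_review": full_review, "spot_check": spot_check}


# Hypothetical extraction result for one trial report.
extracted = {"sample_size": "120", "mean_age": "64.2",
             "mortality_28d": "30/60", "unlabelled_field": "n/a"}
categories = {"sample_size": "background", "mean_age": "background",
              "mortality_28d": "outcome"}

queue = build_review_queue(extracted, categories, spot_check_every=2)
print(queue)
```

Defaulting unknown fields to full review errs on the side of human oversight; the spot-check rate would need tuning against the error rates observed in a given review.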

References

Japanese Clinical Practice Guidelines for the Management of Sepsis and Septic Shock 2024 (J-SSCG 2024)
