Evaluating the Efficacy of Large Language Models and Prompt Optimization Techniques for Data Extraction in Systematic Reviews

By
Oami, Takehiko
Okada, Yohei
Maeda, Kenjiro
Nakada, Taka-aki
March 27, 2026

Frontiers in Digital Health

Overview

This study assessed the accuracy, consistency, and efficiency of three large language models (ChatGPT-4o, Claude 3 Sonnet, Gemini 1.5 Pro) in extracting data from clinical trials for systematic reviews. Claude 3 Sonnet demonstrated the highest accuracy for both background and outcome data extraction, while prompt optimization strategies had limited impact on performance.

Background

Systematic reviews require meticulous manual data extraction, which is labor-intensive and susceptible to human error. Large language models (LLMs) offer potential automation to streamline this process, but their reliability and reproducibility across different models and prompt techniques are not well established. This study focused on evaluating LLMs using clinical trial data from the Japanese Clinical Practice Guidelines for Sepsis and Septic Shock 2024. The aim was to compare extraction accuracy, consistency, and processing time across models and prompt strategies.

Data Highlights

Metric	ChatGPT-4o	Claude 3 Sonnet	Gemini 1.5 Pro
Background Data Extraction Accuracy (No-error %)	81.6%	92.4%	Not specified
Outcome Data Extraction Accuracy (No-error %)	Not specified	80.7%	27.8%
Inter-session Consistency Background Data	76.3%	Not specified	91.3%
Inter-session Consistency Outcome Data	44.8%	65.6%	Not specified
Processing Time Background Data (seconds)	29.2 - 39.7	Not specified	Not specified
Processing Time Outcome Data (seconds)	19.3 - 46.3	Not specified	Not specified
Processing Time with Self-Reflection Prompts Background Data (seconds)	59.0 - 97.7	Not specified	Not specified
Processing Time with Self-Reflection Prompts Outcome Data (seconds)	52.7 - 107.1	Not specified	Not specified
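
As a rough illustration of how figures like those in the table might be computed, the sketch below assumes that "no-error %" is the share of items whose extracted value exactly matches a human-verified reference, and that inter-session consistency is the share of items for which every repeated session returned the same value. Both definitions are assumptions made for illustration, as are the function names and the sample field names; the study's exact scoring rules are not reproduced in this summary.

```python
from typing import Dict, List


def no_error_rate(extracted: Dict[str, str], reference: Dict[str, str]) -> float:
    """Percentage of reference items whose extracted value matches exactly.

    Assumed definition: a missing or different value counts as an error.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    correct = sum(
        1 for key, value in reference.items() if extracted.get(key) == value
    )
    return 100.0 * correct / len(reference)


def inter_session_consistency(sessions: List[Dict[str, str]]) -> float:
    """Percentage of items for which all sessions returned the same value.

    Assumed definition: an item missing from any session counts as inconsistent
    unless it is missing from all of them.
    """
    if len(sessions) < 2:
        raise ValueError("need at least two sessions to measure consistency")
    keys = set().union(*(session.keys() for session in sessions))
    if not keys:
        raise ValueError("sessions contain no items")
    consistent = sum(
        1 for key in keys if len({session.get(key) for session in sessions}) == 1
    )
    return 100.0 * consistent / len(keys)


# Hypothetical data: three fields from one trial, extracted in two sessions.
reference = {"sample_size": "120", "mean_age": "64.2", "deaths_28d": "31"}
session_1 = {"sample_size": "120", "mean_age": "64.2", "deaths_28d": "13"}
session_2 = {"sample_size": "120", "mean_age": "64.2", "deaths_28d": "31"}

accuracy = no_error_rate(session_1, reference)
consistency = inter_session_consistency([session_1, session_2])
```

Note that the two metrics are independent: a model can return the same wrong value in every session, which would be consistent but not accurate.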

Key Findings

Claude 3 Sonnet achieved the highest accuracy for both background (92.4%) and outcome (80.7%) data extraction.
ChatGPT-4o showed moderate accuracy for background data (81.6%) but lower outcome data accuracy and lower inter-session consistency (76.3% for background data, 44.8% for outcome data).
Gemini 1.5 Pro had the lowest outcome data extraction accuracy (27.8%) but highest inter-session consistency for background data (91.3%).
Most extraction errors were due to missing or incorrect values; fabricated data were rare.
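Prompt engineering strategies, including chain-of-thought and self-reflection, improved accuracy only modestly but lengthened processing substantially: with self-reflection prompts, reported times rose from 29.2-39.7 to 59.0-97.7 seconds for background data and from 19.3-46.3 to 52.7-107.1 seconds for outcome data.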
Inter-session consistency was generally higher for background data extraction than for outcome data extraction across all models.
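
To make the two prompt strategies named above concrete, here is a minimal sketch of how they could be layered on a base extraction prompt. The prompt wording, the function names, and the `ask` callable (a stand-in for whichever model client is used) are all hypothetical; the prompts actually used in the study are not reproduced in this summary.

```python
from typing import Callable, List

# Hypothetical base prompt; not the wording used in the study.
BASE_PROMPT = (
    "Extract the following fields from the trial report: {fields}.\n\n"
    "Report:\n{report}"
)


def build_prompt(report: str, fields: List[str], chain_of_thought: bool = False) -> str:
    """Build the extraction prompt, optionally asking for step-by-step reasoning."""
    prompt = BASE_PROMPT.format(fields=", ".join(fields), report=report)
    if chain_of_thought:
        prompt += "\n\nReason step by step before giving the final values."
    return prompt


def extract(
    report: str,
    fields: List[str],
    ask: Callable[[str], str],
    chain_of_thought: bool = False,
    self_reflection: bool = False,
) -> str:
    """Run one extraction; `ask` sends a prompt to a model and returns its reply."""
    answer = ask(build_prompt(report, fields, chain_of_thought))
    if self_reflection:
        # Second round trip: the model reviews its own first answer.
        review_prompt = (
            "Check the extraction below against the report and return a "
            "corrected version.\n\n"
            f"Report:\n{report}\n\nExtraction:\n{answer}"
        )
        answer = ask(review_prompt)
    return answer


# Fake model that records prompts, so the number of calls can be inspected.
calls: List[str] = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return "sample_size: 120"


extract("Example report text.", ["sample_size"], fake_model)
calls_plain = len(calls)
extract("Example report text.", ["sample_size"], fake_model, self_reflection=True)
calls_reflect = len(calls) - calls_plain
```

In this sketch self-reflection costs a second model call per extraction, which is one plausible reason for longer processing times, though the summary does not say how the study implemented it.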

Clinical Implications

LLMs can effectively support background data extraction in systematic reviews, potentially reducing manual workload and errors. However, outcome data extraction remains less reliable, necessitating continued human oversight to ensure data accuracy. Clinicians and researchers should consider model selection carefully and be aware that prompt optimization may increase processing time without substantial accuracy gains.
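
One way human oversight could be prioritized is to run the extraction several times and send any field on which the sessions disagree to a reviewer first. The sketch below is an illustrative assumption, not a procedure described in the study; the `triage` function and the sample field names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def triage(
    sessions: List[Dict[str, Optional[str]]],
) -> Tuple[Dict[str, str], List[str]]:
    """Split fields into those all sessions agree on and those needing review.

    A field is 'agreed' only if every session returned the same non-missing
    value. Agreement does not prove correctness, so agreed values would still
    need spot checks by a human reviewer.
    """
    agreed: Dict[str, str] = {}
    needs_review: List[str] = []
    fields = sorted(set().union(*(session.keys() for session in sessions)))
    for field in fields:
        distinct = {session.get(field) for session in sessions}
        if len(distinct) == 1 and None not in distinct:
            agreed[field] = next(iter(distinct))
        else:
            needs_review.append(field)
    return agreed, needs_review


# Hypothetical outputs from three sessions on the same trial report.
runs = [
    {"sample_size": "120", "deaths_28d": "31"},
    {"sample_size": "120", "deaths_28d": "13"},
    {"sample_size": "120"},
]
agreed, needs_review = triage(runs)
```

Given the lower inter-session consistency reported for outcome data, a rule like this would be expected to flag outcome fields more often than background fields.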

Conclusion

While large language models show promise in automating background data extraction for systematic reviews, challenges persist in accurately extracting outcome data. Human validation remains essential to maintain data integrity in clinical guideline development.

References

Evaluating the Efficacy of Large Language Models and Prompt Optimization Techniques for Data Extraction in Systematic Reviews
