Evaluating the Efficacy of Large Language Models and Prompt Optimization Techniques for Data Extraction in Systematic Reviews - Report - MDSpire

  • By Oami, Takehiko; Okada, Yohei; Maeda, Kenjiro; Nakada, Taka-aki

  • March 27, 2026

Evaluating Large Language Models for Data Extraction in Systematic Reviews

Overview

This study assessed the accuracy, consistency, and efficiency of three large language models (ChatGPT-4o, Claude 3 Sonnet, Gemini 1.5 Pro) in extracting data from clinical trials for systematic reviews. Claude 3 Sonnet demonstrated the highest accuracy for both background and outcome data extraction, while prompt optimization strategies had limited impact on performance.

Background

Systematic reviews require meticulous manual data extraction, which is labor-intensive and susceptible to human error. Large language models (LLMs) offer potential automation to streamline this process, but their reliability and reproducibility across different models and prompt techniques are not well established. This study focused on evaluating LLMs using clinical trial data from the Japanese Clinical Practice Guidelines for Sepsis and Septic Shock 2024. The aim was to compare extraction accuracy, consistency, and processing time across models and prompt strategies.

Data Highlights

| Metric | ChatGPT-4o | Claude 3 Sonnet | Gemini 1.5 Pro |
| --- | --- | --- | --- |
| Background data extraction accuracy (no-error %) | 81.6% | 92.4% | Not specified |
| Outcome data extraction accuracy (no-error %) | Not specified | 80.7% | 27.8% |
| Inter-session consistency, background data | 76.3% | Not specified | 91.3% |
| Inter-session consistency, outcome data | 44.8% | 65.6% | Not specified |
| Processing time, background data (seconds) | 29.2 - 39.7 | Not specified | Not specified |
| Processing time, outcome data (seconds) | 19.3 - 46.3 | Not specified | Not specified |
| Processing time with self-reflection prompts, background data (seconds) | 59.0 - 97.7 | Not specified | Not specified |
| Processing time with self-reflection prompts, outcome data (seconds) | 52.7 - 107.1 | Not specified | Not specified |
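
As a rough worked example, the processing-time ranges above can be compared to estimate the overhead of self-reflection prompts. The sketch below simply divides the endpoints of each range. That pairing is an assumption made for illustration, not a statistic reported by the study, and the names `baseline`, `self_reflection`, and `slowdown` are hypothetical. On these figures, self-reflection takes roughly two to three times as long.

```python
# Processing-time ranges in seconds, copied from the table above.
baseline = {"background": (29.2, 39.7), "outcome": (19.3, 46.3)}
self_reflection = {"background": (59.0, 97.7), "outcome": (52.7, 107.1)}


def slowdown(base, reflected):
    """Return (ratio of lower bounds, ratio of upper bounds).

    Assumption: comparing range endpoints is only a crude proxy for the
    true per-run overhead, which the summary does not report.
    """
    return (reflected[0] / base[0], reflected[1] / base[1])


ratios = {kind: slowdown(baseline[kind], self_reflection[kind]) for kind in baseline}

for kind, (lower, upper) in ratios.items():
    print(f"{kind}: lower bounds x{lower:.2f}, upper bounds x{upper:.2f}")
```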

Key Findings

  • Claude 3 Sonnet achieved the highest accuracy for both background (92.4%) and outcome (80.7%) data extraction.
  • ChatGPT-4o showed moderate accuracy for background data (81.6%) but lower inter-session consistency (76.3% for background data, 44.8% for outcome data) and lower outcome data accuracy.
  • Gemini 1.5 Pro had the lowest outcome data extraction accuracy (27.8%) but highest inter-session consistency for background data (91.3%).
  • Most extraction errors were due to missing or incorrect values; fabricated data were rare.
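  • Prompt engineering strategies, including chain-of-thought and self-reflection, improved accuracy only modestly while increasing processing time substantially: with self-reflection prompts, reported times rose from 29.2 - 39.7 to 59.0 - 97.7 seconds for background data and from 19.3 - 46.3 to 52.7 - 107.1 seconds for outcome data.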
  • Inter-session consistency was generally higher for background data extraction than for outcome data extraction across all models.
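
To make the two headline metrics concrete, here is a minimal sketch of how a no-error rate and an inter-session consistency score could be computed. The definitions are assumptions, since this summary does not give the study's exact formulas. Accuracy is taken as the share of studies whose extracted record matches the reference exactly, and consistency as the share of studies for which every repeated session returned the same record. All field names and values are invented for illustration.

```python
from typing import Sequence


def no_error_rate(extracted: Sequence[dict], reference: Sequence[dict]) -> float:
    """Share of studies whose extracted record equals the reference record."""
    if not reference or len(extracted) != len(reference):
        raise ValueError("extracted and reference must be non-empty and equal in length")
    matches = sum(e == r for e, r in zip(extracted, reference))
    return matches / len(reference)


def inter_session_consistency(sessions: Sequence[Sequence[dict]]) -> float:
    """Share of studies for which all sessions produced an identical record."""
    per_study = list(zip(*sessions))  # regroup the records study by study
    if not per_study:
        raise ValueError("at least one session with at least one study is required")
    identical = sum(all(rec == records[0] for rec in records) for records in per_study)
    return identical / len(per_study)


# Hypothetical reference data and two extraction sessions (not study data).
reference = [
    {"n": 100, "mortality": 20},
    {"n": 50, "mortality": 5},
    {"n": 80, "mortality": 12},
    {"n": 60, "mortality": 9},
]
session_a = [reference[0], reference[1], {"n": 80, "mortality": None}, reference[3]]
session_b = [reference[0], reference[1], {"n": 80, "mortality": None},
             {"n": 60, "mortality": 10}]

accuracy_a = no_error_rate(session_a, reference)
accuracy_b = no_error_rate(session_b, reference)
consistency = inter_session_consistency([session_a, session_b])
```

In this toy data the third study has a missing value in both sessions, so it counts against accuracy but not against consistency. A model can therefore be consistent without being accurate, as with Gemini 1.5 Pro's 91.3% background consistency alongside its 27.8% outcome accuracy.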

Clinical Implications

LLMs can effectively support background data extraction in systematic reviews, potentially reducing manual workload and errors. However, outcome data extraction remains less reliable, necessitating continued human oversight to ensure data accuracy. Clinicians and researchers should consider model selection carefully and be aware that prompt optimization may increase processing time without substantial accuracy gains.

Conclusion

While large language models show promise in automating background data extraction for systematic reviews, challenges persist in accurately extracting outcome data. Human validation remains essential to maintain data integrity in clinical guideline development.
