Evaluating Large Language Models for Data Extraction in Systematic Reviews
Overview
This study assessed the accuracy, consistency, and efficiency of three large language models (ChatGPT-4o, Claude 3 Sonnet, Gemini 1.5 Pro) in extracting data from clinical trials for systematic reviews. Claude 3 Sonnet demonstrated the highest accuracy for both background and outcome data extraction, while prompt optimization strategies had limited impact on performance.
Background
Systematic reviews require meticulous manual data extraction, which is labor-intensive and susceptible to human error. Large language models (LLMs) offer potential automation to streamline this process, but their reliability and reproducibility across different models and prompt techniques are not well established. This study focused on evaluating LLMs using clinical trial data from the Japanese Clinical Practice Guidelines for Sepsis and Septic Shock 2024. The aim was to compare extraction accuracy, consistency, and processing time across models and prompt strategies.
Data Highlights
| Metric | ChatGPT-4o | Claude 3 Sonnet | Gemini 1.5 Pro |
| --- | --- | --- | --- |
| Background Data Extraction Accuracy (No-error %) | 81.6% | 92.4% | Not specified |
| Outcome Data Extraction Accuracy (No-error %) | Not specified | 80.7% | 27.8% |
| Inter-session Consistency, Background Data (%) | 76.3% | Not specified | 91.3% |
| Inter-session Consistency, Outcome Data (%) | 44.8% | 65.6% | Not specified |
| Processing Time, Background Data (seconds) | 29.2 - 39.7 | Not specified | Not specified |
| Processing Time, Outcome Data (seconds) | 19.3 - 46.3 | Not specified | Not specified |
| Processing Time with Self-Reflection Prompts, Background Data (seconds) | 59.0 - 97.7 | Not specified | Not specified |
| Processing Time with Self-Reflection Prompts, Outcome Data (seconds) | 52.7 - 107.1 | Not specified | Not specified |
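As a rough illustration of how much the self-reflection prompts slow extraction, the sketch below divides the endpoints of the processing-time ranges listed above. Comparing range endpoints is an assumption made here for illustration, not an analysis reported by the study, and the names `baseline`, `self_reflection`, and `slowdown` are hypothetical.

```python
# Processing-time ranges in seconds, copied from the data listed above.
baseline = {"background": (29.2, 39.7), "outcome": (19.3, 46.3)}
self_reflection = {"background": (59.0, 97.7), "outcome": (52.7, 107.1)}


def slowdown(base, reflected):
    """Return the ratios of the lower and upper range endpoints (reflected / base)."""
    return (reflected[0] / base[0], reflected[1] / base[1])


# Assumption: endpoint-to-endpoint comparison is only a coarse proxy for the real slowdown.
ratios = {task: slowdown(baseline[task], self_reflection[task]) for task in baseline}

for task, (lower, upper) in ratios.items():
    print(f"{task}: lower endpoints {lower:.2f}x, upper endpoints {upper:.2f}x")
```

Under this endpoint comparison, self-reflection prompts take roughly two to three times as long as the baseline prompts for both background and outcome data.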
Key Findings
Claude 3 Sonnet achieved the highest accuracy for both background (92.4%) and outcome (80.7%) data extraction.
ChatGPT-4o showed moderate accuracy for background data (81.6%) but lower inter-session consistency (76.3% for background data, 44.8% for outcome data) and lower outcome data accuracy than Claude 3 Sonnet.
Gemini 1.5 Pro had the lowest outcome data extraction accuracy (27.8%) but the highest inter-session consistency for background data (91.3%).
Most extraction errors were due to missing or incorrect values; fabricated data were rare.
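Prompt engineering strategies, including chain-of-thought and self-reflection, improved accuracy only modestly while substantially increasing processing time (for background data, from 29.2 to 39.7 seconds up to 59.0 to 97.7 seconds with self-reflection prompts).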
Inter-session consistency was generally higher for background data extraction than for outcome data extraction across all models.
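The two headline metrics in these findings can be sketched in code. The definitions below are assumptions for illustration: accuracy is taken as the share of reference items that the model extracted with an exactly matching value, and inter-session consistency as the share of items for which two sessions returned the same value. The study's exact definitions may differ, and all function names and sample data here are hypothetical.

```python
def no_error_rate(extracted, reference):
    """Percentage of reference fields whose extracted value matches exactly."""
    if not reference:
        raise ValueError("reference must not be empty")
    correct = sum(
        1 for field, value in reference.items() if extracted.get(field) == value
    )
    return 100.0 * correct / len(reference)


def inter_session_consistency(session_a, session_b):
    """Percentage of fields for which two sessions returned identical values."""
    fields = set(session_a) | set(session_b)
    if not fields:
        raise ValueError("sessions must not both be empty")
    same = sum(
        1
        for f in fields
        if f in session_a and f in session_b and session_a[f] == session_b[f]
    )
    return 100.0 * same / len(fields)


# Made-up example data, not taken from the study.
reference = {"sample_size": 120, "country": "Japan", "mortality_events": 14, "design": "RCT"}
session_1 = {"sample_size": 120, "country": "Japan", "mortality_events": 41, "design": "RCT"}
session_2 = {"sample_size": 120, "country": "Japan", "design": "RCT"}  # one field missing

print(no_error_rate(session_1, reference))
print(inter_session_consistency(session_1, session_2))
```

In this sketch an incorrect value (session 1) and a missing value (session 2) both count against accuracy, which mirrors the two error types the findings describe as most common.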
Clinical Implications
LLMs can effectively support background data extraction in systematic reviews, potentially reducing manual workload and errors. However, outcome data extraction remains less reliable, necessitating continued human oversight to ensure data accuracy. Clinicians and researchers should consider model selection carefully and be aware that prompt optimization may increase processing time without substantial accuracy gains.
Conclusion
While large language models show promise in automating background data extraction for systematic reviews, challenges persist in accurately extracting outcome data. Human validation remains essential to maintain data integrity in clinical guideline development.
References
Evaluating the Efficacy of Large Language Models and Prompt Optimization Techniques for Data Extraction in Systematic Reviews