Clinical Scorecard: Evaluating the Efficacy of Large Language Models and Prompt Optimization Techniques for Data Extraction in Systematic Reviews
At a Glance
Category | Detail
Condition | Systematic review data extraction
Key Mechanisms | Use of large language models (LLMs) with prompt engineering to automate extraction of background characteristics and clinical outcomes from trial reports
Target Population | Clinical trials addressing sepsis and septic shock management (Japanese Clinical Practice Guidelines 2024)
Care Setting | Systematic review and guideline development settings
Key Highlights
LLMs extracted background data with high accuracy (81.6% to 92.4%) but extracted outcome data with lower and far more variable accuracy (27.8% to 80.7%).
Prompt engineering strategies (chain-of-thought, self-reflection) yielded only modest improvements in extraction accuracy.
Processing times increased substantially with self-reflection prompts, indicating a trade-off between accuracy and efficiency.
Guideline-Based Recommendations
Diagnosis
Use LLMs to assist with background data extraction in systematic reviews to improve efficiency.
Management
Maintain human oversight for outcome data extraction due to lower LLM accuracy and potential errors.
Select an LLM based on its measured performance on the specific extraction task, since accuracy varies between models.
Monitoring & Follow-up
Assess inter-session consistency of LLM outputs to ensure reproducibility.
Review extracted data for missing or incorrect values, which are the most common error types.
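The two monitoring checks above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the study: it assumes each LLM session returns a dictionary of field names to values, that a human-verified reference extraction exists for the same trial report, and that values match when they are equal after trimming whitespace and ignoring case. The function names and the field names (`n_randomized`, `mortality_28d`, `mean_age`) are hypothetical.

```python
from collections import Counter


def classify_field(extracted, reference):
    """Label one extracted value against a human-verified reference value."""
    if extracted is None or str(extracted).strip() == "":
        return "missing"
    if str(extracted).strip().lower() == str(reference).strip().lower():
        return "correct"
    return "incorrect"


def error_summary(extracted, reference):
    """Count correct, missing, and incorrect fields for one trial report."""
    return Counter(
        classify_field(extracted.get(field), ref_value)
        for field, ref_value in reference.items()
    )


def session_agreement(sessions):
    """Fraction of fields for which every session returned the same value."""
    fields = set().union(*(s.keys() for s in sessions))
    if not fields:
        return 1.0
    identical = sum(
        1 for f in fields if len({str(s.get(f)) for s in sessions}) == 1
    )
    return identical / len(fields)


# Hypothetical values for illustration only; not data from the study.
reference = {"n_randomized": "120", "mortality_28d": "34", "mean_age": "63.2"}
run_a = {"n_randomized": "120", "mortality_28d": "43", "mean_age": None}
run_b = {"n_randomized": "120", "mortality_28d": "34", "mean_age": None}

print(error_summary(run_a, reference))
print(session_agreement([run_a, run_b]))
```

Note that two sessions can agree on a value that is still wrong or missing (as with `mean_age` here), so the consistency check complements the comparison against a reference rather than replacing it.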
Risks
Potential for missing or incorrect data extraction by LLMs, especially for clinical outcomes.
Increased processing time with advanced prompt strategies may impact workflow efficiency.
Patient & Prescribing Data
Not applicable (focus on data extraction from clinical trial reports)
LLMs can support systematic review processes but require human validation to ensure data accuracy.
Clinical Best Practices
Utilize LLMs primarily for background data extraction to enhance review efficiency.
Apply human review to outcome data extracted by LLMs to mitigate errors.
Consider model-specific performance differences when integrating LLMs into systematic review workflows.
Balance prompt complexity with processing time to optimize accuracy and efficiency.
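One way to make the balance between prompt complexity and processing time concrete is to measure each prompt strategy on a validation set and pick the most accurate one that fits a time budget. The sketch below assumes such measurements are already available; the function name `pick_strategy`, the dictionary layout, and every number in `example` are hypothetical placeholders, not results from the study.

```python
def pick_strategy(results, max_seconds_per_report):
    """Return the most accurate strategy whose mean time fits the budget.

    `results` maps a strategy name to {"accuracy": float, "seconds": float}.
    Ties on accuracy are broken in favour of the faster strategy.
    Returns None when no strategy fits the budget.
    """
    eligible = {
        name: r
        for name, r in results.items()
        if r["seconds"] <= max_seconds_per_report
    }
    if not eligible:
        return None
    return max(
        eligible,
        key=lambda name: (eligible[name]["accuracy"], -eligible[name]["seconds"]),
    )


# Placeholder measurements for illustration only; NOT figures from the study.
example = {
    "zero_shot": {"accuracy": 0.70, "seconds": 10.0},
    "chain_of_thought": {"accuracy": 0.72, "seconds": 15.0},
    "self_reflection": {"accuracy": 0.73, "seconds": 60.0},
}

print(pick_strategy(example, max_seconds_per_report=20))
```

With a tight budget the small accuracy gain from a slower strategy is excluded automatically; widening the budget lets it through, which mirrors the trade-off described above.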