Objective:
To evaluate the performance of large language models (LLMs) in extracting data for systematic reviews and to assess the impact of different prompt strategies.
Key Findings:
Mean no-error proportions for background data extraction ranged from 81.6% to 92.4%.
Mean no-error proportions for outcome data extraction ranged from 27.8% to 80.7%.
Most errors were due to missing or incorrect values; fabricated outputs were rare.
Prompt engineering had modest effects on extraction accuracy.
Inter-session consistency ranged from 76.3% to 91.3% for background data and from 44.8% to 65.6% for outcome data.
Processing times for background extraction ranged from 29.2 to 39.7 seconds, and for outcome extraction from 19.3 to 46.3 seconds.
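The summary does not say how the two headline metrics were calculated, so the following is only a minimal sketch under assumed definitions: the no-error proportion is taken as the share of extracted items with zero recorded errors, and inter-session consistency as the share of items for which two sessions returned identical values. The function names and input layout are hypothetical and are not taken from the study.

```python
from typing import Sequence


def no_error_proportion(error_counts: Sequence[int]) -> float:
    """Share of extracted items with zero errors.

    Assumed definition: each entry is the number of errors a reviewer
    recorded for one extracted item.
    """
    if not error_counts:
        raise ValueError("error_counts must not be empty")
    error_free = sum(1 for count in error_counts if count == 0)
    return error_free / len(error_counts)


def inter_session_consistency(session_a: Sequence[str],
                              session_b: Sequence[str]) -> float:
    """Share of items for which two sessions returned the same value.

    Assumed definition: exact string match, item by item, between two
    runs of the same prompt on the same documents.
    """
    if len(session_a) != len(session_b):
        raise ValueError("sessions must cover the same items")
    if not session_a:
        raise ValueError("sessions must not be empty")
    matches = sum(1 for a, b in zip(session_a, session_b) if a == b)
    return matches / len(session_a)


if __name__ == "__main__":
    # Illustrative values only; these are not data from the study.
    print(no_error_proportion([0, 0, 1, 2]))
    print(inter_session_consistency(["12", "0.5", "NA"], ["12", "0.6", "NA"]))
```

Exact string matching is the strictest choice. A real evaluation might instead normalise units or allow a numeric tolerance, which would raise the consistency figures.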
Interpretation:
LLMs can effectively assist in background data extraction for systematic reviews, but challenges remain in outcome data extraction, highlighting the need for human oversight.
Limitations:
Performance varied substantially across different LLMs.
Prompt engineering strategies showed only modest improvements in accuracy.
The study focused on a limited number of clinical questions.
Conclusion:
While LLMs demonstrate reliability in background data extraction, outcome data extraction requires further refinement and human involvement.