Large Language Model Automated Extraction of Clinical Signs and Symptoms From Emergency Department Reports for Machine Learning Prediction Models: Development and Validation Study
Automated Extraction of Clinical Signs from ED Reports Using Multilingual LLMs
Overview
This study demonstrates that a compact multilingual large language model (Qwen 2.5:14B) can accurately extract 16 clinical signs and symptoms from Dutch emergency department (ED) reports related to acute abdominal pain (AAP). The LLM-extracted features closely matched physician annotations and maintained comparable performance in an appendicitis prediction model, supporting scalable and privacy-preserving clinical decision support.
Background
Emergency department clinical data is predominantly recorded as free text, complicating its reuse for research and decision support. Traditional NLP methods require extensive preprocessing and annotated datasets, limiting scalability. Large language models (LLMs) offer a promising alternative through zero-shot prompting, but their application to ED reports, especially in less represented languages like Dutch, remains underexplored. This study evaluates whether a smaller multilingual LLM can reliably extract complex clinical features from Dutch ED reports to support predictive modeling for appendicitis.
Data Highlights
Data characteristic: Value
Number of ED reports analyzed: 336
Appendicitis cases: 167
Other acute abdominal pain cases: 169
Clinical signs and symptoms extracted: 16 (8 binary, 1 multiclass, 7 multilabel)
Interrater agreement (Krippendorff α): 0.93 (binary), 0.95 (multiclass)
Interrater agreement (Jaccard similarity): 0.76 (multilabel)
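To make the multilabel agreement figure concrete, the following is a minimal sketch of how a mean Jaccard similarity between two sets of annotations could be computed. The label names, the example annotations, and the convention that two empty label sets count as full agreement are assumptions for illustration; the study's exact computation is not described in this summary.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two label sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty label sets are treated as perfect agreement.
        return 1.0
    return len(a & b) / len(a | b)


def mean_jaccard(rater_1, rater_2):
    """Average per-report Jaccard similarity over paired annotations."""
    if len(rater_1) != len(rater_2):
        raise ValueError("both raters must annotate the same reports")
    if not rater_1:
        raise ValueError("at least one report is required")
    scores = [jaccard(x, y) for x, y in zip(rater_1, rater_2)]
    return sum(scores) / len(scores)


# Hypothetical multilabel annotations (e.g. pain locations) for three reports.
rater_1 = [{"RLQ"}, {"RLQ", "epigastric"}, set()]
rater_2 = [{"RLQ"}, {"RLQ"}, set()]

print(round(mean_jaccard(rater_1, rater_2), 3))  # per-report scores: 1.0, 0.5, 1.0
```

A per-report average like this penalizes partial overlap (the second report scores 0.5) rather than counting it as a full miss, which is why Jaccard similarity suits multilabel features better than exact-match accuracy.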
Key Findings
The compact multilingual LLM (Qwen 2.5:14B) achieved near-expert precision in extracting 16 clinical features from Dutch ED reports using zero-shot prompting.
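LLM-extracted features showed high concordance with physician annotations, and interrater agreement was strong (Krippendorff α of 0.93 for binary and 0.95 for multiclass features; Jaccard similarity of 0.76 for multilabel features).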
The extracted features included binary, multiclass, and multilabel clinical attributes relevant to acute abdominal pain and appendicitis prediction.
When used as input for the HIVE appendicitis prediction model, LLM-extracted features maintained comparable predictive performance to manually annotated data.
The approach supports a scalable, privacy-preserving workflow by automating labor-intensive manual annotation without compromising data quality.
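The zero-shot extraction step described above can be sketched as building a prompt from a feature schema and validating the model's JSON answer against that schema. This is an illustrative sketch only: the three features, their option lists, the prompt wording, and the function names are hypothetical and not taken from the study, and the call to the locally hosted model is deliberately left out.

```python
import json

# Hypothetical schema with one feature of each type reported in the study.
SCHEMA = {
    "fever": {"type": "binary"},
    "pain_onset": {"type": "multiclass",
                   "options": ["sudden", "gradual", "unknown"]},
    "pain_location": {"type": "multilabel",
                      "options": ["RLQ", "LLQ", "epigastric"]},
}


def build_prompt(report_text, schema):
    """Assemble a zero-shot prompt: instructions only, no worked examples."""
    lines = [
        "Extract the following features from the emergency department report.",
        "Answer with a single JSON object and nothing else.",
        "",
    ]
    for name, spec in schema.items():
        if spec["type"] == "binary":
            lines.append(f'- "{name}": true or false')
        elif spec["type"] == "multiclass":
            lines.append(f'- "{name}": exactly one of {spec["options"]}')
        else:
            lines.append(f'- "{name}": a list with zero or more of {spec["options"]}')
    lines += ["", "Report:", report_text]
    return "\n".join(lines)


def parse_response(raw, schema):
    """Parse the model's JSON answer and reject values outside the schema."""
    data = json.loads(raw)
    out = {}
    for name, spec in schema.items():
        value = data.get(name)
        if spec["type"] == "binary":
            ok = isinstance(value, bool)
        elif spec["type"] == "multiclass":
            ok = value in spec["options"]
        else:
            ok = isinstance(value, list) and all(v in spec["options"] for v in value)
        if not ok:
            raise ValueError(f"invalid value for {name!r}: {value!r}")
        out[name] = value
    return out


# A hand-written stand-in for a model answer, used here instead of a real call.
raw = '{"fever": true, "pain_onset": "gradual", "pain_location": ["RLQ"]}'
print(parse_response(raw, SCHEMA))
```

Validating every answer against the schema means malformed or out-of-vocabulary outputs fail loudly instead of silently entering the downstream prediction model.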
Clinical Implications
Automated extraction of clinical signs and symptoms from ED free-text reports using a compact multilingual LLM can streamline data collection for decision support systems without additional burden on clinicians. This method enables rapid, accurate feature extraction in languages with limited LLM training data, facilitating broader implementation of predictive models like HIVE for appendicitis. Clinicians and health systems may adopt such workflows to enhance diagnostic accuracy and operational efficiency in emergency care.
Conclusion
This study validates that a small multilingual LLM can effectively automate the extraction of complex clinical features from Dutch ED reports, preserving the performance of downstream predictive models. The findings endorse the feasibility of scalable, privacy-conscious NLP solutions for emergency medicine applications.
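One way to check whether downstream performance is preserved is to score the same cases twice, once with manually annotated features and once with LLM-extracted features, and compare a discrimination metric. The sketch below uses the area under the ROC curve as an assumed metric; the labels and prediction scores are made-up illustrative values, not results from the study, and the HIVE model itself is not reproduced here.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = appendicitis, 0 = other acute abdominal pain.
labels = [1, 1, 0, 0, 1, 0]
scores_manual = [0.9, 0.7, 0.4, 0.2, 0.6, 0.5]  # model fed manual annotations
scores_llm = [0.8, 0.7, 0.5, 0.2, 0.4, 0.3]     # model fed LLM-extracted features

print("manual:", auc(labels, scores_manual))
print("llm:   ", round(auc(labels, scores_llm), 3))
```

Scoring both feature sets on the same cases keeps the comparison paired, so any difference reflects the feature source and not the case mix.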
References
Original Study -- Automated Extraction of Clinical Signs and Symptoms from Emergency Department Reports Using Large Language Models
by Anoeska Schipper, Peter Belgers, Rory David O'Connor, Lieke van de Wouw, Luc Builtjes, Joeran S Bosma, Ron Kusters, Steef Kurstjens, Matthieu Rutten, Bram van Ginneken