Large Language Model Automated Extraction of Clinical Signs and Symptoms From Emergency Department Reports for Machine Learning Prediction Models: Development and Validation Study

By
Anoeska Schipper
Peter Belgers
Rory David O'Connor
Lieke van de Wouw
Luc Builtjes
Joeran S Bosma
Ron Kusters
Steef Kurstjens
Matthieu Rutten
Bram van Ginneken
April 30, 2026

JMIR Medical Informatics

Overview

This study demonstrates that a compact multilingual large language model (Qwen 2.5:14B) can accurately extract 16 clinical signs and symptoms from Dutch emergency department (ED) reports related to acute abdominal pain (AAP). The LLM-extracted features closely matched physician annotations and maintained comparable performance in an appendicitis prediction model, supporting scalable and privacy-preserving clinical decision support.

Background

Emergency department clinical data is predominantly recorded as free text, which complicates its reuse for research and decision support. Traditional natural language processing (NLP) methods require extensive preprocessing and annotated datasets, which limits their scalability. Large language models (LLMs) offer a promising alternative through zero-shot prompting, but their application to ED reports, especially in less-represented languages such as Dutch, remains underexplored. This study evaluates whether a smaller multilingual LLM can reliably extract complex clinical features from Dutch ED reports to support predictive modeling for appendicitis.

Data Highlights

Data Characteristic	Value
Number of ED reports analyzed	336
Appendicitis cases	167
Other acute abdominal pain cases	169
Clinical signs and symptoms extracted	16 (8 binary, 1 multiclass, 7 multilabel)
Interrater agreement (Krippendorff α)	0.93 (binary), 0.95 (multiclass)
Interrater agreement (Jaccard similarity)	0.76 (multilabel)
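
For readers unfamiliar with the multilabel metric in the table, the following is a minimal sketch of how a mean Jaccard similarity between two annotators could be computed. The function names, the example labels, and the convention that two empty label sets count as full agreement are assumptions for illustration; the summary does not describe the study's exact procedure.

```python
def jaccard_similarity(labels_a, labels_b):
    """Jaccard similarity between two label sets: |A & B| / |A | B|."""
    set_a, set_b = set(labels_a), set(labels_b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty label sets count as full agreement.
        return 1.0
    return len(set_a & set_b) / len(union)


def mean_jaccard(annotations_a, annotations_b):
    """Average the per-report Jaccard similarity over paired annotations."""
    if len(annotations_a) != len(annotations_b):
        raise ValueError("Both annotators must label the same reports")
    if not annotations_a:
        raise ValueError("At least one report is required")
    scores = [jaccard_similarity(a, b)
              for a, b in zip(annotations_a, annotations_b)]
    return sum(scores) / len(scores)


# Hypothetical pain-location labels for three reports (not study data).
rater_1 = [{"right lower quadrant"}, {"epigastric", "periumbilical"}, set()]
rater_2 = [{"right lower quadrant"}, {"periumbilical"}, set()]
print(mean_jaccard(rater_1, rater_2))
```

Averaging per report, rather than pooling all labels, keeps a single heavily labeled report from dominating the score.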

Key Findings

The compact multilingual LLM (Qwen 2.5:14B) achieved near-expert precision in extracting 16 clinical features from Dutch ED reports using zero-shot prompting.
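LLM-extracted features showed high concordance with physician annotations, supported by the reported interrater agreement (Krippendorff α 0.93 for binary and 0.95 for multiclass features; Jaccard similarity 0.76 for multilabel features).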
The extracted features included binary, multiclass, and multilabel clinical attributes relevant to acute abdominal pain and appendicitis prediction.
When used as input for the HIVE appendicitis prediction model, LLM-extracted features maintained comparable predictive performance to manually annotated data.
The approach supports a scalable, privacy-preserving workflow by automating labor-intensive manual annotation without compromising data quality.
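
The zero-shot extraction of binary, multiclass, and multilabel features described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the study's prompt or pipeline: the feature names, option lists, and the `build_prompt` and `parse_response` helpers are hypothetical, and the call to the locally hosted model is deliberately left out.

```python
import json

# Hypothetical feature schema: names and options are illustrative only,
# not the 16 features used in the study.
FEATURES = {
    "nausea": {"type": "binary"},
    "pain_onset": {"type": "multiclass",
                   "options": ["sudden", "gradual", "unknown"]},
    "pain_location": {"type": "multilabel",
                      "options": ["RLQ", "LLQ", "epigastric", "diffuse"]},
}


def build_prompt(report_text):
    """Build a zero-shot prompt: instructions and schema, no worked examples."""
    lines = [
        "Extract the following features from the emergency department report.",
        "Answer with a single JSON object and nothing else.",
        "",
    ]
    for name, spec in FEATURES.items():
        if spec["type"] == "binary":
            lines.append(f'- "{name}": true or false')
        elif spec["type"] == "multiclass":
            lines.append(f'- "{name}": exactly one of {spec["options"]}')
        else:
            lines.append(f'- "{name}": a list with zero or more of {spec["options"]}')
    lines += ["", "Report:", report_text]
    return "\n".join(lines)


def parse_response(raw):
    """Check the model's JSON answer against the schema; raise ValueError on mismatch."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    result = {}
    for name, spec in FEATURES.items():
        value = data.get(name)
        if spec["type"] == "binary":
            valid = isinstance(value, bool)
        elif spec["type"] == "multiclass":
            valid = value in spec["options"]
        else:
            valid = isinstance(value, list) and all(v in spec["options"] for v in value)
        if not valid:
            raise ValueError(f"Invalid value for {name!r}: {value!r}")
        result[name] = value
    return result


# A made-up model answer, used only to exercise the parser.
example_answer = '{"nausea": true, "pain_onset": "gradual", "pain_location": ["RLQ"]}'
print(parse_response(example_answer))
```

Validating every answer against a fixed schema means malformed or out-of-vocabulary outputs fail loudly instead of silently entering the downstream prediction model.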

Clinical Implications

Automated extraction of clinical signs and symptoms from ED free-text reports using a compact multilingual LLM can streamline data collection for decision support systems without additional burden on clinicians. This method enables rapid, accurate feature extraction in languages with limited LLM training data, facilitating broader implementation of predictive models like HIVE for appendicitis. Clinicians and health systems may adopt such workflows to enhance diagnostic accuracy and operational efficiency in emergency care.

Conclusion

This study validates that a small multilingual LLM can effectively automate the extraction of complex clinical features from Dutch ED reports, preserving the performance of downstream predictive models. The findings endorse the feasibility of scalable, privacy-conscious NLP solutions for emergency medicine applications.

References

Original Study -- Automated Extraction of Clinical Signs and Symptoms from Emergency Department Reports Using Large Language Models

Large Language Model Automated Extraction of Clinical Signs and Symptoms From Emergency Department Reports for Machine Learning Prediction Models: Development and Validation Study
