Large language model-based uncertainty-adjusted label extraction for artificial intelligence model development in upper extremity radiography

By
Hanna Kreutzer
Anne-Sophie Caselitz
Thomas Dratsch
Daniel Pinto dos Santos
Christiane Kuhl
Daniel Truhn
Sven Nebelung
November 14, 2025

European Radiology

Overview

This study demonstrates that large language models (LLMs), specifically GPT-4o, can accurately extract multi-label structured data from radiologic reports of the clavicle, elbow, and thumb, including uncertainty detection. Incorporating uncertainty-aware labeling strategies enabled effective training of convolutional neural networks (CNNs) for multi-label classification, with model performance validated on internal and external datasets.

Background

Radiologic imaging is performed billions of times annually worldwide, yet AI development is hindered by limited annotated datasets. Manual annotation is resource-intensive and prone to inconsistency, while traditional NLP methods for label extraction struggle with complex terminology and uncertainty in reports. Large language models offer a promising alternative by interpreting nuanced language and extracting structured labels, including uncertain findings, which are common in radiology. Prior work has not addressed multi-label extraction across multiple upper extremity regions or accounted for uncertainty in labels.

Data Highlights

Dataset	Region	Number of Patients	Data Split
Internal (Aachen)	Clavicle, Elbow, Thumb	Not specified	Training 64%, Validation 16%, Test 20%
External (Cologne)	Clavicle, Elbow, Thumb	300 per region	Test only
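
The internal 64/16/20 proportions can be reproduced with a simple seeded shuffle. The sketch below is illustrative only: it assumes the split is made at the patient level so that no patient appears in more than one subset, which the summary does not state, and the function name `split_patients` and the patient identifiers are hypothetical.

```python
import random


def split_patients(patient_ids, seed=0):
    """Split patient IDs into train/validation/test at roughly 64/16/20."""
    ids = sorted(set(patient_ids))      # deduplicate so each patient lands in exactly one subset
    random.Random(seed).shuffle(ids)    # seeded shuffle keeps the split reproducible
    n_test = round(0.20 * len(ids))     # 20% held out for testing
    n_val = round(0.16 * len(ids))      # 16% used for validation
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]        # the remainder, about 64%
    return train, val, test


# Hypothetical identifiers, for illustration only.
train, val, test = split_patients([f"patient_{i:03d}" for i in range(100)])
print(len(train), len(val), len(test))
```

Note that 64/16/20 is also what results from holding out 20% for testing and then splitting the remaining 80% again at 80/20; whether the authors proceeded that way is not stated here.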

Key Findings

GPT-4o effectively extracted structured labels from free-text radiologic reports across multiple upper extremity regions.
Labels included three states: true, false, and uncertain, capturing diagnostic ambiguity inherent in radiology reports.
Uncertain labels were handled via inclusive (counted as true) and exclusive (counted as false) strategies during CNN training.
Multi-label CNNs trained on LLM-extracted labels achieved robust classification performance on both internal and external test sets.
Accounting for label uncertainty did not adversely affect model performance, supporting the hypothesis that uncertainty-aware labeling is feasible and beneficial.
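
The two uncertainty strategies amount to different mappings from three-state labels to binary training targets. The following is a minimal sketch of that mapping, not the authors' implementation: the function name `binarize`, the label strings, and the finding names are assumptions made for illustration.

```python
def binarize(labels, strategy):
    """Map three-state labels to 0/1 targets for multi-label training.

    inclusive: uncertain findings are counted as true (1)
    exclusive: uncertain findings are counted as false (0)
    """
    if strategy not in ("inclusive", "exclusive"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    uncertain_value = 1 if strategy == "inclusive" else 0
    mapping = {"true": 1, "false": 0, "uncertain": uncertain_value}
    return {finding: mapping[state] for finding, state in labels.items()}


# Hypothetical labels for a single report; finding names are illustrative.
report_labels = {
    "fracture": "uncertain",
    "dislocation": "false",
    "osteoarthritis": "true",
}

inclusive = binarize(report_labels, "inclusive")
exclusive = binarize(report_labels, "exclusive")
print(inclusive)
print(exclusive)
```

Only the uncertain finding changes between the two target vectors, so training one model per strategy isolates the effect of how ambiguity is resolved.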

Clinical Implications

The use of LLMs for automated, uncertainty-aware label extraction can significantly reduce the labor and cost associated with manual annotation of radiologic datasets. This approach enables scalable development of AI models for multi-label classification in upper extremity radiography, potentially improving diagnostic support tools. Incorporating uncertainty in labels preserves clinically relevant ambiguity, which may enhance model robustness and generalizability.

Conclusion

LLMs such as GPT-4o can accurately and efficiently extract multi-label, uncertainty-aware annotations from radiologic reports, facilitating the training of effective AI models for upper extremity radiography. This methodology addresses key challenges in dataset curation and supports the advancement of clinically relevant AI applications.
