Large language model-based uncertainty-adjusted label extraction for artificial intelligence model development in upper extremity radiography

By Hanna Kreutzer, Anne-Sophie Caselitz, Thomas Dratsch, Daniel Pinto dos Santos, Christiane Kuhl, Daniel Truhn, Sven Nebelung

November 14, 2025


LLM-Based Uncertainty-Aware Label Extraction for Upper Extremity Radiography AI Models

Overview

This study demonstrates that large language models (LLMs), specifically GPT-4o, can accurately extract multi-label structured data from radiologic reports of the clavicle, elbow, and thumb, including uncertainty detection. Incorporating uncertainty-aware labeling strategies enabled effective training of convolutional neural networks (CNNs) for multi-label classification, with model performance validated on internal and external datasets.
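
As a rough illustration of what "structured data with uncertainty detection" can look like downstream, the sketch below validates a JSON reply from a language model and reduces it to one of three states per finding. The finding names, the JSON shape, and the function name parse_labels are assumptions made for this example; the summary does not list the study's actual label set or prompt format.

```python
import json

# Hypothetical finding names; the study's actual label set is not given in this summary.
FINDINGS = ("fracture", "dislocation", "osteoarthritis")
# The three label states described in the study summary.
STATES = {"true", "false", "uncertain"}


def parse_labels(reply: str) -> dict:
    """Validate a JSON reply and return one state per expected finding."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    labels = {}
    for finding in FINDINGS:
        if finding not in data:
            raise ValueError(f"missing finding: {finding}")
        # Accept JSON booleans as well as strings by normalizing to lowercase text.
        state = str(data[finding]).lower()
        if state not in STATES:
            raise ValueError(f"unexpected state {state!r} for {finding}")
        labels[finding] = state
    return labels


# Assumed example reply; a real reply would come from the model.
example_reply = '{"fracture": "uncertain", "dislocation": false, "osteoarthritis": "true"}'
print(parse_labels(example_reply))
```

Rejecting unexpected states instead of silently coercing them keeps malformed replies out of the training labels, which seems a sensible default for this kind of pipeline.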

Background

Radiologic imaging is performed billions of times annually worldwide, yet AI development is hindered by limited annotated datasets. Manual annotation is resource-intensive and prone to inconsistency, while traditional NLP methods for label extraction struggle with complex terminology and uncertainty in reports. Large language models offer a promising alternative by interpreting nuanced language and extracting structured labels, including uncertain findings, which are common in radiology. Prior work has not addressed multi-label extraction across multiple upper extremity regions or accounted for uncertainty in labels.

Data Highlights

Dataset | Region | Number of Patients | Data Split
Internal (Aachen) | Clavicle, Elbow, Thumb | Not specified | Training 64%, Validation 16%, Test 20%
External (Cologne) | Clavicle, Elbow, Thumb | 300 per region | Test only
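
The 64% / 16% / 20% proportions of the internal dataset can be reproduced with a simple shuffle-and-slice. The sketch below assumes the split is made at the patient level, so that no patient appears in more than one subset; the summary does not state how the study actually performed the split, and split_patients is a hypothetical helper.

```python
import random


def split_patients(patient_ids, seed=0):
    """Split unique patient IDs into train/validation/test at 64/16/20."""
    ids = sorted(set(patient_ids))       # de-duplicate so each patient lands in one subset
    random.Random(seed).shuffle(ids)     # seeded shuffle for reproducibility
    n = len(ids)
    n_test = round(0.20 * n)
    n_val = round(0.16 * n)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]         # remainder, about 64%
    return train, val, test


train, val, test = split_patients(range(1000))
print(len(train), len(val), len(test))  # 640 160 200
```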

Key Findings

  • GPT-4o effectively extracted structured labels from free-text radiologic reports across multiple upper extremity regions.
  • Labels included three states: true, false, and uncertain, capturing diagnostic ambiguity inherent in radiology reports.
  • Uncertain labels were handled via inclusive (counted as true) and exclusive (counted as false) strategies during CNN training.
  • Multi-label CNNs trained on LLM-extracted labels achieved robust classification performance on both internal and external test sets.
  • Accounting for label uncertainty did not adversely affect model performance, which supports the feasibility of uncertainty-aware labeling.
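
The inclusive and exclusive strategies described above amount to two different mappings from three-state labels to binary training targets. A minimal sketch follows, assuming labels are stored as the strings "true", "false", and "uncertain"; the finding names and the helpers to_binary and encode_report are hypothetical and are not taken from the study.

```python
def to_binary(state: str, strategy: str) -> int:
    """Map a three-state label to a 0/1 target under the given strategy."""
    if strategy not in ("inclusive", "exclusive"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if state == "true":
        return 1
    if state == "false":
        return 0
    if state == "uncertain":
        # Inclusive counts uncertain findings as true, exclusive as false.
        return 1 if strategy == "inclusive" else 0
    raise ValueError(f"unknown state: {state!r}")


def encode_report(labels: dict, findings: tuple, strategy: str) -> list:
    """Build a multi-label target vector in a fixed finding order."""
    return [to_binary(labels[f], strategy) for f in findings]


# Hypothetical findings and labels for one report.
findings = ("fracture", "dislocation", "osteoarthritis")
labels = {"fracture": "uncertain", "dislocation": "false", "osteoarthritis": "true"}

print(encode_report(labels, findings, "inclusive"))  # [1, 0, 1]
print(encode_report(labels, findings, "exclusive"))  # [0, 0, 1]
```

The two target vectors differ only where the report was uncertain, so the same set of extracted labels can serve both training variants.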

Clinical Implications

The use of LLMs for automated, uncertainty-aware label extraction can significantly reduce the labor and cost associated with manual annotation of radiologic datasets. This approach enables scalable development of AI models for multi-label classification in upper extremity radiography, potentially improving diagnostic support tools. Incorporating uncertainty in labels preserves clinically relevant ambiguity, which may enhance model robustness and generalizability.

Conclusion

LLMs such as GPT-4o can accurately and efficiently extract multi-label, uncertainty-aware annotations from radiologic reports, facilitating the training of effective AI models for upper extremity radiography. This methodology addresses key challenges in dataset curation and supports the advancement of clinically relevant AI applications.

