GPT-4.1 and Llama 3.3 70B fail to detect clinically relevant errors in radiology reports in zero-shot evaluation - Scorecard - MDSpire

By Tugba Akinci D’Antonoli, Lisa C. Adams, Jannik Lübberstedt, Markus M. Graf, Christian J. Mertens, Felix Busch, Sebastian Ziegelmayer, Marcus R. Makowski, Keno Bressem, Ina Luiken

June 19, 2026

Clinical Scorecard: GPT-4.1 and Llama 3.3 70B Show Limitations in Identifying Clinically Significant Errors in Radiology Reports During Zero-Shot Assessment

At a Glance

Category: Detail
Condition: Radiology Report Error Detection
Key Mechanisms: Evaluation of large language models (LLMs) for identifying clinically relevant errors in radiology reports.
Target Population: Radiologists and healthcare professionals involved in diagnostic imaging.
Care Setting: Clinical radiology departments.

Key Highlights

  • Two LLMs, GPT-4.1 and Llama 3.3 70B, were evaluated in a zero-shot setting for error detection in radiology reports.
  • The study identified a gap between pattern-based and reasoning-dependent error detection.
  • Errors were categorized into anatomical mislabeling, physiologically impossible findings, diagnostic inconsistencies, and inappropriate recommendations.
  • The dataset included 256 radiology reports modified with 1024 error variants.
  • The evaluation framework and dataset are publicly available for future benchmarking.
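
The dataset figures and error categories above can be pictured with a small sketch. It is only an illustration, not the study's evaluation framework: the class `ErrorVariant`, its fields, and both helper functions are hypothetical names, and the layout of one variant per category per report is an assumption that happens to match the reported totals (256 reports × 4 categories = 1,024 variants). The detection flags in the usage example are made-up toy values, not study results.

```python
from collections import Counter
from dataclasses import dataclass

# The four error categories named in the scorecard.
CATEGORIES = (
    "anatomical_mislabeling",
    "physiologically_impossible_finding",
    "diagnostic_inconsistency",
    "inappropriate_recommendation",
)


@dataclass(frozen=True)
class ErrorVariant:
    """One modified report (hypothetical record layout)."""
    report_id: int   # index of the source report
    category: str    # one of CATEGORIES
    detected: bool   # whether a model flagged the inserted error


def build_variants(n_reports: int) -> list[ErrorVariant]:
    """Assumed layout: one variant per category for every report."""
    return [
        ErrorVariant(report_id=r, category=c, detected=False)
        for r in range(n_reports)
        for c in CATEGORIES
    ]


def detection_rate_by_category(variants: list[ErrorVariant]) -> dict[str, float]:
    """Fraction of variants flagged as erroneous, grouped by category."""
    total = Counter(v.category for v in variants)
    hits = Counter(v.category for v in variants if v.detected)
    return {category: hits[category] / total[category] for category in total}


# Under the assumed layout, 256 reports yield 256 * 4 = 1024 variants.
variants = build_variants(256)
print(len(variants))

# Toy flags only, to show how per-category rates would be reported.
toy = [
    ErrorVariant(0, "anatomical_mislabeling", True),
    ErrorVariant(1, "anatomical_mislabeling", False),
    ErrorVariant(0, "diagnostic_inconsistency", False),
]
print(detection_rate_by_category(toy))
```

Reporting the rate per category rather than as one pooled number is what would make a gap between pattern-based and reasoning-dependent error detection visible.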

Guideline-Based Recommendations

Diagnosis

  • Utilize structured radiology reports to enhance error detection capabilities.

Management

  • Incorporate LLMs with caution, ensuring they are supplemented by expert review.

Monitoring & Follow-up

  • Continuously assess the performance of LLMs in clinical settings to ensure safety.

Risks

  • Potential for LLMs to misinterpret or overlook clinically significant errors.

Patient & Prescribing Data

Patients undergoing radiographic examinations.

Awareness of the limitations of LLMs in accurately detecting errors in radiology reports.

Clinical Best Practices

  • Engage board-certified radiologists in the verification of errors in reports.
  • Ensure that LLM outputs are reviewed by qualified professionals before clinical application.
  • Maintain a focus on both linguistic fluency and clinical accuracy in automated systems.
