GPT-4.1 and Llama 3.3 70B fail to detect clinically relevant errors in radiology reports in zero-shot evaluation

Category	Detail
Condition	Radiology Report Error Detection
Key Mechanisms	Evaluation of large language models (LLMs) for identifying clinically relevant errors in radiology reports.
Target Population	Radiologists and healthcare professionals involved in diagnostic imaging.
Care Setting	Clinical radiology departments.

GPT-4.1 and Llama 3.3 70B were evaluated zero-shot for error detection in radiology reports.
The study identified a gap between pattern-based and reasoning-dependent error detection.
Errors were categorized into anatomical mislabeling, physiologically impossible findings, diagnostic inconsistencies, and inappropriate recommendations.
The dataset included 256 radiology reports modified with 1024 error variants.
The evaluation framework and dataset are publicly available for future benchmarking.
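
As a rough illustration of how detection might be scored per error category, here is a minimal Python sketch. It is not the study's evaluation framework: the function name, category identifiers, and input format are hypothetical, and the example outcomes are made up rather than taken from the dataset. The reported counts are consistent with one variant per category per report (256 × 4 = 1,024), though the summary does not state that design.

```python
from collections import defaultdict

# The four error categories named in the summary (identifiers are hypothetical).
CATEGORIES = (
    "anatomical_mislabeling",
    "physiologically_impossible_finding",
    "diagnostic_inconsistency",
    "inappropriate_recommendation",
)


def detection_rate_by_category(results):
    """Return {category: fraction of error variants the model flagged}.

    `results` is an iterable of (category, detected) pairs, one per
    error variant shown to the model. Categories with no variants
    are omitted from the output.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, detected in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        totals[category] += 1
        hits[category] += int(bool(detected))
    return {c: hits[c] / totals[c] for c in CATEGORIES if totals[c]}


# Illustrative, made-up outcomes (not study data).
example = [
    ("anatomical_mislabeling", True),
    ("anatomical_mislabeling", False),
    ("diagnostic_inconsistency", False),
    ("diagnostic_inconsistency", False),
]
rates = detection_rate_by_category(example)
```

Reporting rates per category rather than as a single pooled score is one way to make a gap between pattern-based and reasoning-dependent detection visible.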

Continuously assess the performance of LLMs in clinical settings to ensure safety.

Patients undergoing radiographic examinations.

Awareness of the limitations of LLMs in accurately detecting errors in radiology reports.

Engage board-certified radiologists in the verification of errors in reports.
Ensure that LLM outputs are reviewed by qualified professionals before clinical application.
Maintain a focus on both linguistic fluency and clinical accuracy in automated systems.

Clinical Scorecard: GPT-4.1 and Llama 3.3 70B Show Limitations in Identifying Clinically Significant Errors in Radiology Reports During Zero-Shot Assessment