GPT-4.1 and Llama 3.3 70B fail to detect clinically relevant errors in radiology reports in zero-shot evaluation

By Tugba Akinci D’Antonoli, Lisa C. Adams, Jannik Lübberstedt, Markus M. Graf, Christian J. Mertens, Felix Busch, Sebastian Ziegelmayer, Marcus R. Makowski, Keno Bressem, Ina Luiken
June 19, 2026

European Radiology

Objective:

To evaluate the error detection capabilities of GPT-4.1 and Llama 3.3 70B in radiology reports, focusing on clinically relevant errors and the distinction between pattern-based and reasoning-dependent error types.

Approach:

Key Findings:

Both LLMs showed limitations in identifying clinically significant errors in radiology reports.
Performance varied by error type, with a notable gap between detecting formulaic, pattern-based errors and detecting errors that require domain-specific reasoning.
The evaluation framework and dataset are publicly available for future benchmarking.

Interpretation:

Limitations:

The models were evaluated only in a zero-shot setting, which may not reflect real-world performance and limits how far the findings can be applied.
Only two LLMs were evaluated, which limits the generalizability of the results.

Conclusion:
