GPT-4.1 and Llama 3.3 70B fail to detect clinically relevant errors in radiology reports in zero-shot evaluation

By Tugba Akinci D’Antonoli, Lisa C. Adams, Jannik Lübberstedt, Markus M. Graf, Christian J. Mertens, Felix Busch, Sebastian Ziegelmayer, Marcus R. Makowski, Keno Bressem, Ina Luiken

June 19, 2026

Objective:

To evaluate the error detection capabilities of GPT-4.1 and Llama 3.3 70B in radiology reports, focusing on clinically relevant errors and the distinction between pattern-based and reasoning-dependent error types.

Approach:

Key Findings:

  • Both GPT-4.1 and Llama 3.3 70B struggled to identify clinically significant errors in radiology reports.
  • Performance varied by error type, with a notable gap between detecting pattern-based (formulaic) errors and detecting errors that require domain-specific reasoning.
  • The evaluation framework and dataset are publicly available for future benchmarking.
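
The per-error-type comparison described in the findings can be illustrated with a minimal Python sketch. This is not the study's evaluation framework: the function name `detection_rate_by_type`, the category labels, and the toy records are all hypothetical and invented purely for illustration. The sketch only shows how a detection rate per error type, and the gap between two types, might be computed from labeled results.

```python
from collections import defaultdict


def detection_rate_by_type(records):
    """Return the fraction of errors detected for each error type.

    `records` is an iterable of (error_type, detected) pairs, where
    `detected` is True if the model flagged the inserted error.
    (Hypothetical helper, not part of the published framework.)
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for error_type, detected in records:
        totals[error_type] += 1
        if detected:
            hits[error_type] += 1
    return {t: hits[t] / totals[t] for t in totals}


# Invented toy records, NOT results from the study.
toy_records = [
    ("pattern_based", True),
    ("pattern_based", True),
    ("pattern_based", False),
    ("pattern_based", True),
    ("reasoning_dependent", False),
    ("reasoning_dependent", True),
    ("reasoning_dependent", False),
    ("reasoning_dependent", False),
]

rates = detection_rate_by_type(toy_records)
# Difference in detection rate between the two (hypothetical) categories.
gap = rates["pattern_based"] - rates["reasoning_dependent"]
print(rates, gap)
```

Reporting a rate per category rather than a single pooled score keeps a weak category from being masked by a strong one, which is the kind of gap the findings describe.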
Interpretation:

Limitations:

  • The study used only a zero-shot setting, so the results may not reflect real-world performance, which limits the applicability of the findings.
  • Only two LLMs were evaluated, which may limit the generalizability of the results.

Conclusion:
