GPT-4.1 and Llama 3.3 70B fail to detect clinically relevant errors in radiology reports in zero-shot evaluation
By Tugba Akinci D’Antonoli, Lisa C. Adams, Jannik Lübberstedt, Markus M. Graf, Christian J. Mertens, Felix Busch, Sebastian Ziegelmayer, Marcus R. Makowski, Keno Bressem, Ina Luiken
June 19, 2026