Evaluation of large language models for diagnostic impression generation from brain MRI report findings: a multicenter benchmark and reader study

1. The study evaluated 10 large language models for generating diagnoses from 4293 brain MRI reports across 15 disease categories.
2. DeepSeek-R1 outperformed the other models, achieving the highest accuracy when given structured report findings and clinical information.
3. A top-three differential diagnosis prompting strategy yielded 97.6% patient-level accuracy, significantly higher than single-diagnosis prompting.
4. In the reader study, DeepSeek-R1 assistance improved diagnostic accuracy and reduced reading time, especially for junior radiologists.
5. The findings suggest that advanced LLMs such as DeepSeek-R1 can enhance workflow efficiency in radiology by supporting MRI report drafting.

npj Digital Medicine

by Ming-Liang Wang, Rui-Peng Zhang, Wen-Juan Wu, Yu Lu, Xiao-Er Wei, Zheng Sun, Bao-Hui Guan, Jun-Jie Zhang, Xue Wu, Lei Zhang, Tian-Le Wang, Yue-Hua Li
January 22, 2026
