Evaluation of large language models for diagnostic impression generation from brain MRI report findings: a multicenter benchmark and reader study

By
Ming-Liang Wang
Rui-Peng Zhang
Wen-Juan Wu
Yu Lu
Xiao-Er Wei
Zheng Sun
Bao-Hui Guan
Jun-Jie Zhang
Xue Wu
Lei Zhang
Tian-Le Wang
Yue-Hua Li
January 22, 2026

npj Digital Medicine

Objective:

To evaluate how well large language models (LLMs) generate diagnostic impressions from brain MRI report findings, using a multicenter benchmark and a reader study.

Key Findings:

DeepSeek-R1 achieved the highest performance across the dataset and across clinical scenarios.
A top-three differential-diagnosis prompting strategy reached 97.6% patient-level accuracy, compared with 87.1% for single-diagnosis prompting, a difference of 10.5 percentage points that shows how much the prompting strategy matters.
Integration of DeepSeek-R1 improved diagnostic accuracy (AUPRC: 0.774–0.893) and reduced reading time from 61 to 53 seconds.
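
The gap between single-diagnosis and top-three prompting follows from how patient-level accuracy is counted: a case is a hit when the reference diagnosis appears anywhere among the first k ranked suggestions. The sketch below illustrates that counting rule only. The function name `top_k_accuracy`, the exact string matching, and the four toy cases are assumptions made for illustration; they are not the study's data or its scoring procedure, which this summary does not describe.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    reference_diagnoses: Sequence[str],
    k: int,
) -> float:
    """Fraction of patients whose reference diagnosis is among the first k predictions."""
    if len(ranked_predictions) != len(reference_diagnoses):
        raise ValueError("one ranked prediction list is required per patient")
    if not reference_diagnoses:
        raise ValueError("at least one patient is required")
    # Exact string matching is a simplification (assumption); real evaluations
    # typically need expert judgment to decide whether two diagnoses agree.
    hits = sum(
        reference in ranked[:k]
        for ranked, reference in zip(ranked_predictions, reference_diagnoses)
    )
    return hits / len(reference_diagnoses)


# Toy cases invented for illustration only (not taken from the study).
predictions = [
    ["glioblastoma", "metastasis", "abscess"],
    ["meningioma", "schwannoma", "metastasis"],
    ["multiple sclerosis", "acute infarct", "vasculitis"],
    ["abscess", "glioblastoma", "metastasis"],
]
references = ["glioblastoma", "schwannoma", "acute infarct", "lymphoma"]

top1 = top_k_accuracy(predictions, references, k=1)
top3 = top_k_accuracy(predictions, references, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")  # prints "top-1: 0.25, top-3: 0.75"
```

Because the first k predictions always include the first one, top-three accuracy can never fall below top-one accuracy on the same cases, so the two figures measure different levels of diagnostic specificity and should be read together.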

Interpretation:

The study indicates that advanced LLMs such as DeepSeek-R1 can support automated generation of diagnostic impressions in brain MRI reporting, improving both diagnostic accuracy and reading time.

Limitations:

The findings are based on a specific dataset and may not generalize to all clinical settings.
LLM performance may vary with prompting strategy and input type, so further research is needed.

Conclusion:

With optimized prompting and input strategies, LLMs can be a valuable tool for drafting brain MRI reports, potentially improving radiology workflow efficiency and patient care.


