Evaluation of large language models for diagnostic impression generation from brain MRI report findings: a multicenter benchmark and reader study - Summary - MDSpire

By Ming-Liang Wang, Rui-Peng Zhang, Wen-Juan Wu, Yu Lu, Xiao-Er Wei, Zheng Sun, Bao-Hui Guan, Jun-Jie Zhang, Xue Wu, Lei Zhang, Tian-Le Wang, Yue-Hua Li

January 22, 2026


Objective:

To evaluate how well large language models (LLMs) generate diagnostic impressions from the findings sections of brain MRI reports, using a multicenter benchmark and a reader study.

Key Findings:
  • DeepSeek-R1 achieved the highest performance, both across the full dataset and within the clinical scenarios.
  • Top-three differential-diagnosis prompting reached 97.6% patient-level accuracy, compared with 87.1% for single-diagnosis prompting, a difference of 10.5 percentage points.
  • Integrating DeepSeek-R1 improved diagnostic accuracy (AUPRC: 0.774–0.893) and reduced reading time from 61 to 53 seconds.
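
To make the prompting comparison concrete, the sketch below shows one common way a top-k accuracy can be scored: a case counts as correct when the true diagnosis appears among the first k candidates. This scoring rule is an assumption, since the summary does not say how the study scored top-three answers. The function name `top_k_accuracy` and the toy cases are hypothetical and are not data from the study; only the last three lines use figures reported above.

```python
def top_k_accuracy(predictions, truths, k):
    """Fraction of cases whose true diagnosis is among the first k candidates."""
    hits = sum(truth in preds[:k] for preds, truth in zip(predictions, truths))
    return hits / len(truths)


# Hypothetical toy cases (NOT study data): ranked candidate diagnoses per patient.
predictions = [
    ["glioblastoma", "metastasis", "lymphoma"],
    ["meningioma", "schwannoma", "metastasis"],
    ["abscess", "glioblastoma", "metastasis"],
    ["multiple sclerosis", "small vessel disease", "vasculitis"],
]
truths = ["metastasis", "meningioma", "glioblastoma", "multiple sclerosis"]

top1 = top_k_accuracy(predictions, truths, k=1)  # only the first candidate counts
top3 = top_k_accuracy(predictions, truths, k=3)  # any of the three candidates counts

# Arithmetic on the figures reported in the summary.
accuracy_gain_points = round(97.6 - 87.1, 1)   # percentage points
time_saved_seconds = 61 - 53                   # seconds per case
time_saved_percent = round(100 * time_saved_seconds / 61, 1)

print(top1, top3, accuracy_gain_points, time_saved_seconds, time_saved_percent)
```

Under this rule, top-three accuracy can never be lower than single-diagnosis accuracy on the same cases, so part of the reported gain reflects the looser criterion rather than better reasoning alone. The reported reading-time change works out to 8 seconds per case, or about 13%.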
Interpretation:

The study indicates that advanced LLMs such as DeepSeek-R1 can support automated generation of diagnostic impressions in brain MRI reporting, improving both accuracy and efficiency.

Limitations:
  • The findings are based on a specific dataset and may not generalize to all clinical settings.
  • LLM performance may vary with prompting strategy and input type, so further research is needed.
Conclusion:

With optimized prompting and input strategies, LLMs can be a valuable tool for drafting brain MRI reports, potentially improving radiology workflow efficiency and patient care.
