Evaluation of large language models for diagnostic impression generation from brain MRI report findings: a multicenter benchmark and reader study - Takeaways - MDSpire

  • By Ming-Liang Wang, Rui-Peng Zhang, Wen-Juan Wu, Yu Lu, Xiao-Er Wei, Zheng Sun, Bao-Hui Guan, Jun-Jie Zhang, Xue Wu, Lei Zhang, Tian-Le Wang, Yue-Hua Li

  • January 22, 2026

  1. The study evaluated 10 large language models for generating diagnostic impressions from the findings of 4293 brain MRI reports across 15 disease categories.

  2. DeepSeek-R1 outperformed the other models, achieving the highest accuracy when given structured report findings and clinical information.

  3. A top-three differential-diagnosis prompting strategy yielded 97.6% patient-level accuracy, significantly higher than single-diagnosis prompting.

  4. In the reader study, DeepSeek-R1 assistance improved diagnostic accuracy and reduced reading time, especially for junior radiologists.

  5. The findings suggest that advanced LLMs such as DeepSeek-R1 can improve workflow efficiency in radiology by supporting MRI report drafting.
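
The gap between top-three and single-diagnosis accuracy in the takeaways above can be illustrated with a minimal sketch. It assumes a case counts as correct when the reference diagnosis appears anywhere among the first k ranked diagnoses; the study's exact scoring rule is not given here. The function name `top_k_accuracy` and all case data are hypothetical and invented for illustration; none of it comes from the study.

```python
from typing import Sequence


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose reference diagnosis is among the first k predictions.

    Assumption: exact string match against a ranked list. A real evaluation
    would need synonym handling and expert adjudication.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one case is required")
    hits = sum(ref in ranked[:k] for ranked, ref in zip(predictions, references))
    return hits / len(references)


# Toy, made-up cases (NOT study data): each inner list is a ranked differential.
toy_predictions = [
    ["glioma", "metastasis", "abscess"],
    ["acute infarct", "demyelination", "glioma"],
    ["meningioma", "schwannoma", "metastasis"],
    ["demyelination", "small vessel disease", "acute infarct"],
]
toy_references = ["glioma", "demyelination", "schwannoma", "encephalitis"]

top1 = top_k_accuracy(toy_predictions, toy_references, k=1)
top3 = top_k_accuracy(toy_predictions, toy_references, k=3)
print(f"top-1 accuracy: {top1:.2f}, top-3 accuracy: {top3:.2f}")
```

Under this scoring assumption, top-three accuracy can never be lower than single-diagnosis accuracy on the same cases, since every top-one hit is also a top-three hit. The two figures therefore answer different questions and are not directly interchangeable.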
