Evaluation of large language models for diagnostic impression generation from brain MRI report findings: a multicenter benchmark and reader study - Report - MDSpire

By Ming-Liang Wang, Rui-Peng Zhang, Wen-Juan Wu, Yu Lu, Xiao-Er Wei, Zheng Sun, Bao-Hui Guan, Jun-Jie Zhang, Xue Wu, Lei Zhang, Tian-Le Wang, Yue-Hua Li

January 22, 2026

Assessment of Large Language Models for Diagnostic Impressions from Brain MRI Reports

Overview

This multicenter study evaluated 10 large language models (LLMs) on their ability to generate diagnostic impressions from the findings sections of brain MRI reports. DeepSeek-R1 performed best, particularly when given structured findings and clinical information, and its assistance improved radiologists' diagnostic accuracy and shortened their reading time.

Background

Deriving accurate radiological diagnoses from brain MRI reports is complex and requires specialized expertise. Automated generation of diagnostic impressions could support radiologists by improving workflow and reducing errors. Large language models have shown promise in medical text interpretation but require rigorous benchmarking in brain MRI contexts. This study assesses multiple LLMs across diverse clinical scenarios and evaluates their integration with radiologist workflows.

Data Highlights

Metric                                                      Value
Dataset size                                                4293 reports, 9973 diagnostic labels
Brain disease categories                                    15
Patient-level accuracy, top-three differential prompting    97.6%
Patient-level accuracy, single-diagnosis prompting          87.1%
Radiologist AUPRC without assistance                        0.774
Radiologist AUPRC with DeepSeek-R1 assistance               0.893
Reading time without assistance                             61 seconds
Reading time with DeepSeek-R1 assistance                    53 seconds
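
To make the size of each reported difference explicit, the short sketch below derives the gaps directly from the figures in the table above. The variable names are illustrative only, and the derived values are plain arithmetic on the reported numbers, not additional results from the study.

```python
# Figures copied from the Data Highlights table; nothing here is new data.
top3_accuracy_pct = 97.6     # patient-level accuracy, top-three differential prompting
single_accuracy_pct = 87.1   # patient-level accuracy, single-diagnosis prompting
auprc_unassisted = 0.774     # radiologists reading without assistance
auprc_assisted = 0.893       # radiologists reading with DeepSeek-R1 assistance
time_unassisted_s = 61       # reading time in seconds, without assistance
time_assisted_s = 53         # reading time in seconds, with assistance

# Absolute differences between the paired conditions.
accuracy_gain_points = round(top3_accuracy_pct - single_accuracy_pct, 1)
auprc_gain = round(auprc_assisted - auprc_unassisted, 3)
time_saved_s = time_unassisted_s - time_assisted_s

# Reading-time reduction relative to the unassisted baseline.
time_saved_pct = round(100 * time_saved_s / time_unassisted_s, 1)

print(f"Accuracy gain: {accuracy_gain_points} percentage points")
print(f"AUPRC gain: {auprc_gain}")
print(f"Reading time saved: {time_saved_s} s ({time_saved_pct}%)")
```

In other words, top-three prompting adds 10.5 percentage points of patient-level accuracy, assistance raises radiologist AUPRC by 0.119, and reading time falls by 8 seconds, roughly 13% of the unassisted time.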

Key Findings

  • DeepSeek-R1 outperformed nine other LLMs in generating diagnostic impressions from brain MRI reports across multiple centers.
  • Supplying structured report findings together with clinical information improved the models' diagnostic accuracy.
  • A top-three differential-diagnosis prompting strategy raised patient-level accuracy to 97.6%, compared with 87.1% for single-diagnosis prompting.
  • Radiologists assisted by DeepSeek-R1 showed a significant improvement in diagnostic accuracy (AUPRC rose from 0.774 to 0.893) and shorter reading times (53 versus 61 seconds).
  • Junior radiologists benefited more markedly from DeepSeek-R1 assistance, indicating potential to support less experienced clinicians.
  • The study provides publicly available source code and model weights to facilitate further research and deployment.
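
The gap between top-three and single-diagnosis accuracy in the findings above follows from how a correct answer is counted. The sketch below shows one common way to score it, under the assumption that a patient counts as correct when any reference diagnosis appears among the model's first k ranked diagnoses. This summary does not give the study's exact scoring rule, so the functions `top_k_hit` and `patient_level_accuracy` and the toy cases are hypothetical illustrations.

```python
from typing import List, Set, Tuple

# One case = (model's ranked diagnoses, set of reference diagnoses).
Case = Tuple[List[str], Set[str]]


def top_k_hit(predicted: List[str], reference: Set[str], k: int) -> bool:
    """Return True if any reference diagnosis is among the first k predictions."""
    return any(diagnosis in reference for diagnosis in predicted[:k])


def patient_level_accuracy(cases: List[Case], k: int) -> float:
    """Fraction of patients with at least one reference diagnosis in the top k."""
    hits = sum(top_k_hit(predicted, reference, k) for predicted, reference in cases)
    return hits / len(cases)


# Hypothetical toy cases, invented for illustration (not study data).
toy_cases: List[Case] = [
    (["glioma", "metastasis", "abscess"], {"metastasis"}),
    (["infarct", "glioma", "multiple sclerosis"], {"infarct"}),
    (["meningioma", "schwannoma", "metastasis"], {"abscess"}),
    (["multiple sclerosis", "infarct", "vasculitis"], {"vasculitis"}),
]

single_accuracy = patient_level_accuracy(toy_cases, k=1)
top3_accuracy = patient_level_accuracy(toy_cases, k=3)
print(single_accuracy, top3_accuracy)
```

On these toy cases only one of four first-ranked diagnoses is correct, while three of four reference diagnoses appear somewhere in the top three. Top-k accuracy can never be lower than top-1 accuracy on the same predictions, which is worth keeping in mind when comparing the two reported figures.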

Clinical Implications

Advanced large language models like DeepSeek-R1 can effectively support radiologists by generating accurate diagnostic impressions from brain MRI reports, potentially reducing workload and diagnostic errors. Optimized prompting strategies and integration of clinical data are critical for maximizing model performance. Incorporating such AI tools into clinical workflows may enhance efficiency and diagnostic confidence, especially for junior radiologists.
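
As a rough illustration of what combining clinical information with a top-three differential request might look like, the sketch below assembles a prompt from structured findings and clinical details. The function `build_impression_prompt`, its parameters, the wording of the instructions, and the sample inputs are all assumptions made for this example. They are not the prompts used in the study, and no model is called.

```python
from typing import Dict, List


def build_impression_prompt(
    findings: List[str],
    clinical_info: Dict[str, str],
    top_k: int = 3,
) -> str:
    """Assemble a plain-text prompt asking for a ranked differential diagnosis.

    Hypothetical helper: the layout and wording are illustrative only.
    """
    # Clinical details are listed as "key: value" lines.
    clinical_lines = "\n".join(f"- {key}: {value}" for key, value in clinical_info.items())
    # Structured findings are numbered so the model can refer to them.
    finding_lines = "\n".join(f"{i}. {text}" for i, text in enumerate(findings, start=1))
    return (
        "You are assisting a neuroradiologist.\n\n"
        f"Clinical information:\n{clinical_lines}\n\n"
        f"Brain MRI findings:\n{finding_lines}\n\n"
        f"List the top {top_k} differential diagnoses, most likely first, "
        "one per line."
    )


# Invented sample inputs for illustration (not study data).
prompt = build_impression_prompt(
    findings=[
        "Ring-enhancing lesion in the left frontal lobe",
        "Restricted diffusion in the lesion center",
    ],
    clinical_info={"Age": "54", "Presenting symptom": "fever and headache"},
)
print(prompt)
```

Setting `top_k=1` would yield the single-diagnosis variant of the same prompt, which makes it straightforward to compare the two strategies on identical inputs.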

Conclusion

This study demonstrates that state-of-the-art LLMs, particularly DeepSeek-R1, can reliably generate diagnostic impressions from complex brain MRI reports and improve radiologist performance. With appropriate implementation, these models hold promise as valuable adjuncts in neuroradiology practice.
