Evaluation of large language models for diagnostic impression generation from brain MRI report findings: a multicenter benchmark and reader study

By
Ming-Liang Wang
Rui-Peng Zhang
Wen-Juan Wu
Yu Lu
Xiao-Er Wei
Zheng Sun
Bao-Hui Guan
Jun-Jie Zhang
Xue Wu
Lei Zhang
Tian-Le Wang
Yue-Hua Li
January 22, 2026

npj Digital Medicine

Overview

This multicenter study evaluated 10 large language models (LLMs) on generating diagnostic impressions from brain MRI report findings. DeepSeek-R1 performed best, especially when given structured findings and clinical data, and its assistance significantly improved radiologists' diagnostic accuracy and shortened their reading time.

Background

Deriving accurate radiological diagnoses from brain MRI reports is complex and requires specialized expertise. Automated generation of diagnostic impressions could support radiologists by improving workflow and reducing errors. Large language models have shown promise in medical text interpretation but require rigorous benchmarking in brain MRI contexts. This study assesses multiple LLMs across diverse clinical scenarios and evaluates their integration with radiologist workflows.

Data Highlights

Metric	Result	Comparator
Dataset size	4293 reports, 9973 diagnostic labels
Brain disease categories	15
Patient-level accuracy	97.6% (top-three differential prompting)	87.1% (single-diagnosis prompting)
Radiologist AUPRC	0.893 (with DeepSeek-R1 assistance)	0.774 (without assistance)
Reading time (seconds)	53 (with assistance)	61 (without assistance)
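
The comparisons in the table reduce to before/after arithmetic. The sketch below recomputes absolute and relative changes using only the reported figures; the function name and dictionary layout are illustrative and not taken from the study.

```python
def change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change, relative change) going from before to after."""
    absolute = after - before
    return absolute, absolute / before


# (before, after) pairs copied from the reported figures; nothing else is assumed.
reported = {
    "patient-level accuracy (%)": (87.1, 97.6),  # single-diagnosis -> top-three prompting
    "radiologist AUPRC": (0.774, 0.893),         # without -> with DeepSeek-R1 assistance
    "reading time (s)": (61.0, 53.0),            # without -> with assistance
}

changes = {name: change(before, after) for name, (before, after) in reported.items()}

for name, (absolute, relative) in changes.items():
    print(f"{name}: {absolute:+.3f} absolute, {relative:+.1%} relative")
```

This gives a gain of 10.5 percentage points in patient-level accuracy, a gain of 0.119 in AUPRC (about 15% relative), and 8 seconds saved per read (about 13% relative).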

Key Findings

DeepSeek-R1 outperformed nine other LLMs in generating diagnostic impressions from brain MRI reports across multiple centers.
Incorporating structured report findings and clinical information enhanced model diagnostic accuracy.
A top-three differential-diagnosis prompting strategy raised patient-level accuracy to 97.6%, compared with 87.1% for single-diagnosis prompting.
Radiologists assisted by DeepSeek-R1 were significantly more accurate (AUPRC rose from 0.774 to 0.893) and read faster (53 versus 61 seconds).
Junior radiologists benefited more markedly from DeepSeek-R1 assistance, indicating potential to support less experienced clinicians.
The study provides publicly available source code and model weights to facilitate further research and deployment.
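
Top-three prompting can be scored with a top-k match rule. The following is a minimal sketch, assuming a case counts as correct when any reference diagnosis appears among the model's first k ranked diagnoses. The study's exact patient-level scoring rule is not stated in this summary, and the function name and toy labels are hypothetical.

```python
def top_k_accuracy(ranked_predictions, reference_labels, k=3):
    """Fraction of cases where a reference label appears in the top k predictions.

    ASSUMPTION: "correct" means at least one reference diagnosis is among the
    first k ranked predictions; the study may score cases differently.
    """
    if len(ranked_predictions) != len(reference_labels):
        raise ValueError("predictions and references must align case by case")
    if not ranked_predictions:
        raise ValueError("at least one case is required")
    hits = sum(
        1
        for preds, refs in zip(ranked_predictions, reference_labels)
        if set(preds[:k]) & set(refs)
    )
    return hits / len(ranked_predictions)


# Toy cases invented for illustration only (not study data).
predictions = [
    ["glioma", "metastasis", "abscess"],
    ["multiple sclerosis", "small vessel disease", "vasculitis"],
    ["meningioma", "schwannoma", "metastasis"],
    ["infarct", "encephalitis", "glioma"],
]
references = [
    {"glioma"},
    {"small vessel disease"},
    {"metastasis"},
    {"abscess"},
]

top1 = top_k_accuracy(predictions, references, k=1)
top3 = top_k_accuracy(predictions, references, k=3)
print(top1, top3)
```

Under this rule a top-three score can never be lower than the single-diagnosis score on the same cases, so the two figures measure different bars rather than the same task done better.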

Clinical Implications

Advanced large language models like DeepSeek-R1 can effectively support radiologists by generating accurate diagnostic impressions from brain MRI reports, potentially reducing workload and diagnostic errors. Optimized prompting strategies and integration of clinical data are critical for maximizing model performance. Incorporating such AI tools into clinical workflows may enhance efficiency and diagnostic confidence, especially for junior radiologists.

Conclusion

This study demonstrates that state-of-the-art LLMs, particularly DeepSeek-R1, can reliably generate diagnostic impressions from complex brain MRI reports and improve radiologist performance. With appropriate implementation, these models hold promise as valuable adjuncts in neuroradiology practice.

