Evaluation of large language models for diagnostic impression generation from brain MRI report findings: a multicenter benchmark and reader study - Scorecard - MDSpire

By Ming-Liang Wang, Rui-Peng Zhang, Wen-Juan Wu, Yu Lu, Xiao-Er Wei, Zheng Sun, Bao-Hui Guan, Jun-Jie Zhang, Xue Wu, Lei Zhang, Tian-Le Wang, Yue-Hua Li

January 22, 2026

Clinical Scorecard: Assessment of Large Language Models for Generating Diagnostic Impressions from Brain MRI Reports: A Multicenter Benchmark Study

At a Glance

Category | Detail
Condition | Brain diseases diagnosed via MRI
Key Mechanisms | Use of large language models (LLMs) to generate diagnostic impressions from brain MRI report findings
Target Population | Patients undergoing brain MRI across multiple medical centers
Care Setting | Radiology departments in tertiary medical centers

Key Highlights

  • DeepSeek-R1 achieved the highest diagnostic performance among the evaluated LLMs across 4293 brain MRI reports covering 15 brain disease categories.
  • A top-three differential-diagnosis prompting strategy raised patient-level accuracy to 97.6%, versus 87.1% for single-diagnosis prompting.
  • DeepSeek-R1 assistance improved radiologists' diagnostic accuracy and reduced reading time, with the greatest benefit for junior radiologists.
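
The gap between single-diagnosis and top-three prompting is a top-k accuracy comparison. The scorecard does not say how the study scored a patient-level match, so the following is only a minimal sketch under one assumption: a case counts as correct when the reference diagnosis appears, by case-insensitive exact string match, among the first k diagnoses the model lists. The function name `top_k_accuracy` and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose reference diagnosis is among the first k predictions.

    Assumption: matching is a case-insensitive exact string comparison.
    A real evaluation would likely need expert adjudication or label mapping.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one case is required")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = 0
    for preds, ref in zip(predictions, references):
        top_k = {p.strip().lower() for p in preds[:k]}
        if ref.strip().lower() in top_k:
            hits += 1
    return hits / len(references)


# Toy, made-up cases for illustration only (not study data).
ranked = [
    ["glioma", "metastasis", "abscess"],
    ["multiple sclerosis", "small vessel disease", "vasculitis"],
    ["meningioma", "schwannoma", "metastasis"],
    ["acute infarct", "encephalitis", "glioma"],
]
truth = ["glioma", "small vessel disease", "meningioma", "encephalitis"]

single = top_k_accuracy(ranked, truth, k=1)  # only the first diagnosis counts
top3 = top_k_accuracy(ranked, truth, k=3)    # any of the first three counts
print(f"top-1: {single:.2f}, top-3: {top3:.2f}")
```

Because the top-three set always contains the first-ranked diagnosis, top-three accuracy under this definition can never be lower than single-diagnosis accuracy on the same outputs.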

Guideline-Based Recommendations

Diagnosis

  • Use advanced large-scale LLMs such as DeepSeek-R1 for automated diagnostic impression generation from brain MRI reports.
  • Incorporate structured report findings and relevant clinical information to optimize model performance.
  • Apply a top-three differential-diagnosis prompting strategy to improve diagnostic accuracy.
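
The scorecard does not reproduce the prompts used in the study, so the sketch below is an illustration only, not the authors' method. It assumes a plain-text prompt that combines clinical information with the report findings and asks for a numbered list of the three most likely diagnoses, plus a small parser for such a list. The template wording and the names `build_prompt` and `parse_differentials` are hypothetical; no model API is called.

```python
import re

# Hypothetical prompt wording; the study's actual prompts are not given here.
PROMPT_TEMPLATE = (
    "You are assisting a radiologist.\n"
    "Clinical information: {clinical}\n"
    "Brain MRI findings: {findings}\n"
    "List the {n} most likely diagnoses, most likely first, "
    "as a numbered list with one diagnosis per line."
)

# Matches lines such as "1. Glioblastoma" or "2) Metastasis".
_NUMBERED_ITEM = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")


def build_prompt(findings: str, clinical_info: str = "", n_diagnoses: int = 3) -> str:
    """Combine report findings and clinical context into one prompt string."""
    if not findings.strip():
        raise ValueError("findings must not be empty")
    if n_diagnoses < 1:
        raise ValueError("n_diagnoses must be at least 1")
    return PROMPT_TEMPLATE.format(
        clinical=clinical_info.strip() or "not provided",
        findings=findings.strip(),
        n=n_diagnoses,
    )


def parse_differentials(response: str, n_diagnoses: int = 3) -> list[str]:
    """Extract up to n_diagnoses numbered items from a model response."""
    items = []
    for line in response.splitlines():
        match = _NUMBERED_ITEM.match(line)
        if match:
            items.append(match.group(1))
    return items[:n_diagnoses]


# Made-up example text, for illustration only.
prompt = build_prompt(
    findings="Ring-enhancing lesion in the left frontal lobe with surrounding edema.",
    clinical_info="58-year-old with new-onset seizures.",
)
example_response = "1. Glioblastoma\n2) Metastasis\nNote: correlate clinically.\n3. Abscess\n4. Lymphoma"
differentials = parse_differentials(example_response)
print(differentials)
```

Setting `n_diagnoses=1` gives the single-diagnosis variant of the same prompt, which keeps the two strategies comparable apart from the number of diagnoses requested.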

Management

  • Integrate LLM assistance into radiology workflows to support report drafting and improve efficiency.
  • Use LLM outputs as supportive tools rather than sole diagnostic sources, maintaining radiologist oversight.

Monitoring & Follow-up

  • Assess diagnostic accuracy and reading time metrics when implementing LLM assistance in clinical practice.
  • Monitor performance differences across radiologist experience levels to tailor support accordingly.
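
One way to track these metrics is to log each reading and summarize accuracy and mean reading time per experience level, with and without LLM assistance. This is a minimal sketch under assumed conventions: the `Reading` record, its field names, the "junior" label, and the toy numbers are all hypothetical and do not come from the study.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Reading:
    experience: str   # hypothetical label, e.g. "junior" or "senior"
    assisted: bool    # True if the LLM draft was available to the reader
    correct: bool     # True if the final impression matched the reference
    seconds: float    # time spent on the case


def summarize(readings: list[Reading]) -> dict[tuple[str, bool], dict[str, float]]:
    """Group readings by (experience, assisted) and compute simple metrics."""
    groups: dict[tuple[str, bool], list[Reading]] = defaultdict(list)
    for r in readings:
        groups[(r.experience, r.assisted)].append(r)
    return {
        key: {
            "n": len(rs),
            "accuracy": mean(1.0 if r.correct else 0.0 for r in rs),
            "mean_seconds": mean(r.seconds for r in rs),
        }
        for key, rs in groups.items()
    }


# Toy, made-up log entries for illustration only (not study data).
log = [
    Reading("junior", assisted=False, correct=False, seconds=120.0),
    Reading("junior", assisted=False, correct=True, seconds=100.0),
    Reading("junior", assisted=True, correct=True, seconds=80.0),
    Reading("junior", assisted=True, correct=True, seconds=60.0),
]

summary = summarize(log)
for (experience, assisted), stats in sorted(summary.items()):
    mode = "assisted" if assisted else "unassisted"
    print(experience, mode, stats)
```

Keeping the assisted and unassisted groups separate for each experience level makes it possible to see whether any benefit is concentrated among less experienced readers.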

Risks

  • Potential overreliance on AI-generated impressions without adequate clinical validation.
  • Variability in model performance depending on input data structure and clinical context.

Patient & Prescribing Data

The study population comprised patients undergoing brain MRI at multiple centers, spanning diverse brain disease categories.

Automated diagnostic impression generation using LLMs can enhance diagnostic accuracy and workflow efficiency, potentially improving patient care through timely and accurate reporting.

Clinical Best Practices

  • Employ structured MRI report findings and relevant clinical data as inputs to LLMs for optimal diagnostic impression generation.
  • Adopt multi-diagnosis prompting strategies to capture differential diagnoses and improve accuracy.
  • Use LLM assistance to reduce radiologist workload and reading time, particularly supporting less experienced radiologists.
  • Maintain radiologist oversight to validate AI-generated impressions and ensure clinical safety.
