Evaluating large language model-generated brain MRI protocols: performance of GPT4o, o3-mini, DeepSeek-R1 and Qwen2.5-72B - Summary - MDSpire

By Su Hwan Kim, Severin Schramm, Lena Schmitzer, Kerem Serguen, Sebastian Ziegelmayer, Felix Busch, Alexander Komenda, Marcus R. Makowski, Lisa C. Adams, Keno K. Bressem, Claus Zimmer, Jan Kirschke, Benedikt Wiestler, Dennis Hedderich, Tom Finck, Jannis Bodden

September 3, 2025

Objective:

To evaluate the ability of large language models (LLMs) to suggest granular, sequence-level brain MRI protocols based on realistic clinical cases, addressing current challenges in MRI protocoling.

Key Findings:
  • LLMs can generate sequence-level brain MRI protocols that align with expert-defined protocols.
  • Inter-rater agreement among the radiologists was measured with Cohen’s kappa to gauge the reliability of the ratings.
  • LLM performance varied depending on whether additional context was supplied, which highlights the importance of contextual information.
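
For readers unfamiliar with Cohen’s kappa, the statistic compares the agreement two raters actually reach with the agreement expected by chance from each rater's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that calculation, not the study's analysis code. The function name, the rating labels, and the example ratings are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement p_o: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement p_e: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label, so kappa is undefined;
        # returning 1.0 here is a convention assumed for this sketch.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four LLM-generated protocols by two radiologists.
radiologist_1 = ["acceptable", "acceptable", "inadequate", "inadequate"]
radiologist_2 = ["acceptable", "inadequate", "inadequate", "inadequate"]
example_kappa = cohens_kappa(radiologist_1, radiologist_2)
```

In this made-up example the raters agree on three of four protocols (p_o = 0.75) while chance agreement is p_e = 0.5, so kappa is 0.5. Values near 1 indicate agreement well beyond chance, and values near 0 indicate agreement no better than chance.
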
Interpretation:

The study suggests that LLMs could assist in protocoling MRI scans, which may reduce radiologist workload and improve efficiency as demand for MRI services increases.

Limitations:
  • The study used fictitious cases, which may not fully represent real-world complexities, potentially limiting the applicability of the findings.
  • Agreement among the study's raters may not reflect the variation found in broader clinical practice, so further validation is needed.
Conclusion:

LLMs show promise in generating MRI protocols, which could enhance clinical workflows and reduce errors in protocoling, but further research is needed to validate these findings in real-world settings.
