Evaluating large language model-generated brain MRI protocols: performance of GPT4o, o3-mini, DeepSeek-R1 and Qwen2.5-72B - Report - MDSpire

Evaluating large language model-generated brain MRI protocols: performance of GPT4o, o3-mini, DeepSeek-R1 and Qwen2.5-72B

  • By

  • Su Hwan Kim

  • Severin Schramm

  • Lena Schmitzer

  • Kerem Serguen

  • Sebastian Ziegelmayer

  • Felix Busch

  • Alexander Komenda

  • Marcus R. Makowski

  • Lisa C. Adams

  • Keno K. Bressem

  • Claus Zimmer

  • Jan Kirschke

  • Benedikt Wiestler

  • Dennis Hedderich

  • Tom Finck

  • Jannis Bodden

  • September 3, 2025

  • 0 min

Share

Clinical Report: Evaluating LLMs for Brain MRI Protocol Generation

Overview

This study assessed the performance of four large language models (LLMs)—GPT-4o, o3-mini, DeepSeek-R1, and Qwen2.5-72B—in generating detailed brain MRI protocols from realistic clinical case descriptions. The models were evaluated against reference protocols established by experienced neuroradiologists, with and without enhanced contextual information, and compared to protocols generated by radiology residents.

Background

Brain MRI protocoling is a critical yet time-consuming task requiring radiologists to balance comprehensive imaging with efficiency to avoid repeat examinations and reduce costs. Errors in protocoling are a leading cause of callback MRI scans. With increasing MRI demand and radiologist workload, AI tools, including LLMs, have been explored to assist in protocol selection. Prior studies have focused on modality or single sequence suggestions, but granular sequence-level protocol generation based on realistic clinical cases remains underexplored.

Data Highlights

ModelTypeAccessTemperatureQuery Date
GPT-4oClosed-weightOpenAI API0Feb 6, 2025
o3-miniClosed-weightOpenAI APINot supportedFeb 16, 2025
DeepSeek-R1Open-weightFireworks AI0Feb 6, 2025
Qwen2.5-72BOpen-weightFireworks AI0Feb 6, 2025

Key Findings

  • Two board-certified neuroradiologists established reference brain MRI protocols for 150 anonymized, categorized clinical cases, with consensus adjudication for disagreements.
  • LLMs generated brain MRI protocols under two conditions: base (without external info) and enhanced (with local standard protocols and sequence explanations).
  • GPT-4o and o3-mini are closed-weight models accessed via OpenAI API; DeepSeek-R1 and Qwen2.5-72B are open-weight models accessed via Fireworks AI.
  • Structured JSON output mode and deterministic temperature settings were used to ensure consistent and analyzable protocol generation.
  • Radiology residents also generated protocols for comparison, highlighting the potential of LLMs to support or augment human protocoling.

Clinical Implications

LLMs show promise in automating the generation of detailed brain MRI protocols, potentially reducing radiologist workload and minimizing protocol errors that lead to repeat scans. Incorporating local protocol standards and sequence explanations enhances model performance, suggesting that tailored AI integration could improve clinical workflow efficiency. However, human oversight remains essential to ensure clinical appropriateness and safety.

Conclusion

This study demonstrates that state-of-the-art LLMs can generate clinically relevant brain MRI protocols from realistic case descriptions, with enhanced contextual input improving accuracy. These findings support further development and integration of LLM-based tools to assist radiologists in protocoling tasks.

References

  1. Wong et al. 2023 -- AI in Brain MRI Protocol Classification
  2. Suzuki et al. 2024 -- GPT-4 for Brain MRI Sequence Suggestion
  3. OpenAI API Documentation 2024 -- GPT-4o and o3-mini Models
  4. Fireworks AI Platform 2025 -- DeepSeek-R1 and Qwen2.5-72B Access

Original Source(s)

Related Content