Large language models for breast cancer treatment planning: a blinded real-world evaluation of DeepSeek, ChatGPT, and oncologist recommendations - Summary - MDSpire

  • By Ming Li, Yiran Yu, Gang Li, Xiaoli Zhang, Yuting Shi, and Rila Su
  • June 30, 2026

Objective:

To compare two advanced LLMs, DeepSeek V3.1 and ChatGPT-5, with experienced oncologists on the accuracy, stability, and concordance of the breast cancer treatment plans they generate.

Approach:
  • Study Design: Retrospective study using de-identified records from 213 breast cancer patients (Stages I–IV).
  • Evaluation Framework: Multidimensional, assessing accuracy, internal consistency, and clinical concordance.
  • Statistical Analysis: ANOVA and ordinal regression were used to examine the effect of disease stage on AI-human agreement.
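
As a rough illustration of the ANOVA step, the sketch below computes a one-way ANOVA F statistic for ratings from three sources. The summary does not say which ANOVA variant the authors used or what software they ran, so the function name `one_way_anova_f`, the group labels, and the ratings are all hypothetical and synthetic; the ordinal regression step is not shown.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists of numeric ratings, one list per source.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of each group mean around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual ratings around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Synthetic expert ratings, NOT data from the study.
ratings = {
    "model_a": [5, 5, 4, 5, 5],
    "model_b": [5, 4, 4, 5, 4],
    "clinician": [4, 3, 4, 4, 3],
}

f_stat, df_b, df_w = one_way_anova_f(list(ratings.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to the F distribution with the reported degrees of freedom indicates that the mean ratings differ between sources. Because the same patients are rated for every source, a repeated-measures design may be more appropriate in practice; the one-way form is used here only because it is the simplest to show.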
Key Findings:
  • DeepSeek V3.1 achieved the highest expert-rated accuracy scores (4.91 ± 0.36), outperforming ChatGPT-5 (4.65 ± 0.62) and clinicians (3.82 ± 0.63, P < 0.001).
  • AI outputs exhibited high mutual consistency (74.2%).
  • Expert evaluations showed a significant decline in AI-clinician agreement as disease stage advanced (P < 0.001), especially in Stage IV cases.
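
To make the idea of stage-dependent agreement concrete, the sketch below computes the share of cases in which two sources recommend the same plan, grouped by disease stage. It assumes agreement means an exact match of plan labels, which is a simplification; the study relied on expert evaluation, and the function name `agreement_by_stage`, the plan labels, and the cases are hypothetical.

```python
from collections import defaultdict


def agreement_by_stage(cases):
    """Return {stage: fraction of cases where both plans match}.

    `cases` is an iterable of (stage, plan_a, plan_b) tuples.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for stage, plan_a, plan_b in cases:
        totals[stage] += 1
        if plan_a == plan_b:
            matches[stage] += 1
    return {stage: matches[stage] / totals[stage] for stage in totals}


# Synthetic cases, NOT data from the study.
cases = [
    ("I", "surgery+endocrine", "surgery+endocrine"),
    ("I", "surgery+chemo", "surgery+chemo"),
    ("IV", "chemo", "endocrine"),
    ("IV", "targeted", "targeted"),
    ("IV", "chemo", "palliative"),
]

rates = agreement_by_stage(cases)
for stage, rate in sorted(rates.items()):
    print(f"Stage {stage}: {rate:.1%} agreement")
```

Grouping by stage before averaging is what exposes a decline like the one reported for Stage IV; a single pooled percentage would hide it.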
Interpretation:

Limitations:
  • The widening AI-clinician gap in complex late-stage cases highlights the models' limitations in accounting for clinical context and socioeconomic factors.
Conclusion:
