Large language models for breast cancer treatment planning: a blinded real-world evaluation of DeepSeek, ChatGPT, and oncologist recommendations - Report - MDSpire

By Ming Li, Yiran Yu, Gang Li, Xiaoli Zhang, Yuting Shi, Rila Su

June 30, 2026

Clinical Report: Evaluation of Large Language Models in Breast Cancer Treatment Planning

Overview

This study evaluates the accuracy and concordance of two large language models, DeepSeek V3.1 and ChatGPT-5, against oncologist recommendations in breast cancer treatment planning.

Background

Breast cancer remains the most prevalent cancer among women, necessitating effective treatment planning. The integration of large language models (LLMs) in oncology decision support is gaining attention, yet their real-world applicability and alignment with clinical practice require thorough investigation. This study addresses the performance of LLMs in generating treatment recommendations for breast cancer, particularly in complex cases.

Data Highlights

Model         | Accuracy Score (Mean ± SD) | Internal Variance | Clinician Agreement
DeepSeek V3.1 | 4.91 ± 0.36                | Minimal           | 74.2%
ChatGPT-5     | 4.65 ± 0.62                | Higher            | Declined with stage
Clinicians    | 3.82 ± 0.63                | Higher            | Varied by stage

Key Findings

  • DeepSeek V3.1 achieved the highest expert-rated accuracy scores (4.91 ± 0.36).
  • ChatGPT-5 scored lower than DeepSeek V3.1 (4.65 ± 0.62).
  • Clinician recommendations had the lowest accuracy score (3.82 ± 0.63).
  • AI outputs showed high mutual consistency at 74.2%.
  • AI-clinician agreement decreased significantly with advanced disease stages (P < 0.001).
  • In Stage IV cases, clinicians prioritized real-world constraints such as financial toxicity.
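
The report does not describe how its summary statistics were computed. As a rough illustration only, the sketch below shows one way a stage-wise agreement rate and a mean ± SD rating could be tabulated. The records, the treatment labels, and the function names `agreement_by_stage` and `mean_sd` are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy records, NOT data from the study:
# (disease stage, model recommendation, clinician recommendation)
cases = [
    ("I", "surgery", "surgery"),
    ("I", "surgery", "surgery"),
    ("II", "chemo", "chemo"),
    ("II", "chemo", "endocrine"),
    ("IV", "targeted", "palliative"),
    ("IV", "targeted", "targeted"),
    ("IV", "chemo", "palliative"),
]


def agreement_by_stage(records):
    """Return {stage: fraction of cases where both recommendations match}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for stage, model_rec, clinician_rec in records:
        totals[stage] += 1
        hits[stage] += model_rec == clinician_rec  # True counts as 1
    return {stage: hits[stage] / totals[stage] for stage in totals}


def mean_sd(scores):
    """Return (mean, sample standard deviation) of a list of ratings."""
    return mean(scores), stdev(scores)


rates = agreement_by_stage(cases)
for stage, rate in rates.items():
    print(f"Stage {stage}: {rate:.1%} agreement")
```

Grouping by stage before computing the rate is what makes a decline with advancing stage visible; a single pooled percentage would hide it. Testing whether such a decline is statistically significant would need a separate trend test, which this sketch does not attempt.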

Clinical Implications

The findings highlight the limitations of LLMs in complex clinical contexts and in accounting for socioeconomic factors, such as the financial toxicity that clinicians prioritized in Stage IV cases.

Conclusion

Advanced LLMs demonstrate strong performance in generating standardized breast cancer treatment plans.
