Preliminary evaluation of DeepSeek-R1 and GPT-5.3 in selected PET/CT clinical scenarios: patient preparation, report interpretation, and diagnostic reasoning - Summary - MDSpire

By Runze Duan, Jing Pang, Lu Zheng, Ziyu Guo, Tianyue Li, Yanzhu Bian, Yujing Hu

June 11, 2026

Objective:

To evaluate the performance of DeepSeek-R1 in selected PET/CT clinical scenarios (patient preparation, report interpretation, and diagnostic reasoning) and to compare it with GPT-5.3.

Approach:

Key Findings:

  • DeepSeek-R1 achieved 94.9% appropriateness and 100% helpfulness.
  • 91.7% of DeepSeek-R1's responses to follow-up inquiries were rated empathetic.
  • 7.7% of DeepSeek-R1's responses showed substantial inconsistencies, primarily in tumor staging.
  • GPT-5.3 matched DeepSeek-R1 on appropriateness (94.9%) and helpfulness (100%) but was rated empathetic less often (66.7% vs. 91.7%).
  • The two models had similar accuracy for the primary diagnosis (10%) and for the differential diagnosis (60%).
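The figures in the findings read as simple proportions of rated responses. As a minimal sketch of that arithmetic, assuming each response receives one categorical label from a reviewer, the helper below turns a list of labels into a percentage. The function name `rate_percent`, the label strings, and the counts are all hypothetical; they are made up for illustration and are not taken from the study, which does not report its sample sizes in this summary.

```python
def rate_percent(ratings, label):
    """Return the share of `ratings` equal to `label`, as a percentage
    rounded to one decimal place."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    matches = sum(1 for r in ratings if r == label)
    return round(100 * matches / len(ratings), 1)


# Hypothetical reviewer labels (NOT study data): 18 of 20 rated appropriate.
example_ratings = ["appropriate"] * 18 + ["inappropriate"] * 2
example_rate = rate_percent(example_ratings, "appropriate")
print(f"Appropriateness: {example_rate}%")
```

With small denominators, a single response shifts the rate by several percentage points, so percentages like these are easier to interpret alongside the underlying counts.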
Interpretation:

Limitations:

  • DeepSeek-R1 gave substantially inconsistent responses in 7.7% of cases.
  • Only 37% of the references cited by DeepSeek-R1 were fully valid.
  • Both models had limited diagnostic accuracy (10% for the primary diagnosis, 60% for the differential diagnosis).

Conclusion:
