Performance of deepseek-R1 and ChatGPT-5.4 thinking in the medical laboratory professional title examination: accuracy, stability, and comparison with interns

  • By Zhili Niu, Dongling Tang, Juanjuan Chen, Pingan Zhang, Chengliang Zhu

  • June 19, 2026


Objective:

To systematically evaluate the accuracy, reproducibility, and performance of Deepseek-R1 and ChatGPT-5.4 in the Medical Laboratory Junior Professional Title Examination and compare their performance with that of interns.

Approach:
    Key Findings:
    • Neither model showed significant differences in accuracy across the three repeated sessions (p > 0.05).
    • No significant differences in accuracy were observed across question types for either model (p > 0.05).
    • Deepseek-R1 showed no significant differences in accuracy across disciplines (p > 0.05), whereas ChatGPT-5.4 exhibited significant cross-disciplinary differences in Papers I, II, and III (p < 0.05).
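
    The summary does not say which statistical test produced these p-values. As a minimal sketch, assuming a Pearson chi-square test of homogeneity on correct/incorrect counts across the three repeated sessions, the comparison could look like the following. The function name and the counts are hypothetical and are not taken from the study.

```python
import math


def chi_square_sessions(correct, total):
    """Pearson chi-square test of homogeneity for accuracy across sessions.

    correct[i] and total[i] are the number of correct answers and the number
    of questions in session i. Returns (statistic, p_value).

    Assumption: exactly three sessions, so the test has 2 degrees of freedom
    and the chi-square survival function reduces to exp(-x / 2).
    The pooled accuracy must be strictly between 0 and 1.
    """
    if len(correct) != 3 or len(total) != 3:
        raise ValueError("this sketch handles exactly three sessions")

    pooled = sum(correct) / sum(total)  # accuracy under the null hypothesis
    stat = 0.0
    for c, n in zip(correct, total):
        expected_correct = n * pooled
        expected_wrong = n * (1 - pooled)
        stat += (c - expected_correct) ** 2 / expected_correct
        stat += ((n - c) - expected_wrong) ** 2 / expected_wrong

    p_value = math.exp(-stat / 2)  # valid only for 2 degrees of freedom
    return stat, p_value


# Hypothetical counts for illustration only: 100 questions per session.
stat, p = chi_square_sessions(correct=[88, 90, 86], total=[100, 100, 100])
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

    A p-value above 0.05 from such a test would be read the same way as the findings above: no detectable difference in accuracy between the repeated sessions.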
    Interpretation:

    Deepseek-R1 demonstrated greater overall accuracy and consistency than ChatGPT-5.4.

    Limitations:
    • The use of publicly available historical questions limits conclusions about the models' genuine reasoning ability.
    Conclusion:

    Both Deepseek-R1 and ChatGPT-5.4 showed strong performance and reproducibility, with Deepseek-R1 outperforming both the interns and ChatGPT-5.4.
