Performance of deepseek-R1 and ChatGPT-5.4 thinking in the medical laboratory professional title examination: accuracy, stability, and comparison with interns - Report - MDSpire

By Zhili Niu, Dongling Tang, Juanjuan Chen, Pingan Zhang, and Chengliang Zhu

June 19, 2026

Clinical Report: Evaluation of DeepSeek-R1 and ChatGPT-5.4 Performance

Overview

This study evaluates the accuracy and reproducibility of DeepSeek-R1 and ChatGPT-5.4 on the Medical Laboratory Junior Professional Title Examination and compares their performance with that of interns.

Background

The integration of artificial intelligence into medical education is changing how knowledge is delivered and assessed. Evaluating AI models on standardized examinations helps clarify what role they can play in medical training. This study examines how two AI models, DeepSeek-R1 and ChatGPT-5.4, perform on the Medical Laboratory Junior Professional Title Examination.

Data Highlights

Model | Accuracy Comparison | Reproducibility
DeepSeek-R1 | Higher accuracy than ChatGPT-5.4 in Papers I, II, and III | Fleiss' kappa > 0.7
ChatGPT-5.4 | Significant cross-disciplinary differences in Papers I, II, and III | Fleiss' kappa > 0.7
Interns | Performed comparably to AI in Paper I; lower in Papers II, III, and IV | N/A

Key Findings

  • Both models showed good reproducibility with Fleiss' kappa coefficients exceeding 0.7.
  • No significant differences in accuracy were found across question types for either model.
  • DeepSeek-R1 outperformed ChatGPT-5.4 in Papers I, II, and III.
  • Interns performed comparably to the AI models only in Paper I and scored lower in Papers II, III, and IV.
  • DeepSeek-R1 exhibited the highest overall performance across the examination.
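
The reproducibility figure above is a Fleiss' kappa, which measures how far agreement among repeated ratings exceeds what chance alone would produce. The report does not say how many runs were made or how the statistic was computed, so the following is only a minimal sketch under stated assumptions: each repeated run of a model is treated as one rater, each exam question as one item, and the chosen answer option as the category. The function name `fleiss_kappa` and the data in `example_runs` are hypothetical and are not taken from the study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of categorical ratings.

    Assumption for this sketch: every item carries the same number of
    ratings (here, one answer per repeated model run).
    """
    if not ratings:
        raise ValueError("at least one item is required")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two ratings per item are required")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item must have the same number of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Agreement expected by chance from the overall category proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        # Only one category was ever used; kappa is undefined, and this
        # sketch reports 1.0 by convention (an assumption, not a standard).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: 5 questions, each answered in 3 repeated runs.
example_runs = [
    ["A", "A", "A"],
    ["B", "B", "B"],
    ["C", "C", "D"],
    ["D", "D", "D"],
    ["A", "B", "A"],
]
print(round(fleiss_kappa(example_runs), 3))
```

On this made-up data the function returns about 0.634, which would fall below the 0.7 level reported for both models. A kappa above 0.7 therefore means the models gave largely the same answer to the same question across runs, independent of whether that answer was correct.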

Clinical Implications

The findings suggest that AI models such as DeepSeek-R1 and ChatGPT-5.4 could serve as study aids for examination preparation in medical laboratory education, with potential to improve learning outcomes for students.

Conclusion

DeepSeek-R1 and ChatGPT-5.4 both performed strongly on the Medical Laboratory Junior Professional Title Examination, with DeepSeek-R1 showing superior accuracy.
