Performance of deepseek-R1 and ChatGPT-5.4 thinking in the medical laboratory professional title examination: accuracy, stability, and comparison with interns

By
Zhili Niu
Dongling Tang
Juanjuan Chen
Pingan Zhang
Chengliang Zhu
June 19, 2026

Frontiers in Digital Health

Overview

This study evaluates the accuracy and reproducibility of Deepseek-R1 and ChatGPT-5.4 in the Medical Laboratory Junior Professional Title Examination, comparing their performance with that of interns.

Background

The integration of artificial intelligence into medical education is changing how knowledge is delivered and assessed. Evaluating AI models on standardized examinations helps clarify what role they can play in medical training. This study examines how two AI models, Deepseek-R1 and ChatGPT-5.4, perform on the Medical Laboratory Junior Professional Title Examination.

Data Highlights

Group	Accuracy	Reproducibility (Fleiss' kappa)
Deepseek-R1	Higher accuracy than ChatGPT-5.4 in Papers I, II, and III	> 0.7
ChatGPT-5.4	Significant cross-disciplinary differences in Papers I, II, and III	> 0.7
Interns	Comparable to the AI models in Paper I; lower in Papers II, III, and IV	N/A

Key Findings

Both models showed good reproducibility with Fleiss' kappa coefficients exceeding 0.7.
No significant differences in accuracy were found across question types for either model.
Deepseek-R1 outperformed ChatGPT-5.4 in Papers I, II, and III.
Interns performed comparably to AI models only in Paper I.
Deepseek-R1 exhibited the highest overall performance across the examination.
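
For readers unfamiliar with the reproducibility statistic, the following is a minimal sketch of how Fleiss' kappa could be computed over repeated model runs. It assumes each question was answered the same number of times by one model and that each answer is a single option label. The function name `fleiss_kappa`, the input layout, and the sample answers are illustrative assumptions, not details taken from the study.

```python
from collections import Counter


def fleiss_kappa(answers_per_question):
    """Fleiss' kappa for repeated categorical answers.

    answers_per_question: one list per question, holding the option chosen
    in each repeated run (hypothetical layout; every question needs the
    same number of runs, at least two).
    """
    n_items = len(answers_per_question)
    if n_items == 0:
        raise ValueError("need at least one question")
    n_runs = len(answers_per_question[0])
    if n_runs < 2 or any(len(a) != n_runs for a in answers_per_question):
        raise ValueError("every question needs the same number of runs (>= 2)")

    category_totals = Counter()
    agreement_sum = 0.0
    for answers in answers_per_question:
        counts = Counter(answers)
        category_totals.update(counts)
        # Share of ordered run pairs that agree on this question.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_runs * (n_runs - 1))

    observed = agreement_sum / n_items            # mean per-question agreement
    total = n_items * n_runs
    # Agreement expected by chance from the overall share of each option.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        # Only one option was ever chosen; kappa is undefined in that case.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Made-up answers: 4 questions, 3 repeated runs each.
sample = [
    ["A", "A", "A"],
    ["B", "B", "B"],
    ["C", "C", "D"],
    ["D", "D", "D"],
]
print(f"{fleiss_kappa(sample):.3f}")  # 0.774
```

Under the commonly cited Landis and Koch benchmarks, values from 0.61 to 0.80 are described as substantial agreement and values above 0.80 as almost perfect, which gives some context for the reported coefficients above 0.7.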

Clinical Implications

The findings suggest that AI models like Deepseek-R1 and ChatGPT-5.4 can serve as effective tools for examination preparation in medical education. Their performance indicates potential utility in enhancing learning outcomes for medical students.

Conclusion

Deepseek-R1 and ChatGPT-5.4 both performed strongly in the Medical Laboratory Junior Professional Title Examination, with Deepseek-R1 achieving higher accuracy than ChatGPT-5.4 in Papers I, II, and III.

