Performance of deepseek-R1 and ChatGPT-5.4 thinking in the medical laboratory professional title examination: accuracy, stability, and comparison with interns - Scorecard - MDSpire
Clinical Scorecard: Evaluation of DeepSeek-R1 and ChatGPT-5.4 Performance in the Medical Laboratory Junior Professional Title Examination: A Comparison of Accuracy, Consistency, and Intern Results
At a Glance
Category: Detail
Condition: Medical Laboratory Junior Professional Title Examination
Key Mechanisms: Evaluation of AI models' accuracy and reproducibility in examination settings.
Target Population: Final-year medical laboratory interns and AI models.
Care Setting: Medical education and examination preparation.
Key Highlights
DeepSeek-R1 outperformed ChatGPT-5.4 in accuracy across most examination papers.
Both AI models demonstrated strong reproducibility with Fleiss' kappa coefficients exceeding 0.7.
Interns performed comparably to the AI models only on Paper I and scored lower on the other papers.
ChatGPT-5.4 exhibited significant cross-disciplinary differences in performance.
Stable knowledge gaps were identified through analysis of error types.
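The reproducibility highlight above rests on Fleiss' kappa, which measures how consistently the same question receives the same answer across repeated runs after correcting for chance agreement. The following is a minimal sketch of the standard formula, not the study's own analysis code. The function name `fleiss_kappa`, the sample answers, and the choice of three runs per question are all hypothetical and used only for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for repeated answers to the same questions.

    `ratings` is a list of questions; each question is a list holding the
    answer chosen on every run. All questions need the same number of runs.
    """
    n_items = len(ratings)
    n_runs = len(ratings[0])
    if n_runs < 2 or any(len(item) != n_runs for item in ratings):
        raise ValueError("each question needs the same number (>= 2) of runs")

    category_totals = Counter()
    observed_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of ordered run pairs that agree on this question.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed_sum += agreeing_pairs / (n_runs * (n_runs - 1))

    p_observed = observed_sum / n_items
    total = n_items * n_runs
    # Chance agreement from the overall share of each answer option.
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        # Kappa is undefined when only one option is ever chosen;
        # returning 1.0 is a convention of this sketch, not of the study.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical data: 4 questions, 3 repeated runs each.
sample_runs = [
    ["A", "A", "A"],
    ["B", "B", "B"],
    ["C", "C", "D"],
    ["A", "B", "A"],
]
kappa = fleiss_kappa(sample_runs)
print(f"Fleiss' kappa: {kappa:.3f}")
```

Under the commonly cited Landis and Koch bands, values above 0.6 are read as substantial agreement and values above 0.8 as almost perfect, which is one way to interpret the reported coefficients exceeding 0.7.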
Guideline-Based Recommendations
Diagnosis
Management
Monitoring & Follow-up
Risks
Patient & Prescribing Data
Not applicable; study focused on AI models and interns.
AI models may serve as auxiliary tools for examination preparation.
Clinical Best Practices
Utilize AI models for personalized learning support in medical education.
Incorporate AI performance evaluations in the assessment of medical knowledge.