AI Outperforms MDs on Reasoning Tasks

Across six experiments, including a blinded, real-world ER evaluation, an OpenAI large language model outperformed physician baselines on multiple clinical reasoning tasks, though not on key safety endpoints such as cannot-miss diagnoses.

By Kerri Miller
May 4, 2026
7 min read

Conexiant

At a Glance

Category	Detail
Condition
Key Mechanisms	Large language model (LLM) performance comparison against physician baselines in emergency medicine.
Target Population
Care Setting

Key Highlights

o1-preview included the correct diagnosis in its differential for 78% of NEJM CPC cases.
o1-preview also outperformed GPT-4 in identifying the exact or a very close diagnosis (89% vs 73%).
o1 achieved a median score of 89% on management cases, significantly higher than physicians using conventional resources.
In the ER evaluation, o1 identified the exact or a very close diagnosis in 67% of cases at initial triage.
The study notes several limitations: it did not assess nontext inputs, its findings may not generalize to other specialties, and AI models still have difficulty identifying cannot-miss diagnoses.

Guideline-Based Recommendations

Diagnosis

Management

Monitoring & Follow-up

Risks

AI models may perform poorly at identifying cannot-miss diagnoses, so their output should be used with caution.

Patient & Prescribing Data

AI models can assist in generating differential diagnoses but should not replace human judgment; human oversight is crucial.

Clinical Best Practices

Use AI tools as adjuncts to physician expertise in emergency medicine, not as replacements for it.
Continue to evaluate AI performance in real-world clinical settings, especially across other specialties.