AI Outperforms MDs on Reasoning Tasks

Across six experiments, including a blinded, real-world ER evaluation, an OpenAI large language model outperformed physician baselines on multiple clinical reasoning tasks, though not on key safety endpoints such as cannot-miss diagnoses.

By Kerri Miller
May 4, 2026

Conexiant

Objective:

To evaluate the performance of an OpenAI large language model (LLM) in diagnostic and management reasoning compared to physician baselines across multiple measures.

Approach:

Key Findings:

o1-preview included the correct diagnosis in 78% of NEJM CPC cases and listed it first in 52%.
o1-preview achieved a median score of 89% in management cases, outperforming GPT-4 (42%) and physicians using GPT-4 (41%).
In the ER study, o1 identified exact or very close diagnoses in 67% of cases at triage, surpassing attending physicians (55%).

Interpretation:

The findings suggest that LLMs have surpassed most benchmarks of clinical reasoning, indicating significant potential for AI to enhance clinical practice.

Limitations:

The study focused solely on text-based performance, excluding nontext inputs like imaging and physical exams, which are critical in clinical settings.
The ER evaluation was a proof of concept and may not reflect real-world emergency medicine decisions, as it centered on predefined touchpoints.
The findings may not generalize beyond internal and emergency medicine, and some comparisons relied on historical data, which further limits their applicability.

Conclusion:

Further studies, particularly human-computer interaction studies and prospective clinical trials, are needed to explore the impact of AI systems on clinical practice and patient outcomes.