AI Outperforms MDs on Reasoning Tasks - Summary - MDSpire

AI Outperforms MDs on Reasoning Tasks

By Kerri Miller • May 4, 2026 • 7 min

Objective:

To evaluate the performance of an OpenAI large language model (LLM) in diagnostic and management reasoning compared to physician baselines across multiple measures.

Approach:
    Key Findings:
    • o1-preview included the correct diagnosis in 78% of NEJM CPC cases and listed it first in 52%.
    • o1-preview achieved a median score of 89% in management cases, outperforming GPT-4 (42%) and physicians using GPT-4 (41%).
    • In the emergency room (ER) study, o1 identified the exact or a very close diagnosis in 67% of cases at triage, surpassing attending physicians (55%).
    Interpretation:

    The findings suggest that LLMs have surpassed most benchmarks of clinical reasoning, indicating significant potential for AI to enhance clinical practice.

    Limitations:
    • The study focused solely on text-based performance, excluding nontext inputs like imaging and physical exams, which are critical in clinical settings.
    • The ER evaluation was a proof of concept and may not reflect real-world emergency medicine decisions, as it centered on predefined touchpoints.
    • The study covered only internal and emergency medicine, and some comparisons relied on historical data, so the findings may not generalize to other specialties or settings.
    Conclusion:

    Further studies, in particular human-computer interaction studies and prospective clinical trials, are needed to determine how AI systems affect clinical practice and patient outcomes.
