AI Falls Short on Differential Dx - Report - MDSpire

By Kathryn Wighton

April 13, 2026

AI Models Show High Final Diagnosis Accuracy but Poor Differential Diagnosis

Overview

A cross-sectional study evaluating 21 large language models (LLMs) found that while these models achieve high accuracy in final diagnoses, they consistently underperform in generating differential diagnoses. The newly developed PrIME-LLM metric revealed significant variability in reasoning tasks, highlighting a critical limitation in current AI clinical reasoning capabilities.

Background

Large language models are increasingly explored for clinical decision support, but traditional benchmarks focus mainly on final diagnosis accuracy, overlooking the stepwise reasoning process essential in clinical practice. Differential diagnosis involves maintaining and refining multiple possible conditions, a complex task that reflects clinical uncertainty. This study assessed LLMs across the full clinical workflow using standardized vignettes and a novel composite metric, PrIME-LLM, to better capture performance in differential diagnosis, diagnostic testing, management, and other reasoning tasks.

Data Highlights

Model               PrIME-LLM Score   Differential Diagnosis Failure Rate   Final Diagnosis Failure Rate
Grok 4              0.78              >80%                                  <40%
Gemini 1.5 Flash    0.64              >80%                                  <40%

Traditional accuracy measures ranged from approximately 81% to 90% across models. Failure rates, however, exceeded 80% for differential diagnosis, compared with less than 40% for final diagnosis.
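
To make the contrast between the two failure rates concrete, the sketch below tallies per-task failure rates from graded vignette responses. It is a minimal illustration, not the study's scoring method: the `GradedResponse` record, the task labels, and the sample grades are all hypothetical, and a response is assumed to count as a failure whenever it was not graded correct. It does not compute PrIME-LLM, whose formula is not given in this report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedResponse:
    """One graded model answer (hypothetical record, not the study's schema)."""
    task: str      # assumed task label, e.g. "differential" or "final_diagnosis"
    correct: bool  # assumed binary grade: True if the answer was judged correct


def failure_rates(responses):
    """Return {task: fraction of that task's responses graded incorrect}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in responses:
        totals[r.task] += 1
        if not r.correct:
            failures[r.task] += 1
    return {task: failures[task] / totals[task] for task in totals}


# Made-up grades for illustration only; these are NOT study data.
sample = (
    [GradedResponse("differential", False)] * 9
    + [GradedResponse("differential", True)] * 1
    + [GradedResponse("final_diagnosis", False)] * 3
    + [GradedResponse("final_diagnosis", True)] * 7
)

rates = failure_rates(sample)
for task, rate in rates.items():
    print(f"{task}: {rate:.0%} failure")
```

Keeping the tally per task, rather than pooling all questions into one accuracy figure, is what lets a weak differential diagnosis stage stay visible even when the pooled number looks strong.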

Key Findings

  • LLMs achieved high final diagnosis accuracy (81%-90%) but showed poor performance in differential diagnosis, with failure rates exceeding 80%.
  • The PrIME-LLM metric revealed wider performance variability across reasoning tasks than traditional accuracy metrics.
  • Diagnostic testing accuracy was intermediate, outperforming differential diagnosis but lagging behind final diagnosis.
  • Reasoning-optimized models outperformed nonreasoning models overall, with Grok 4 scoring highest on PrIME-LLM.
  • LLMs tended to prematurely converge on a single diagnosis rather than maintaining a differential, limiting their clinical reasoning fidelity.
  • Image-capable multimodal models showed some accuracy gains on image-based questions, but text-only performance remained more consistent.

Clinical Implications

These findings underscore the current limitations of off-the-shelf LLMs in replicating the nuanced clinical reasoning process, particularly in generating and refining differential diagnoses. Clinicians should exercise caution when integrating AI tools into diagnostic workflows, recognizing that these models may provide accurate final answers but lack the reasoning transparency and uncertainty management essential for safe clinical decision-making.

Conclusion

Despite advances and reasoning optimizations, current LLMs fall short in differential diagnosis and comprehensive clinical reasoning, indicating they remain adjunct tools requiring physician oversight rather than autonomous decision-makers.

Related Resources & Content

  1. JAMA Network Open Original Investigation -- AI Falls Short on Differential Dx
  2. Tordjman M, Mei X. Invited Commentary, Icahn School of Medicine at Mount Sinai