AI Falls Short on Differential Dx

New PrIME-LLM benchmark shows strong diagnostic accuracy but persistent gaps in clinical reasoning across 21 large language models

Category	Detail
Condition	Clinical diagnostic reasoning using AI large language models
Key Mechanisms	Evaluation of LLMs across differential diagnosis, diagnostic testing, final diagnosis, management, and clinical reasoning tasks using the PrIME-LLM metric
Target Population	Clinical scenarios represented by standardized vignettes from the MSD Manual
Care Setting	Clinical decision-making environments where AI tools might assist diagnosis and management

LLMs achieved high accuracy on final diagnosis tasks (81%-90%) but performed poorly on differential diagnosis, with failure rates above 80%.
Reasoning-optimized models outperformed nonreasoning models overall, but all models struggled to maintain and refine differential diagnoses.
Multimodal image-capable models showed mixed improvements; text-only performance was more stable.

Current LLMs should not be relied upon for generating comprehensive differential diagnoses due to high failure rates.
Physicians must maintain primary responsibility for diagnostic reasoning and decision-making.

LLMs may assist with management tasks but require careful supervision and validation by clinicians.

Ongoing evaluation of AI tools using metrics that assess the full clinical workflow, including reasoning processes, is essential.

When LLMs converge prematurely on a single diagnosis, alternative diagnoses can be missed.
Variability and hallucinations in LLM outputs pose risks for clinical deployment without oversight.

Simulated patients represented by standardized clinical vignettes

LLMs showed intermediate accuracy on management tasks but have not demonstrated the advanced clinical reasoning needed for safe autonomous use.

Use LLMs as adjunct tools under direct physician supervision rather than as autonomous decision-makers.
Evaluate AI model outputs critically, especially differential diagnoses, to avoid premature diagnostic closure.
Incorporate evaluation frameworks such as PrIME-LLM that assess reasoning across the clinical workflow.
Remain alert to the variability and hallucinations inherent in current LLM architectures.
Prioritize physician judgment and clinical expertise over AI-generated conclusions.