Clinical Scorecard: AI Falls Short on Differential Dx
At a Glance
Condition: Clinical diagnostic reasoning using AI large language models
Key Mechanisms: Evaluation of LLMs across differential diagnosis, diagnostic testing, final diagnosis, management, and clinical reasoning tasks using the PrIME-LLM metric
Target Population: Clinical scenarios represented by standardized vignettes from the MSD Manual
Care Setting: Clinical decision-making environments where AI tools might assist diagnosis and management
Key Highlights
LLMs achieved high accuracy on final diagnosis tasks (81% to 90%) but performed poorly on differential diagnosis, with failure rates above 80%.
Reasoning-optimized models outperformed nonreasoning models overall, but all struggled to maintain and refine differential diagnoses.
Multimodal image-capable models showed mixed improvements; text-only performance was more stable.
Guideline-Based Recommendations
Diagnosis
Current LLMs should not be relied upon to generate comprehensive differential diagnoses because of their high failure rates on this task.
Physicians must maintain primary responsibility for diagnostic reasoning and decision-making.
Management
LLMs may assist with management tasks but require careful supervision and validation by clinicians.
Monitoring & Follow-up
Ongoing evaluation of AI tools using metrics that assess the full clinical workflow, including reasoning processes, is essential.
Risks
Premature convergence on single diagnoses by LLMs can lead to missed alternative diagnoses.
Variability and hallucinations in LLM outputs pose risks for clinical deployment without oversight.
Patient & Prescribing Data
Simulated patients represented by standardized clinical vignettes
LLMs showed intermediate accuracy on management tasks but have not demonstrated the advanced clinical reasoning needed for safe autonomous use.
Clinical Best Practices
Use LLMs as adjunct tools under direct physician supervision rather than as autonomous decision-makers.
Evaluate AI model outputs critically, especially differential diagnoses, to avoid premature diagnostic closure.
Incorporate evaluation frameworks such as PrIME-LLM that assess reasoning across the clinical workflow.
Remain cautious about the variability and hallucinations inherent in current LLM architectures.
Prioritize physician judgment and clinical expertise over AI-generated conclusions.