AI Falls Short on Differential Dx
New PrIME-LLM benchmark shows strong diagnostic accuracy but persistent gaps in clinical reasoning across 21 large language models
By Kathryn Wighton
April 13, 2026

1. AI models produced accurate final diagnoses but struggled significantly with differential diagnosis in clinical scenarios.
2. The study evaluated 21 large language models using a new metric, PrIME-LLM, to assess performance across the clinical workflow.
3. Differential diagnosis tasks had failure rates exceeding 80%, while final diagnosis tasks had failure rates below 40%.
4. Current evaluation methods may overestimate AI models' clinical readiness by focusing on final answers rather than reasoning processes.
5. Despite advancements, off-the-shelf AI models lack the intelligence for safe clinical deployment and should be supervised by physicians.