Objective:
To evaluate the performance of large language models (LLMs) in differential diagnosis and other clinical reasoning tasks using a new composite metric, the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM), which is intended to provide a more comprehensive assessment than traditional benchmarks.
Key Findings:
PrIME-LLM scores ranged from 0.64 to 0.78, with reasoning-optimized models outperforming nonreasoning models, which suggests that reasoning optimization is one route to better clinical performance.
Failure rates exceeded 80% for differential diagnosis but were below 40% for final diagnosis, marking differential diagnosis as a critical area for improvement.
Diagnostic testing performance fell between that of differential diagnosis and final diagnosis, leaving room for targeted improvement.
Final diagnosis tasks were more accurate than both diagnostic testing and differential diagnosis across nearly all models, underscoring the importance of model training in specific domains.
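The per-task failure rates reported above can be tallied in a few lines. The following is a minimal sketch that assumes each model response has already been graded as pass or fail. The grading criteria and the PrIME-LLM formula are not given in this summary, so the function name `failure_rates`, the task labels, and the toy data are all hypothetical and do not reproduce the study's results.

```python
from collections import defaultdict


def failure_rates(graded):
    """Return {task: fraction of failed responses}.

    `graded` is an iterable of (task, passed) pairs, where `passed`
    is a bool produced by some upstream grading step (assumed here).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for task, passed in graded:
        totals[task] += 1
        if not passed:
            failures[task] += 1
    return {task: failures[task] / totals[task] for task in totals}


# Toy data for illustration only; these are NOT results from the study.
toy = [
    ("differential_diagnosis", False),
    ("differential_diagnosis", False),
    ("differential_diagnosis", True),
    ("differential_diagnosis", False),
    ("final_diagnosis", True),
    ("final_diagnosis", True),
    ("final_diagnosis", False),
    ("final_diagnosis", True),
]

rates = failure_rates(toy)
for task, rate in rates.items():
    print(f"{task}: {rate:.0%} failed")
```

Keeping the tally separate per task, rather than pooling all responses into one accuracy figure, is what lets a gap like the one between differential and final diagnosis show up at all.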
Interpretation:
LLMs struggle to maintain and refine differential diagnoses and often converge prematurely on a single answer. This points to significant limitations in handling clinical uncertainty and to a need for stronger reasoning capabilities.
Limitations:
Use of publicly available clinical vignettes that may have appeared in model training data, potentially biasing results.
Exclusion of augmentation tools, such as retrieval systems, that could enhance performance, limiting the applicability of the findings.
Variability in model responses and inherent limitations of current architectures, which may affect the reliability of the results.
Conclusion:
Despite improvements, LLMs have not yet achieved the level of intelligence needed for safe clinical deployment. They should be used as supervised adjuncts, with physicians remaining the primary decision-makers, and they require ongoing evaluation and refinement.