Blinded two-phase evaluation of large language models in complex cardiac surgery: task-specific performance and human-AI collaboration - Report - MDSpire
Clinical Report: Two-Phase Blinded Assessment of Large Language Models in Cardiac Surgery
Overview
This study evaluates five large language models (LLMs) on complex cardiac surgery scenarios and examines human-LLM collaboration. Reasoning-optimized LLMs outperformed the other models, but the study also found significant clinical limitations and collaboration imbalances.
Background
Large language models (LLMs) are increasingly used in healthcare, but their effectiveness in complex surgical decision-making remains largely unexplored. This study addresses that gap for cardiac surgery, a high-stakes setting where nuanced decision-making is essential. Evaluating LLMs in this context is crucial for safe clinical integration and for optimizing human-LLM collaboration.
Data Highlights
Model                    Median Normalized Score
O1                       0.896
O3-mini-high             0.854
DeepSeek-R1              0.792
GPT-4                    0.667
Llama3-OpenBioLLM-70B    0.521
Key Findings
LLM performance varied across scenarios, with O1 achieving the highest median normalized score of 0.896.
Scenario comprehension scored highest among evaluation dimensions at 0.920.
Patient safety and hallucination avoidance scored lowest, at 0.507 and 0.549, respectively.
Second-round ratings showed a decline for four LLMs, with 7.57% of ratings revised from affirmative to negative.
Overacceptance of model reasoning by clinicians was identified as a significant collaboration imbalance.
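The revision figure in the second-round finding is a simple proportion, and a minimal sketch of that arithmetic may help readers who want to reproduce it on their own rating data. The function name, the boolean encoding (True for affirmative, False for negative), and the toy data below are all assumptions made for illustration. The report also does not state whether the denominator is all paired ratings or only the first-round affirmative ones; this sketch assumes all paired ratings.

```python
from typing import Sequence


def affirmative_to_negative_rate(first: Sequence[bool], second: Sequence[bool]) -> float:
    """Share of paired ratings that changed from affirmative (True) to negative (False).

    Assumption: the denominator is the total number of paired ratings.
    """
    if len(first) != len(second):
        raise ValueError("rating rounds must be paired one-to-one")
    if not first:
        raise ValueError("no ratings supplied")
    # Count pairs that were affirmative in round one and negative in round two.
    flipped = sum(1 for a, b in zip(first, second) if a and not b)
    return flipped / len(first)


# Hypothetical toy data, not the study's ratings.
round_one = [True, True, False, True, True, False, True, True]
round_two = [True, False, False, True, True, False, False, True]

rate = affirmative_to_negative_rate(round_one, round_two)
print(f"{rate:.2%}")
```

With the toy data, two of eight paired ratings flip from affirmative to negative. If the denominator should instead be first-round affirmative ratings only, divide by the count of True values in the first round.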
Clinical Implications
The findings highlight the need for careful evaluation of LLM outputs in clinical settings, particularly in complex decision-making scenarios. Clinicians should remain vigilant to avoid overaccepting model-generated reasoning that may appear clinically sound but is incorrect.
Conclusion
While reasoning-optimized LLMs show promise, their current limitations and the identified collaboration imbalances indicate that they are not yet ready for safe use in complex surgical environments.