Two-Phase Blinded Assessment of Large Language Models in Complex Cardiac Surgery: Evaluating Task-Specific Efficacy and Collaboration with Clinicians

By
Marc Leon
Ruibin Feng
Manuel Quiroz Flores
Glenn Pelletier
Daniel Bethencourt
Masafumi Shibata
Hao He
Chawannuch Ruaengsri
May 29, 2026

Frontiers in Digital Health

Overview

This study evaluates five large language models (LLMs) on complex cardiac surgery scenarios and examines how clinicians collaborate with them. Reasoning-optimized LLMs outperformed the others, but the evaluation also found significant clinical limitations and imbalances in human-LLM collaboration.

Background

Large language models (LLMs) are increasingly used in healthcare, but their effectiveness in complex surgical decision-making remains largely unexplored. This study examines how LLMs perform in a high-stakes setting, cardiac surgery, where nuanced decision-making is essential. Evaluating LLMs in this context is a prerequisite for safe clinical integration and for effective human-LLM collaboration.

Data Highlights

Model	Median Normalized Score
O1	0.896
O3-mini-high	0.854
DeepSeek-R1	0.792
GPT-4	0.667
Llama3-OpenBioLLM-70B	0.521
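
To make the metric in the table concrete, the following is a minimal sketch of how a median normalized score could be computed. It assumes that each rating is a raw rubric score divided by the maximum attainable score. This summary does not describe the study's actual normalization, so the function names, the scale, and the sample numbers are all hypothetical.

```python
from statistics import median


def normalized_scores(raw_scores, max_score):
    """Scale raw rubric scores to the 0-1 range (assumed normalization)."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return [score / max_score for score in raw_scores]


def median_normalized_score(raw_scores, max_score):
    """Median of the normalized scores for one model."""
    return median(normalized_scores(raw_scores, max_score))


# Hypothetical raw scores for one model on an assumed 24-point rubric.
example_raw = [18, 21, 22, 24, 20]
example_median = median_normalized_score(example_raw, max_score=24)
```

The median is less sensitive than the mean to a single unusually high or low rating, which may be why it is the reported summary statistic.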

Key Findings

LLM performance varied across scenarios, with O1 achieving the highest median normalized score of 0.896.
Scenario comprehension scored highest among evaluation dimensions at 0.920.
Patient safety and hallucination avoidance scored lowest, at 0.507 and 0.549, respectively.
Second-round ratings showed a decline for four LLMs, with 7.57% of ratings revised from affirmative to negative.
Overacceptance of model reasoning by clinicians was identified as a significant collaboration imbalance.
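
The revision figure above can be illustrated with a short sketch. It assumes that each rating is a yes/no judgment recorded in both rounds and that the rate is taken over all paired ratings. The summary does not state the denominator the study used, so that choice, the function name, and the sample data are hypothetical.

```python
def affirmative_to_negative_rate(first_round, second_round):
    """Fraction of paired ratings that changed from affirmative to negative.

    Both arguments are sequences of booleans in the same order,
    where True means an affirmative rating (assumed encoding).
    """
    if len(first_round) != len(second_round):
        raise ValueError("rounds must contain the same number of ratings")
    if not first_round:
        raise ValueError("at least one rating is required")
    revised = sum(
        1 for before, after in zip(first_round, second_round)
        if before and not after
    )
    return revised / len(first_round)


# Hypothetical ratings: one of four flips from affirmative to negative.
round_one = [True, True, False, True]
round_two = [True, False, False, True]
rate = affirmative_to_negative_rate(round_one, round_two)
```

Dividing by only the first-round affirmative ratings instead would give a higher rate from the same data, so the denominator should be stated whenever such a figure is reported.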

Clinical Implications

The findings highlight the need for careful evaluation of LLM outputs in clinical settings, particularly in complex decision-making scenarios. Clinicians should remain vigilant to avoid overaccepting model-generated reasoning that may appear clinically sound but is incorrect.

Conclusion

While reasoning-optimized LLMs show promise, their current limitations and the identified collaboration imbalances indicate that they are not yet ready for safe use in complex surgical environments.

Related Resources & Content

Original Source(s)

Blinded two-phase evaluation of large language models in complex cardiac surgery: task-specific performance and human-AI collaboration