Two-Phase Blinded Assessment of Large Language Models in Complex Cardiac Surgery: Evaluating Task-Specific Efficacy and Collaboration with Clinicians

By
Marc Leon
Ruibin Feng
Manuel Quiroz Flores
Glenn Pelletier
Daniel Bethencourt
Masafumi Shibata
Hao He
Chawannuch Ruaengsri
May 29, 2026

Frontiers in Digital Health

Objective:

To assess the performance of large language models (LLMs) in complex surgical decision-making and evaluate human–LLM collaboration in cardiac surgery.

Key Findings:

LLM performance varied, with median normalized scores highest for O1 (0.896) and lowest for Llama3-OpenBioLLM-70B (0.521).
Scenario comprehension scored highest (0.920), while patient safety (0.507) and hallucination avoidance (0.549) scored lowest.
Second-round ratings showed a decline for four LLMs, with 7.57% of ratings revised from affirmative to negative.

Interpretation:

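Performance depended on both the model and the task: the LLMs scored well on scenario comprehension but considerably lower on patient safety and hallucination avoidance, and ratings for four of them declined in the second round.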

Limitations:

The study involved a limited number of scenarios and LLMs, which may not represent all clinical contexts.
The evaluation framework may not capture all aspects of clinical decision-making.

Conclusion:

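In this evaluation, LLM performance in complex cardiac surgery scenarios was uneven across models and tasks. The reported results do not, on their own, establish that these models are ready for safe use in surgical decision-making.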

Original Source(s)

Blinded two-phase evaluation of large language models in complex cardiac surgery: task-specific performance and human-AI collaboration
