Blinded two-phase evaluation of large language models in complex cardiac surgery: task-specific performance and human-AI collaboration - Report - MDSpire

By Marc Leon, Ruibin Feng, Manuel Quiroz Flores, Glenn Pelletier, Daniel Bethencourt, Masafumi Shibata, Hao He, Chawannuch Ruaengsri

May 29, 2026

Clinical Report: Two-Phase Blinded Assessment of Large Language Models in Cardiac Surgery

Overview

This study evaluates five large language models (LLMs) on complex cardiac surgery scenarios in a blinded, two-phase design and examines human-LLM collaboration. Reasoning-optimized LLMs outperformed the others, but the models scored lowest on patient safety and hallucination avoidance, and clinicians tended to overaccept model reasoning.

Background

Large language models (LLMs) are increasingly used in healthcare, but their effectiveness in complex surgical decision-making remains largely unexplored. This study examines how LLMs perform in cardiac surgery, a high-stakes setting where nuanced decision-making is essential. Evaluating LLMs in this context is a prerequisite for safe clinical integration and effective human-LLM collaboration.

Data Highlights

Model                    Median Normalized Score
O1                       0.896
O3-mini-high             0.854
DeepSeek-R1              0.792
GPT-4                    0.667
Llama3-OpenBioLLM-70B    0.521
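
To make the spread in the table easier to read, the short Python sketch below ranks the five models and computes the gap between the lowest-scoring reasoning-optimized model and the highest-scoring other model. The scores are the reported medians; the grouping into "reasoning-optimized" models and the helper names `rank_models` and `group_gap` are illustrative assumptions, not part of the study.

```python
# Median normalized scores as reported in the Data Highlights table.
scores = {
    "O1": 0.896,
    "O3-mini-high": 0.854,
    "DeepSeek-R1": 0.792,
    "GPT-4": 0.667,
    "Llama3-OpenBioLLM-70B": 0.521,
}

# Assumption: this grouping is for illustration only and is not a
# classification taken from the study itself.
REASONING_OPTIMIZED = {"O1", "O3-mini-high", "DeepSeek-R1"}


def rank_models(scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def group_gap(scores, group):
    """Gap between the lowest score inside `group` and the highest outside it."""
    inside = [s for m, s in scores.items() if m in group]
    outside = [s for m, s in scores.items() if m not in group]
    return round(min(inside) - max(outside), 3)


ranking = rank_models(scores)
gap = group_gap(scores, REASONING_OPTIMIZED)

for model, score in ranking:
    print(f"{model:<24}{score:.3f}")
print(f"Gap between groups: {gap:.3f}")
```

Under this assumed grouping, every reasoning-optimized model sits above both remaining models, with a gap of 0.125 between DeepSeek-R1 and GPT-4. This is simple arithmetic on reported medians and says nothing about statistical significance.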

Key Findings

  • LLM performance varied across scenarios, with O1 achieving the highest median normalized score of 0.896.
  • Scenario comprehension scored highest among evaluation dimensions at 0.920.
  • Patient safety and hallucination avoidance scored lowest, at 0.507 and 0.549, respectively.
  • In the second rating round, ratings declined for four of the five LLMs, and 7.57% of ratings were revised from affirmative to negative.
  • Overacceptance of model reasoning by clinicians was identified as a significant collaboration imbalance.
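
As a rough illustration of how a revision rate like the 7.57% figure could be computed, the Python sketch below counts paired ratings that move from affirmative to negative between two rounds. The function name, the boolean encoding of ratings, and the choice of all paired ratings as the denominator are assumptions; the study may define the rate differently, and the toy data is hypothetical.

```python
def affirmative_to_negative_rate(first_round, second_round):
    """Share of all paired ratings that moved from affirmative (True)
    in the first round to negative (False) in the second round.

    Assumption: the denominator is every paired rating, not only the
    ratings that were affirmative in the first round.
    """
    if len(first_round) != len(second_round):
        raise ValueError("rating lists must be paired one-to-one")
    if not first_round:
        raise ValueError("no ratings supplied")
    flipped = sum(1 for a, b in zip(first_round, second_round) if a and not b)
    return flipped / len(first_round)


# Hypothetical toy ratings, not study data.
first = [True, True, False, True, True, False, True, True]
second = [True, False, False, True, True, False, True, True]

rate = affirmative_to_negative_rate(first, second)
print(f"Revised from affirmative to negative: {rate:.2%}")
```

In this toy example one of eight paired ratings flips, so the rate is 12.5%. If the rate were instead defined relative to first-round affirmative ratings only, the denominator would change and so would the result.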

Clinical Implications

The findings highlight the need for careful evaluation of LLM outputs in clinical settings, particularly in complex decision-making scenarios. Clinicians should remain vigilant to avoid overaccepting model-generated reasoning that may appear clinically sound but is incorrect.

Conclusion

While reasoning-optimized LLMs show promise, their current limitations and the identified collaboration imbalances indicate that they are not yet ready for safe use in complex surgical environments.

Related Resources & Content

  1. 2025 ESC/EACTS Guidelines for the management of valvular heart disease
  2. STS Launches New Valve Surgery Risk Calculators | STS
  3. WHO releases AI ethics and governance guidance for large multi-modal models
