Blinded two-phase evaluation of large language models in complex cardiac surgery: task-specific performance and human-AI collaboration - Summary - MDSpire

By Marc Leon, Ruibin Feng, Manuel Quiroz Flores, Glenn Pelletier, Daniel Bethencourt, Masafumi Shibata, Hao He, Chawannuch Ruaengsri

May 29, 2026

Objective:

To assess the performance of large language models (LLMs) in complex surgical decision-making and evaluate human–LLM collaboration in cardiac surgery.

Key Findings:
  • Median normalized scores varied across LLMs, from 0.896 for O1 (highest) to 0.521 for Llama3-OpenBioLLM-70B (lowest).
  • Scenario comprehension scored highest (0.920), while patient safety (0.507) and hallucination avoidance (0.549) scored lowest.
  • Second-round ratings showed a decline for four LLMs, with 7.57% of ratings revised from affirmative to negative.

Limitations:
  • The study involved a limited number of scenarios and LLMs, which may not represent all clinical contexts.
  • The evaluation framework may not capture all aspects of clinical decision-making.
