Blinded two-phase evaluation of large language models in complex cardiac surgery: task-specific performance and human-AI collaboration

By Marc Leon, Ruibin Feng, Manuel Quiroz Flores, Glenn Pelletier, Daniel Bethencourt, Masafumi Shibata, Hao He, Chawannuch Ruaengsri

May 29, 2026

  1. A two-phase evaluation framework was developed to assess large language models (LLMs) and human-LLM collaboration in complex cardiac surgery.

  2. Fifteen high-fidelity cardiac surgery scenarios were created by senior surgeons, each paired with a reasoning task and expert-curated reference answers.

  3. LLM performance varied, with median normalized scores highest for O1 (0.896) and lowest for Llama3-OpenBioLLM-70B (0.521) across scenarios.

  4. Second-round evaluations showed a decline in scores for four LLMs, with a notable percentage of ratings revised from affirmative to negative.

  5. All models exhibited clinical limitations, particularly in complex reasoning tasks, indicating they are not yet ready for safe use in surgical settings.
