Blinded two-phase evaluation of large language models in complex cardiac surgery: task-specific performance and human-AI collaboration

By Marc Leon, Ruibin Feng, Manuel Quiroz Flores, Glenn Pelletier, Daniel Bethencourt, Masafumi Shibata, Hao He, Chawannuch Ruaengsri

May 29, 2026

Clinical Scorecard: Two-Phase Blinded Assessment of Large Language Models in Complex Cardiac Surgery: Evaluating Task-Specific Efficacy and Collaboration with Clinicians

At a Glance

Category: Detail
Condition:
Key Mechanisms: Evaluation of large language models (LLMs) in surgical decision-making and human-LLM collaboration.
Target Population:
Care Setting:

Key Highlights

  • LLM performance varied widely by model: O1 scored highest (0.896) and Llama3-OpenBioLLM-70B scored lowest (0.521).
  • Across evaluation criteria, scenario comprehension scored highest (0.920), while patient safety scored lowest (0.507).
  • In the second round of ratings, scores declined for four of the LLMs.

Guideline-Based Recommendations

Diagnosis

  • Rigorous validation of LLMs is necessary for safe clinical integration.

Management

  • Evaluate LLM outputs critically.

Monitoring & Follow-up

  • Assess human-LLM collaboration.

Risks

  • Potential for clinicians to over-accept model reasoning.

Patient & Prescribing Data

LLMs are not yet ready for use in complex surgical settings.

Clinical Best Practices

  • Implement a two-phase evaluation framework for assessing LLM performance.
  • Ensure clinical scenarios are high-fidelity.
  • Utilize a weighted evaluation framework.
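
As a rough illustration of what a weighted evaluation framework could look like in practice, the sketch below combines per-criterion ratings into a single score using a weighted mean. This is a minimal sketch under stated assumptions, not the study's actual scoring method: the function name, the weights, and the example ratings are all hypothetical, and only the two criterion names (scenario comprehension and patient safety) come from the highlights above.

```python
from typing import Mapping


def weighted_score(ratings: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return the weighted mean of per-criterion ratings.

    Hypothetical helper: the source does not describe how its weighted
    framework combines criteria, so a plain weighted mean is assumed here.
    """
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the total weight keeps the result on the same scale as the ratings.
    return sum(ratings[name] * weights[name] for name in ratings) / total_weight


# Assumed weights: patient safety counts twice as much as scenario comprehension.
example_weights = {"scenario_comprehension": 1.0, "patient_safety": 2.0}

# Placeholder ratings for one model response, not values from the study.
example_ratings = {"scenario_comprehension": 0.9, "patient_safety": 0.6}

overall = weighted_score(example_ratings, example_weights)
print(f"weighted score: {overall:.3f}")
```

Normalizing by the total weight means the weights express only relative importance, so a safety-critical criterion can be emphasized without changing the range of the final score.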
