Two-Phase Blinded Assessment of Large Language Models in Complex Cardiac Surgery: Evaluating Task-Specific Efficacy and Collaboration with Clinicians

By Marc Leon, Ruibin Feng, Manuel Quiroz Flores, Glenn Pelletier, Daniel Bethencourt, Masafumi Shibata, Hao He, Chawannuch Ruaengsri
May 29, 2026

Frontiers in Digital Health

Clinical Scorecard: Two-Phase Blinded Assessment of Large Language Models in Complex Cardiac Surgery: Evaluating Task-Specific Efficacy and Collaboration with Clinicians

At a Glance

Category	Detail
Condition
Key Mechanisms	Evaluation of large language models (LLMs) in surgical decision-making and human-LLM collaboration.
Target Population
Care Setting

Key Highlights

Performance varied across LLMs: O1 scored highest (0.896) and Llama3-OpenBioLLM-70B scored lowest (0.521).
Scenario comprehension was the highest-scoring area (0.920), while patient safety was the lowest (0.507).
Ratings declined in the second round for four LLMs.

Guideline-Based Recommendations

Diagnosis

Rigorous validation of LLMs is necessary for safe clinical integration.

Management

Evaluate LLM outputs critically.

Monitoring & Follow-up

Assess human-LLM collaboration.

Risks

Potential for clinicians to over-accept model reasoning.

Patient & Prescribing Data

LLMs are not yet ready for use in complex surgical settings.

Clinical Best Practices

Implement a two-phase evaluation framework for assessing LLM performance.
Ensure clinical scenarios are high-fidelity.
Utilize a weighted evaluation framework.
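
As a minimal sketch of what a weighted evaluation framework can look like, the snippet below combines per-criterion ratings into one overall score using a weighted mean. The criterion names echo the areas mentioned in the highlights above, but the weights, the example ratings, and the function name `weighted_score` are hypothetical and are not taken from the study.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted mean of per-criterion scores (each on a 0-1 scale)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[c] * weights[c] for c in scores) / total_weight


# Hypothetical weights and ratings, for illustration only (not study data).
# Patient safety is weighted more heavily than scenario comprehension here.
weights = {"scenario_comprehension": 1.0, "patient_safety": 3.0}
scores = {"scenario_comprehension": 0.9, "patient_safety": 0.5}

overall = weighted_score(scores, weights)
print(f"overall weighted score: {overall:.3f}")
```

Weighting safety-critical criteria more heavily means a model that reads scenarios well but gives unsafe recommendations cannot reach a high overall score on comprehension alone.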

Original Source(s)

Frontiers in Digital Health

Blinded two-phase evaluation of large language models in complex cardiac surgery: task-specific performance and human-AI collaboration

by Marc Leon, Ruibin Feng, Manuel Quiroz Flores, Glenn Pelletier, Daniel Bethencourt, Masafumi Shibata, Hao He, Chawannuch Ruaengsri
May 29, 2026