Performance of large language models in delivering accurate and comprehensible patient information on heart failure and cardiomyopathy - Summary - MDSpire

By Christoph Reich, Jule Leverenz, Charlotte Brand, Lasse Niemeier, Isabel Branzei, Mustafa Yildirim, Farbod Sedaghat-Hamedani, Ali Amr, Norbert Frey, Benjamin Meder

June 9, 2026

Objective:

To benchmark the clinical performance and readability of six leading LLMs in generating responses to patient-oriented questions about heart failure and cardiomyopathies.

Key Findings:
  • Gemini provided the most readable responses but was among the most verbose.
  • Gemini received the highest composite mean rating (4.41 ± 0.77), excelling in completeness and factual reliability.
  • Confabulation avoidance scored consistently high across all models (4.49 ± 0.02), indicating strong factual accuracy.
  • Conciseness scored the lowest among the evaluated domains (3.81 ± 0.05).
  • Auto-graders gave the highest average ratings, followed by students and then experts.

Interpretation:

All LLMs demonstrated good accuracy in avoiding medical misinformation, though variability exists in readability and comprehensiveness.

Limitations:
  • Variability in readability and comprehensiveness among LLMs.
  • Presence of occasional major factual errors or hallucinations.

Conclusion:

The study presents findings on the performance of LLMs for patient-facing applications in cardiovascular health.
