Performance of large language models in delivering accurate and comprehensible patient information on heart failure and cardiomyopathy - Summary - MDSpire
This study benchmarked the clinical performance and readability of six leading LLMs in generating responses to patient-oriented questions about heart failure and cardiomyopathies.
Key Findings:
Gemini provided the most readable responses but was among the most verbose.
Gemini received the highest composite mean rating (4.41 ± 0.77), excelling in completeness and factual reliability.
Confabulation avoidance scored consistently high across all models (4.49 ± 0.02), indicating strong factual accuracy.
Conciseness scored the lowest among the evaluated domains (3.81 ± 0.05).
Auto-graders rated the models highest on average, followed by students and then experts.
Interpretation:
All LLMs avoided medical misinformation well, though readability and comprehensiveness varied between models.
Limitations:
Variability in readability and comprehensiveness among LLMs.
Presence of occasional major factual errors or hallucinations.
Conclusion:
The study benchmarks six LLMs for patient-facing applications in cardiovascular health: all scored well on avoiding confabulation, while readability, comprehensiveness and conciseness varied between models.
by Christoph Reich, Jule Leverenz, Charlotte Brand, Lasse Niemeier, Isabel Branzei, Mustafa Yildirim, Farbod Sedaghat-Hamedani, Ali Amr, Norbert Frey, Benjamin Meder
Despite major advances in guideline-directed medical therapy (GDMT), worsening heart failure continues to drive significant morbidity, repeat hospitalizations and healthcare utilization worldwide.