Evaluating the accuracy and communication quality of large language models in Ewing sarcoma: a comparative analysis of ChatGPT, Claude, Gemini, DeepSeek, and Grok - Report - MDSpire
Clinical Report: Assessing the Precision and Communication Effectiveness of LLMs
Overview
This study evaluates the performance of five large language models (LLMs) in providing information about Ewing sarcoma.
Background
Ewing sarcoma is a rare and aggressive pediatric cancer that requires complex management by multidisciplinary teams. Accurate communication is critical, as families seek reliable information about diagnosis, treatment options, and prognosis. As LLMs come to be used as a source of medical information, their effectiveness in delivering quality patient education needs to be assessed.
Data Highlights
Model | Overall Performance | Technical Accuracy | Communication Quality
ChatGPT | Highest | Moderate | Best
Claude | Second | Moderate | Good
DeepSeek | Third | Highest | Lower
Gemini | Lower | Low | Low
Grok | Lowest | Low | Low
Key Findings
ChatGPT achieved the highest overall performance among the LLMs evaluated.
DeepSeek demonstrated the greatest technical accuracy but lower communication quality.
Gemini and Grok produced more superficial responses with lower overall scores.
Significant differences in performance were observed among the five LLMs (p < 0.001).
Clinical Implications
Current LLMs can support patient education but should not replace specialist consultation.
Conclusion
This study emphasizes the need for careful evaluation and validation of LLMs before their routine use in clinical practice.