Evaluating the accuracy and communication quality of large language models in Ewing sarcoma: a comparative analysis of ChatGPT, Claude, Gemini, DeepSeek, and Grok - Summary - MDSpire
Objective:
To compare the clinical accuracy, comprehensiveness, and communication quality of five widely used large language models (LLMs) in answering frequently asked questions about Ewing sarcoma.
Approach:
Evaluation Method: Twelve representative questions were posed to each of the five LLMs, and two orthopedic oncology specialists rated the responses on a 4-point Likert scale for clinical accuracy, completeness, clarity, and relevance.
Statistical Analysis: Friedman, Wilcoxon signed-rank, Kruskal–Wallis, and Mann–Whitney U tests were used.
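For readers unfamiliar with the Friedman test, the sketch below shows how it might be applied when the same questions are put to several models. It is an illustration only, not the authors' analysis code: the summary does not say how the ratings were aggregated or which software was used. The function names (`average_ranks`, `friedman_statistic`, `chi2_sf_even_df`) and every score in `hypothetical_scores` are invented for the example. The statistic includes the usual correction for ties, which matter with Likert ratings; the p-value helper uses the closed-form chi-square tail that holds only for an even number of degrees of freedom (five models give four).

```python
import math


def average_ranks(values):
    """Rank values ascending (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for idx in range(pos, end + 1):
            ranks[order[idx]] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(table):
    """Tie-corrected Friedman chi-square; rows = questions, columns = models."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    tie_term = 0.0
    for row in table:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
        for value in set(row):
            t = row.count(value)  # size of each tie group in this row
            tie_term += t ** 3 - t
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("all scores are tied within every row")
    raw = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    return raw / correction


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this helper supports positive even df only")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# HYPOTHETICAL scores, invented for illustration (not data from the study).
# Columns: ChatGPT, Claude, Gemini, DeepSeek, Grok; one row per question.
hypothetical_scores = [
    [4, 3, 2, 3, 2],
    [4, 4, 2, 3, 1],
    [3, 3, 2, 4, 2],
    [4, 3, 3, 3, 2],
    [4, 3, 2, 4, 2],
    [3, 4, 2, 3, 1],
]

q = friedman_statistic(hypothetical_scores)
p = chi2_sf_even_df(q, df=len(hypothetical_scores[0]) - 1)
print(f"Friedman chi-square = {q:.3f}, p = {p:.4f}")
```

In practice, an established routine such as `scipy.stats.friedmanchisquare` would normally be preferred. A significant overall result would typically be followed by pairwise Wilcoxon signed-rank comparisons between models.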
Key Findings:
Significant differences were observed among the five LLMs (p < 0.001).
ChatGPT achieved the highest overall performance, followed by Claude and DeepSeek.
DeepSeek demonstrated the greatest technical accuracy but lower communication quality.
ChatGPT provided the best balance between factual correctness and patient-friendly communication.
Gemini and Grok produced more superficial responses with lower overall scores.
Interpretation:
Variability among models remains substantial, necessitating further validation and disease-specific optimization.
Limitations:
The findings are based on only twelve questions and five LLMs.
Responses were evaluated by only two specialists, which may affect the reliability of the assessments.