Evaluating the accuracy and communication quality of large language models in Ewing sarcoma: a comparative analysis of ChatGPT, Claude, Gemini, DeepSeek, and Grok - Summary - MDSpire

By Cihan Ünyılmaz, June 30, 2026

Objective:

To compare the clinical accuracy, comprehensiveness, and communication quality of five widely used large language models (LLMs) in answering frequently asked questions about Ewing sarcoma.

Approach:
  • Evaluation Method: Twelve representative questions were posed to each of the five LLMs (ChatGPT, Claude, Gemini, DeepSeek, and Grok). Two orthopedic oncology specialists rated every response on a 4-point Likert scale for clinical accuracy, completeness, clarity, and relevance.
  • Statistical Analysis: Scores were compared using Friedman, Wilcoxon signed-rank, Kruskal–Wallis, and Mann–Whitney U tests.
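
To make the statistical step more concrete, the following is a minimal sketch of how a Friedman test statistic could be computed when the same questions are scored for several models. It is not the study's actual analysis code. The function names (`average_ranks`, `friedman_statistic`) and the toy scores are hypothetical, chosen only for illustration. The sketch assumes each row holds one question's scores across all models and applies the usual tie correction, which matters for coarse Likert data.

```python
def average_ranks(values):
    """Rank values from 1..k; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """Tie-corrected Friedman chi-square statistic.

    scores[q][m] is the score of model m on question q.
    Undefined (division by zero) if every row is fully tied.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in scores:
        for m, r in enumerate(average_ranks(row)):
            rank_sums[m] += r
        for v in set(row):
            t = row.count(v)
            tie_term += t * (t * t - 1)
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    return stat / (1 - tie_term / (n * k * (k * k - 1)))


if __name__ == "__main__":
    # Hypothetical toy scores (NOT data from the study): 4 questions x 5 models.
    toy = [
        [4, 3, 2, 3, 2],
        [4, 4, 2, 3, 1],
        [3, 3, 2, 4, 2],
        [4, 3, 1, 3, 2],
    ]
    print(friedman_statistic(toy))
```

Under the null hypothesis the statistic is compared against a chi-square distribution with k − 1 degrees of freedom (four for five models). A significant result would typically be followed by pairwise Wilcoxon signed-rank tests, as listed above.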
Key Findings:
  • Significant differences were observed among the five LLMs (p < 0.001).
  • ChatGPT achieved the highest overall performance, followed by Claude and DeepSeek.
  • DeepSeek demonstrated the greatest technical accuracy but lower communication quality.
  • ChatGPT provided the best balance between factual correctness and patient-friendly communication.
  • Gemini and Grok produced more superficial responses with lower overall scores.
Interpretation:

Performance still varies substantially among models, so further validation and disease-specific optimization are needed.

Limitations:
  • The findings rest on a small sample: twelve questions and five LLMs.
  • Responses were evaluated by only two specialists, which may affect the reliability of the assessments.