Clinical evaluation of large language model recommendations in melanoma: comparison with multidisciplinary tumor board decisions in a real-world cohort - Summary - MDSpire
Objective:
To evaluate the performance of four large language models (LLMs) in generating melanoma treatment recommendations, compared with real-world decisions made by a multidisciplinary tumor board (MDT).
Approach:
Study Design: Retrospective single-center study involving 151 patients with newly diagnosed cutaneous melanoma discussed at the MDT.
LLM Evaluation: Recommendations from four LLMs (ChatGPT-4o, ChatGPT-5 Thinking, Gemini 2.5 Pro, DeepSeek-V3.2) were compared against the actual MDT decisions and rated by four board-certified oncologists.
Rating Domains: LLM-generated recommendations were rated on clarity, clinical applicability, coverage, explanation and support with evidence, and guideline concordance.
Key Findings:
Inter-rater reliability among oncologists was acceptable to good.
ChatGPT-5 Thinking demonstrated the strongest overall performance among the LLMs.
Statistically significant performance differences were observed across all evaluated domains.
Performance differences were most clinically relevant in complex treatment scenarios.
Interpretation:
Selected LLMs may support melanoma MDT practice in resource-limited settings.
Limitations:
The study is retrospective and conducted at a single center.
Further prospective studies are needed to validate LLM-assisted treatment recommendations.
Conclusion:
LLMs show potential as supportive tools in melanoma treatment decision-making.