Clinical evaluation of large language model recommendations in melanoma: comparison with multidisciplinary tumor board decisions in a real-world cohort - Report - MDSpire
Clinical Report: Assessment of Large Language Model Suggestions in Melanoma
Overview
This study evaluates how well four large language models (LLMs) generate treatment recommendations for melanoma, comparing their output with the decisions of a multidisciplinary tumor board in a real-world cohort.
Background
Malignant melanoma is a significant global health challenge, with rising incidence and a continuing need for effective treatment strategies. Multidisciplinary tumor boards (MDTs) play a crucial role in decision-making for melanoma management, particularly in resource-limited settings. Integrating LLMs into this process requires thorough evaluation.
Data Highlights
LLM | Performance Rating
ChatGPT-5 Thinking | Strongest
ChatGPT-4o | Moderate
Gemini 2.5 Pro | Less Favorable
DeepSeek-V3.2 | Least Favorable
Key Findings
Inter-rater reliability among oncologists was acceptable to good.
ChatGPT-5 Thinking showed consistent performance across evaluated domains.
Statistically significant differences were observed between the LLMs in all domains assessed.
Performance differences were most relevant in complex treatment scenarios.
LLM-generated recommendations should not replace independent treatment decisions.
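To make the inter-rater reliability finding more concrete, the following is a minimal sketch of how agreement between two raters can be quantified. The report does not name the statistic the study used, so Cohen's kappa is substituted here purely as an illustration. The function name, the 1 to 5 rating scale, and the ratings themselves are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items (illustrative only)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: share of items given an identical score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same category
    # if each rated independently according to their own score frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores (1 = poor, 5 = excellent) from two oncologists
# rating the same eight LLM recommendations. Not data from the study.
oncologist_1 = [5, 4, 4, 3, 5, 2, 4, 3]
oncologist_2 = [5, 4, 3, 3, 5, 2, 4, 4]

kappa = cohen_kappa(oncologist_1, oncologist_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance. With more than two raters or ordinal scales, statistics such as Fleiss' kappa or an intraclass correlation coefficient are common alternatives.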
Clinical Implications
The findings indicate that LLMs may have a role in melanoma treatment decision-making, but their recommendations should be used as supportive tools rather than as standalone treatment decisions.
Conclusion
This study emphasizes the need for further research before LLMs can be integrated into clinical workflows.