Objective:
To develop and validate MSIC-Bench, a benchmark designed to evaluate large language models (LLMs) in microsatellite instability (MSI) cancer care, and to systematically assess the capabilities and limitations of state-of-the-art LLMs.
Key Findings:
Standard LLMs exhibit a substantial deficit in specialized MSI knowledge.
Retrieval-augmented generation (RAG) shifts the bottleneck from knowledge to information retrieval, introducing 'retrieval failure' as a new dominant error mode.
RAG systems can transform high-risk fabrications into safer refusals but may also introduce 'false refusals' (incorrect denials of information), which degrade utility.
Integrating broad clinical guidelines with specialized knowledge in RAG architectures offers a practical solution for improving LLM performance.
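The error modes named above can be made concrete with a small tallying sketch. This is illustrative only and is not the study's scoring protocol: the `Outcome` fields, the `error_mode` rules, and the category labels are assumptions introduced here. In particular, a refusal is treated as "false" only when the question was answerable and the relevant passage was retrieved.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Outcome:
    """One graded benchmark response (all fields are hypothetical labels)."""
    answerable: bool  # the knowledge base contains the answer
    retrieved: bool   # the retriever surfaced the relevant passage
    refused: bool     # the model declined to answer
    correct: bool     # the answer matched the reference (ignored if refused)


def error_mode(o: Outcome) -> str:
    """Map a graded response to one category (assumed rules, see lead-in)."""
    if o.refused:
        # Refusing despite having the evidence in hand costs utility.
        if o.answerable and o.retrieved:
            return "false_refusal"
        return "safe_refusal"
    if o.correct:
        return "correct"
    if o.answerable and not o.retrieved:
        # The answer existed but never reached the model.
        return "retrieval_failure"
    # A wrong answer given anyway is treated as a fabrication.
    return "fabrication"


def tally(outcomes: list[Outcome]) -> Counter:
    """Count how often each category occurs."""
    return Counter(error_mode(o) for o in outcomes)


if __name__ == "__main__":
    sample = [
        Outcome(answerable=True, retrieved=True, refused=False, correct=True),
        Outcome(answerable=True, retrieved=False, refused=False, correct=False),
        Outcome(answerable=True, retrieved=True, refused=True, correct=False),
        Outcome(answerable=False, retrieved=False, refused=True, correct=False),
    ]
    for mode, count in tally(sample).items():
        print(mode, count)
```

Separating the categories this way makes the trade-off visible: a system can lower its fabrication count simply by refusing more often, so the false-refusal count has to be reported alongside it.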
Interpretation:
The study highlights the current capabilities and limitations of LLMs in oncology, providing a roadmap for their future development and safe clinical integration, with implications for improving patient care.
Limitations:
The study evaluates only three LLMs and four prompting strategies.
The evaluation may not encompass all potential clinical scenarios or MSI-related complexities.
Conclusion:
The findings provide actionable insights for developing more robust LLM systems in MSI cancer care.