CancerLLM: a large language model in cancer domain - Report - MDSpire

By Mingchen Li, Zaifu Zhan, Jiatan Huang, Jeremy Yeung, Kai Ding, Anne Blaes, Steven Johnson, Hongfang Liu, Hua Xu, Rui Zhang

February 20, 2026

CancerLLM: A 7B-Parameter Specialized Model for Oncology Phenotyping and Diagnosis

Overview

CancerLLM is a specialized 7-billion-parameter language model trained on 2.7 million oncology clinical notes and 515,000 pathology reports. It reports F1 scores of 91.78% on cancer phenotype extraction and 86.81% on diagnosis generation, outperforming existing large language models while remaining computationally efficient and robust.

Background

Large language models (LLMs) have shown promise in medical natural language processing (NLP) tasks but often lack specialization for oncology applications such as cancer phenotyping and diagnosis. Additionally, many existing models have tens of billions of parameters, which makes them computationally demanding to run in healthcare environments. To address these gaps, CancerLLM was developed with a focus on cancer-specific data and tasks, aiming to improve accuracy and efficiency in clinical oncology NLP applications.
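
To give a rough sense of why parameter count matters for deployment, the sketch below estimates the memory needed just to hold a model's weights. It assumes 2 bytes per parameter (16-bit floating point, a common inference precision). The 70-billion-parameter comparison point is an illustrative assumption standing in for "tens of billions", not a model named in this report. The estimate ignores activations, caches, and framework overhead, so real requirements would be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory (decimal GB) needed to store model weights only."""
    return num_params * bytes_per_param / 1e9

# 7B parameters is the CancerLLM size stated in the report.
cancerllm_gb = weight_memory_gb(7e9)
# 70B is a hypothetical comparison size, chosen only for illustration.
larger_gb = weight_memory_gb(70e9)

print(f"7B weights at 16-bit:  {cancerllm_gb:.0f} GB")
print(f"70B weights at 16-bit: {larger_gb:.0f} GB")
```

Under these assumptions the 7B model needs about 14 GB for weights and the hypothetical 70B model about 140 GB, which is the difference between fitting on a single accelerator and needing several.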

Data Highlights

Metric | CancerLLM | Existing LLMs | Improvement
Phenotyping Extraction F1 Score | 91.78% | ~82.55% | +9.23%
Diagnosis Generation F1 Score | 86.81% | Not specified | Higher than existing LLMs
Model Parameters | 7 billion | Tens of billions | Smaller size, more efficient
Training Data | 2.7M clinical notes + 515K pathology reports | Not specified | Specialized oncology data

Key Findings

  • CancerLLM was trained on 2.7 million clinical notes and 515,000 pathology reports covering 17 cancer types.
  • The model achieved an F1 score of 91.78% on cancer phenotype extraction tasks.
  • It reached an F1 score of 86.81% on diagnosis generation tasks.
  • CancerLLM outperformed existing large language models, with an average F1 score improvement of 9.23%.
  • With 7 billion parameters, CancerLLM is smaller and more computationally efficient than many existing models with tens of billions of parameters.
  • The model demonstrated robustness and efficiency in terms of time and GPU resource usage.
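
The F1 scores above are the harmonic mean of precision and recall. As a minimal sketch of how such a score can be computed for an extraction task, the example below compares a set of predicted phenotype mentions against a gold-standard set using exact matching. The example mentions, the function name `f1_score_sets`, and the exact-match criterion are all illustrative assumptions; this report does not describe the evaluation protocol the authors used.

```python
def f1_score_sets(predicted: set, gold: set) -> float:
    """Exact-match F1 between predicted and gold-standard items."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # No correct predictions: precision and recall are both zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Hypothetical phenotype mentions, invented for illustration only.
gold = {"tumor size: 2.3 cm", "stage: IIA", "ER status: positive", "grade: 2"}
predicted = {"tumor size: 2.3 cm", "stage: IIA", "ER status: negative"}

score = f1_score_sets(predicted, gold)
print(f"F1 = {score:.4f}")
```

In this toy case two of the three predictions are correct (precision 2/3) and two of the four gold items are recovered (recall 1/2), giving F1 = 4/7, or about 0.571.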

Clinical Implications

CancerLLM offers a practical and effective tool for oncology clinical research and practice by improving accuracy in cancer phenotyping and diagnosis generation. Its smaller size and computational efficiency make it more feasible for deployment in healthcare settings with limited resources. The model's robustness supports reliable integration into clinical workflows to enhance decision-making and data extraction.

Conclusion

CancerLLM represents a significant advancement in specialized oncology language models, combining high performance with computational efficiency. It holds promise for improving cancer-related NLP tasks and supporting clinical oncology applications.

References

  1. Li et al. 2024 -- CancerLLM: A Specialized Language Model for Oncology Applications
  2. Achiam et al. 2023 -- GPT-4 Technical Report
  3. Touvron et al. 2023 -- Llama 2: Open Foundation and Fine-Tuned Chat Models
  4. Perez-Lopez et al. 2024 -- A Guide to Artificial Intelligence for Cancer Researchers
