CancerLLM: a large language model in cancer domain

By
Mingchen Li
Zaifu Zhan
Jiatan Huang
Jeremy Yeung
Kai Ding
Anne Blaes
Steven Johnson
Hongfang Liu
Hua Xu
Rui Zhang
February 20, 2026

npj Digital Medicine

Overview

CancerLLM is a specialized 7-billion-parameter language model trained on 2.7 million oncology clinical notes and 515,000 pathology reports. It reaches F1 scores of 91.78% on cancer phenotype extraction and 86.81% on diagnosis generation, outperforming existing large language models while remaining robust and efficient in time and GPU resource usage.

Background

Large language models (LLMs) have shown promise in medical natural language processing (NLP) tasks but often lack specialization for oncology applications such as cancer phenotyping and diagnosis. In addition, many existing models have tens of billions of parameters, which makes them computationally demanding to run in healthcare environments. CancerLLM was developed to address both gaps: it is trained on cancer-specific data and tasks, with the aim of improving accuracy and efficiency in clinical oncology NLP applications.

Data Highlights

Metric	CancerLLM	Existing LLMs	Improvement
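Phenotyping Extraction F1 Score	91.78%	Not specified	Higher than existing LLMs
Diagnosis Generation F1 Score	86.81%	Not specified	Higher than existing LLMs
Average F1 Score Improvement	+9.23%	Baseline	Reported as an average, not per task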
Model Parameters	7 Billion	Tens of Billions	Smaller size, more efficient
Training Data	2.7M Clinical Notes + 515K Pathology Reports	Not specified	Specialized oncology data

Key Findings

CancerLLM was trained on 2.7 million clinical notes and 515,000 pathology reports covering 17 cancer types.
The model achieved an F1 score of 91.78% on cancer phenotype extraction tasks.
It reached an F1 score of 86.81% on diagnosis generation tasks.
CancerLLM outperformed existing large language models, with an average F1 score improvement of 9.23%.
With 7 billion parameters, CancerLLM is smaller and more computationally efficient than many existing models with tens of billions of parameters.
The model demonstrated robustness and efficiency in terms of time and GPU resource usage.
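
For readers less familiar with the metric behind these numbers, the following is a minimal sketch of how an F1 score can be computed for an extraction task. It assumes exact-match, set-based scoring of (attribute, value) pairs; the evaluation protocol actually used for CancerLLM is not described in this summary, and the function name and example annotations below are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match, set-based precision, recall and F1 for extracted items.

    Assumption: an extracted item counts as correct only if it matches a
    gold annotation exactly. Other protocols (token-level, relaxed match)
    would give different numbers.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for a single pathology report (not from the paper).
gold = {
    ("tumor_size", "2.5 cm"),
    ("stage", "IIA"),
    ("grade", "2"),
    ("er_status", "positive"),
}
predicted = {
    ("tumor_size", "2.5 cm"),
    ("stage", "IIA"),
    ("grade", "3"),  # wrong value, so it counts against precision
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

In this toy case two of three predictions are correct and two of four gold items are found, so F1 sits between precision and recall. A reported F1 of 91.78% therefore implies that both precision and recall are high, since the harmonic mean is pulled toward the lower of the two.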

Clinical Implications

CancerLLM offers a practical and effective tool for oncology clinical research and practice by improving accuracy in cancer phenotyping and diagnosis generation. Its smaller size and computational efficiency make it more feasible for deployment in healthcare settings with limited resources. The model's robustness supports reliable integration into clinical workflows to enhance decision-making and data extraction.
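
To give a rough sense of why parameter count matters for deployment, the sketch below estimates the memory needed just to hold model weights. It assumes 16-bit weights (2 bytes per parameter) and ignores activations, caches, and framework overhead; the 70-billion-parameter comparison model is a hypothetical stand-in for "tens of billions", not a model named in the source.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate memory for model weights alone, in decimal gigabytes.

    Assumption: 2 bytes per parameter (16-bit weights). Real deployments
    need additional memory for activations and runtime overhead.
    """
    return num_parameters * bytes_per_parameter / 1e9


cancerllm_gb = weight_memory_gb(7e9)      # 7 billion parameters
larger_model_gb = weight_memory_gb(70e9)  # hypothetical 70B comparison

print(f"7B model weights:  about {cancerllm_gb:.0f} GB")
print(f"70B model weights: about {larger_model_gb:.0f} GB")
```

Under these assumptions the 7B model's weights take about 14 GB, which is within reach of a single high-memory GPU, while the hypothetical 70B model would need roughly ten times as much.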

Conclusion

CancerLLM represents a significant advancement in specialized oncology language models, combining high performance with computational efficiency. It holds promise for improving cancer-related NLP tasks and supporting clinical oncology applications.

References

Original Source(s)

CancerLLM: a large language model in cancer domain
