Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance

By
Ingrid Luo
Anna Graber-Naidich
Mengrui Zhang
Rakshit Kaushik
Grant M. Nieda
Tony Chen
Bo Gu
Eunji Choi
Victoria Y. Ding
Fatma Gunturkun
Mina Satoyoshi
Archana Bhat
Tae Yoon Lee
Chloe C. Su
Timothy John Ellis-Caleo
A. Solomon Henry
Manisha Desai
Leah M. Backhus
Natalie S. Lui
Ann Leung
Joel W. Neal
Allison W. Kurian
Curtis P. Langlotz
Heather A. Wakelee
Su-Ying Liang
Aparajita Khan
Summer S. Han
November 28, 2025

npj Digital Medicine

At a Glance

Category	Detail
Condition	Lung cancer and smoking-related health outcomes
Key Mechanisms	Extraction of comprehensive smoking history from clinical notes using large language models (LLMs) combined with rule-based longitudinal smoothing
Target Population	Lung cancer patients and individuals at risk due to smoking history
Care Setting	Academic and community-based healthcare systems with electronic health records (EHRs)

Key Highlights

Generative LLMs (Gemini-1.5-Flash, PaLM-2-Text-Bison, GPT-4) achieved >96% accuracy extracting seven smoking variables from clinical notes.
Longitudinal smoothing techniques resolve inconsistencies in smoking history across multiple time points.
Risk model-based surveillance that incorporated the enhanced smoking data outperformed NCCN Guidelines-based surveillance in identifying second malignancies.

Guideline-Based Recommendations

Diagnosis

Use detailed smoking history including pack-years, duration, and quit-years for lung cancer risk assessment.
Incorporate longitudinal smoking data rather than single time-point status for accurate risk stratification.
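
The quantities named above follow standard definitions: pack-years is packs smoked per day multiplied by years smoked, and quit-years is the time elapsed since cessation. The sketch below illustrates that arithmetic only; the function names and the 365.25-day year are assumptions for illustration, not the authors' implementation.

```python
from datetime import date
from typing import Optional


def pack_years(packs_per_day: float, years_smoked: float) -> float:
    """Pack-years = packs smoked per day x years smoked."""
    if packs_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    return packs_per_day * years_smoked


def quit_years(quit_date: Optional[date], as_of: date) -> Optional[float]:
    """Years since quitting as of a reference date.

    Returns None when no quit date is documented (e.g. a current smoker).
    The 365.25-day year is an illustrative convention.
    """
    if quit_date is None:
        return None
    if quit_date > as_of:
        raise ValueError("quit_date is after the reference date")
    return (as_of - quit_date).days / 365.25


# Example: 1.5 packs per day for 20 years
exposure = pack_years(1.5, 20)
```

Computing quit-years against an explicit reference date, rather than today's date, keeps the value reproducible when the same note is re-processed later.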

Management

Apply automated LLM-based extraction methods to improve smoking documentation quality in EHRs.
Use enhanced smoking history data to guide lung cancer screening eligibility and surveillance strategies.
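
As one illustration of how extracted variables could feed a screening eligibility check, the sketch below encodes the 2021 USPSTF criteria (age 50 to 80, at least 20 pack-years, and currently smoking or quit within the past 15 years). The choice of USPSTF criteria, the function name, and the status labels are assumptions made here for illustration; the summary above does not state which eligibility rule the study applied.

```python
from typing import Optional


def uspstf_2021_eligible(
    age: int,
    pack_years: float,
    status: str,
    quit_years: Optional[float],
) -> bool:
    """Return True if the patient meets the 2021 USPSTF screening criteria.

    status is one of the hypothetical labels "current", "former", "never".
    quit_years is only consulted for former smokers.
    """
    if not 50 <= age <= 80:
        return False
    if pack_years < 20:
        return False
    if status == "current":
        return True
    if status == "former":
        # Missing quit-years cannot confirm eligibility, so treat as ineligible.
        return quit_years is not None and quit_years <= 15
    return False


# Example: a 62-year-old former smoker, 30 pack-years, quit 10 years ago
eligible = uspstf_2021_eligible(62, 30.0, "former", 10.0)
```

Treating a missing quit-years value as ineligible is a conservative design choice in this sketch; a real workflow might instead flag such records for manual review.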

Monitoring & Follow-up

Implement longitudinal data smoothing to identify and correct inconsistencies in smoking history over time.
Regularly update smoking status and related variables to inform ongoing risk assessment.
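
To make the idea of longitudinal smoothing concrete, here is a minimal sketch of a single rule: once any note records a patient as a current or former smoker, a later note saying "never" is treated as inconsistent and corrected to "former". This rule, the function name, and the status labels are hypothetical examples; the study's actual smoothing rules are not described in this summary.

```python
from typing import List, Tuple


def smooth_status(history: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Apply one illustrative consistency rule to a patient's status timeline.

    history holds (ISO date string, status) pairs, where status is
    "never", "former", or "current". ISO dates sort chronologically
    as plain strings, so no date parsing is needed here.
    """
    smoothed = []
    ever_smoked = False
    for note_date, status in sorted(history):
        if status in ("current", "former"):
            ever_smoked = True
        elif status == "never" and ever_smoked:
            # A "never" after documented smoking contradicts earlier notes.
            status = "former"
        smoothed.append((note_date, status))
    return smoothed


# Example: notes arrive out of order and the latest one says "never"
timeline = smooth_status([
    ("2021-03-01", "never"),
    ("2019-06-15", "current"),
    ("2020-01-10", "former"),
])
```

In the example, the 2021 note is rewritten to "former" because the 2019 and 2020 notes already document smoking.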

Risks

Be aware of potential inaccuracies and hallucinations in LLM-extracted data; apply rule-based corrections.
Recognize that incomplete or inconsistent smoking documentation can impair risk assessment and surveillance.
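
One way to catch implausible or hallucinated values before they reach a risk model is to run simple plausibility checks on each LLM response. The sketch below assumes the model returns a JSON object with hypothetical fields `status`, `pack_years`, and `years_smoked`; these field names and the specific checks are illustrative assumptions, not the schema or rules used in the study.

```python
import json
from typing import List

# Hypothetical status vocabulary for this sketch
ALLOWED_STATUS = ("never", "former", "current", "unknown")


def check_extraction(raw_json: str, patient_age: int) -> List[str]:
    """Return a list of problems found in one LLM extraction (empty if none)."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(record, dict):
        return ["output is not a JSON object"]

    problems = []
    status = record.get("status")
    if status not in ALLOWED_STATUS:
        problems.append(f"unexpected status: {status!r}")

    pack_years = record.get("pack_years")
    if pack_years is not None:
        if not isinstance(pack_years, (int, float)) or pack_years < 0:
            problems.append("pack_years must be a non-negative number")
        elif status == "never" and pack_years > 0:
            problems.append("never-smoker with positive pack_years")

    years = record.get("years_smoked")
    if isinstance(years, (int, float)) and years > patient_age:
        problems.append("years_smoked exceeds patient age")

    return problems


# Example: a contradictory extraction for a 50-year-old patient
issues = check_extraction('{"status": "never", "pack_years": 12}', 50)
```

Records that return a non-empty list could be routed to correction rules or manual review rather than silently accepted.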

Patient & Prescribing Data

Lung cancer patients with documented smoking histories across multiple healthcare systems

Enhanced smoking history extraction enables improved identification of patients eligible for lung cancer screening and surveillance, potentially reducing missed second malignancies.

Clinical Best Practices

Combine generative LLMs with rule-based longitudinal smoothing for robust smoking history extraction.
Validate extraction models across diverse healthcare settings to ensure generalizability.
Use comprehensive smoking variables beyond status (e.g., pack-years, quit-years) for clinical decision-making.
Incorporate smoking history data into risk models to optimize lung cancer monitoring and follow-up.

References

Original Source(s)

Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance
