Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance

By
Ingrid Luo
Anna Graber-Naidich
Mengrui Zhang
Rakshit Kaushik
Grant M. Nieda
Tony Chen
Bo Gu
Eunji Choi
Victoria Y. Ding
Fatma Gunturkun
Mina Satoyoshi
Archana Bhat
Tae Yoon Lee
Chloe C. Su
Timothy John Ellis-Caleo
A. Solomon Henry
Manisha Desai
Leah M. Backhus
Natalie S. Lui
Ann Leung
Joel W. Neal
Allison W. Kurian
Curtis P. Langlotz
Heather A. Wakelee
Su-Ying Liang
Aparajita Khan
Summer S. Han
November 28, 2025

npj Digital Medicine

Overview

This study shows that generative large language models (LLMs), combined with rule-based longitudinal smoothing, extract detailed smoking histories from clinical notes more accurately than BERT-based models. The approach achieved over 96% accuracy across seven smoking variables and generalized across academic and community healthcare systems (97.5–98.8% accuracy in external validation), supporting improved lung cancer surveillance.

Background

Smoking is a major risk factor for lung cancer and other malignancies, with smoking history critical for risk assessment and screening eligibility. Electronic health records (EHRs) often lack comprehensive and accurate smoking data, especially detailed variables like pack-years and quit-years, which are frequently documented only in unstructured clinical notes. Prior natural language processing methods have been limited by single-institution data and lack of longitudinal consistency. Large language models offer a promising solution to extract and harmonize smoking histories across diverse healthcare settings.
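To make the two variables named above concrete, here is a minimal sketch using the standard clinical definitions: pack-years is packs smoked per day (20 cigarettes per pack) multiplied by years smoked, and quit-years is the time elapsed since quitting. The function names and input fields are hypothetical and chosen for illustration; this summary does not describe how the study itself derives these values.

```python
from typing import Optional


def pack_years(cigarettes_per_day: float, years_smoked: float) -> float:
    """Pack-years = packs per day (20 cigarettes per pack) x years smoked."""
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    return (cigarettes_per_day / 20.0) * years_smoked


def quit_years(quit_year: Optional[int], note_year: int) -> Optional[int]:
    """Years since quitting as of the note; None if no quit year is documented."""
    if quit_year is None:
        return None
    if quit_year > note_year:
        raise ValueError("quit year cannot be after the note year")
    return note_year - quit_year


print(pack_years(30, 20))      # 1.5 packs/day for 20 years -> 30.0
print(quit_years(2015, 2024))  # 9
```

Both quantities change over time, which is one reason the same patient's notes can disagree and why longitudinal harmonization matters.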

Data Highlights

Model / Cohort	Accuracy (%)	Variables Evaluated	Notes Annotated	Patients
Generative LLMs (Gemini-1.5-Flash, PaLM-2-Text-Bison, GPT-4)	>96	7 smoking variables	1,683	518
External Validation	97.5–98.8	7 smoking variables	Not specified	Not specified
Deployment Cohort (Gemini-1.5-Flash)	Not specified	Longitudinal smoking data	79,408	4,792 lung cancer patients
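For readers who want to see how a per-variable accuracy figure like those in the table might be computed, the sketch below scores extracted values against annotated values by exact match, one variable at a time. Exact-match scoring, the function name, and the dictionary field names are assumptions made for illustration; this summary does not state the study's scoring rules.

```python
from typing import Any, Dict, List


def per_variable_accuracy(
    extracted: List[Dict[str, Any]],
    annotated: List[Dict[str, Any]],
    variables: List[str],
) -> Dict[str, float]:
    """Fraction of notes where the extracted value equals the annotation."""
    if len(extracted) != len(annotated):
        raise ValueError("extracted and annotated must cover the same notes")
    if not annotated:
        raise ValueError("at least one annotated note is required")
    return {
        var: sum(e.get(var) == a.get(var) for e, a in zip(extracted, annotated))
        / len(annotated)
        for var in variables
    }


# Toy data for four notes (not from the study).
extracted = [
    {"smoking_status": "former", "pack_years": 30},
    {"smoking_status": "current", "pack_years": 12},
    {"smoking_status": "never", "pack_years": None},
    {"smoking_status": "former", "pack_years": 20},
]
annotated = [
    {"smoking_status": "former", "pack_years": 30},
    {"smoking_status": "current", "pack_years": 12},
    {"smoking_status": "never", "pack_years": None},
    {"smoking_status": "former", "pack_years": 25},
]

print(per_variable_accuracy(extracted, annotated, ["smoking_status", "pack_years"]))
# {'smoking_status': 1.0, 'pack_years': 0.75}
```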

Key Findings

Generative LLMs outperformed BERT-based models, achieving over 96% accuracy in extracting seven smoking-related variables from clinical notes.
Longitudinal rule-based smoothing techniques effectively resolved inconsistencies in smoking histories across multiple time points.
External validation across academic and community healthcare systems demonstrated robust generalizability with 97.5–98.8% accuracy.
Deployment on a large lung cancer cohort (4,792 patients) showed that risk model-based surveillance incorporating extracted smoking data outperformed NCCN Guidelines in identifying second malignancies.
Smoking history variables extracted included smoking status, pack-years, quit-years, and duration, providing comprehensive longitudinal profiles.
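The idea of rule-based longitudinal smoothing can be sketched as follows. The two rules used here are illustrative assumptions, not the study's actual rule set: (1) a "never" status recorded after any evidence of smoking is corrected to "former", and (2) cumulative pack-years is not allowed to decrease over time. The `SmokingRecord` type and `smooth_history` function are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class SmokingRecord:
    note_date: str                      # ISO date, e.g. "2021-03-14"
    status: str                         # "never", "former", or "current"
    pack_years: Optional[float] = None  # None when not documented


def smooth_history(records: List[SmokingRecord]) -> List[SmokingRecord]:
    """Apply two assumed consistency rules across a patient's notes."""
    ordered = sorted(records, key=lambda r: r.note_date)
    smoothed: List[SmokingRecord] = []
    ever_smoked = False
    max_pack_years: Optional[float] = None
    for rec in ordered:
        status = rec.status
        # Rule 1 (assumed): "never" after evidence of smoking becomes "former".
        if status in ("current", "former"):
            ever_smoked = True
        elif status == "never" and ever_smoked:
            status = "former"
        # Rule 2 (assumed): cumulative pack-years cannot decrease.
        value = rec.pack_years
        if value is not None:
            if max_pack_years is not None and value < max_pack_years:
                value = max_pack_years
            else:
                max_pack_years = value
        smoothed.append(replace(rec, status=status, pack_years=value))
    return smoothed


# Toy history for one patient (not from the study).
history = [
    SmokingRecord("2019-01-10", "current", 20.0),
    SmokingRecord("2020-06-02", "never"),
    SmokingRecord("2021-03-14", "former", 15.0),
]
for rec in smooth_history(history):
    print(rec.note_date, rec.status, rec.pack_years)
```

In this toy history the second note's "never" becomes "former" and the third note's pack-years is raised from 15.0 to 20.0. A production rule set would also need to decide which conflicting value to trust, since carrying the maximum forward can propagate a single erroneous high reading.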

Clinical Implications

Incorporating generative LLMs with longitudinal smoothing into clinical workflows can substantially improve the quality and completeness of smoking history documentation in EHRs. Enhanced smoking data enables more accurate lung cancer risk stratification and surveillance, potentially leading to earlier detection of second malignancies and better patient outcomes. This approach supports broader clinical applications requiring reliable longitudinal smoking information.

Conclusion

The integration of advanced generative LLMs with rule-based longitudinal data harmonization markedly improves extraction of detailed smoking histories from clinical notes, enhancing lung cancer monitoring and risk assessment. This methodology demonstrates strong accuracy and generalizability across healthcare systems, offering a scalable solution to address smoking data quality challenges in EHRs.

