Advanced LLMs Enhance Smoking History Extraction for Lung Cancer Monitoring
Overview
This study shows that generative large language models (LLMs), combined with rule-based longitudinal smoothing, improve the accuracy of extracting detailed smoking histories from clinical notes. The approach achieved over 96% accuracy across seven smoking-related variables and generalized well across healthcare systems (97.5–98.8% accuracy in external validation), supporting improved lung cancer surveillance.
Background
Smoking is a major risk factor for lung cancer and other malignancies, and smoking history is critical for risk assessment and screening eligibility. Electronic health records (EHRs) often lack comprehensive and accurate smoking data, especially detailed variables such as pack-years and quit-years, which are frequently documented only in unstructured clinical notes. Prior natural language processing methods have been limited by single-institution data and a lack of longitudinal consistency. Large language models offer a promising way to extract and harmonize smoking histories across diverse healthcare settings.
Key Findings
Generative LLMs outperformed BERT-based models, achieving over 96% accuracy in extracting seven smoking-related variables from clinical notes.
Longitudinal rule-based smoothing techniques effectively resolved inconsistencies in smoking histories across multiple time points.
External validation across academic and community healthcare systems demonstrated robust generalizability with 97.5–98.8% accuracy.
Deployment on a large lung cancer cohort (4,792 patients) showed that risk-model-based surveillance incorporating the extracted smoking data outperformed NCCN Guidelines in identifying second malignancies.
The extracted variables included smoking status, pack-years, quit-years, and smoking duration, providing comprehensive longitudinal smoking profiles.
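To make the idea of rule-based longitudinal smoothing concrete, here is a minimal sketch in Python. It is not the study's actual rule set, which this summary does not describe. It assumes one plausible rule: a patient documented as a current or former smoker cannot later be a never-smoker, so a later "never" label is rewritten as "former". The `Observation` class, the `smooth_statuses` function, and the three status labels are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Observation:
    """One extracted smoking status tied to a clinical note date (hypothetical schema)."""
    note_date: date
    status: str  # assumed labels: "never", "former", or "current"


def smooth_statuses(observations):
    """Return a chronologically sorted copy of the observations.

    Illustrative rule (an assumption, not the study's published rule):
    once any note records "current" or "former", a later "never"
    is treated as inconsistent and corrected to "former".
    """
    ordered = sorted(observations, key=lambda o: o.note_date)
    smoothed = []
    ever_smoked = False
    for obs in ordered:
        status = obs.status
        if status in ("current", "former"):
            ever_smoked = True
        elif status == "never" and ever_smoked:
            status = "former"
        smoothed.append(Observation(obs.note_date, status))
    return smoothed


# Example: the middle note contradicts the earlier "current" record.
history = [
    Observation(date(2019, 3, 1), "current"),
    Observation(date(2020, 6, 15), "never"),
    Observation(date(2021, 1, 10), "former"),
]
for obs in smooth_statuses(history):
    print(obs.note_date.isoformat(), obs.status)
```

In this example the 2020 note is relabeled from "never" to "former", while the other two notes are left unchanged. A real pipeline would likely need further rules, for instance to keep pack-years and quit-years consistent over time, and those would have to come from the study's methods rather than from this sketch.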
Clinical Implications
Incorporating generative LLMs with longitudinal smoothing into clinical workflows can substantially improve the quality and completeness of smoking history documentation in EHRs. Enhanced smoking data enables more accurate lung cancer risk stratification and surveillance, potentially leading to earlier detection of second malignancies and better patient outcomes. This approach supports broader clinical applications requiring reliable longitudinal smoking information.
Conclusion
The integration of advanced generative LLMs with rule-based longitudinal data harmonization markedly improves extraction of detailed smoking histories from clinical notes, enhancing lung cancer monitoring and risk assessment. This methodology demonstrates strong accuracy and generalizability across healthcare systems, offering a scalable solution to address smoking data quality challenges in EHRs.
References
Stanford and Sutter Health Systems Study, 2024 -- Utilizing advanced language models to retrieve smoking history from clinical documentation for lung cancer monitoring
by Ingrid Luo, Anna Graber-Naidich, Mengrui Zhang, Rakshit Kaushik, Grant M. Nieda, Tony Chen, Bo Gu, Eunji Choi, Victoria Y. Ding, Fatma Gunturkun, Mina Satoyoshi, Archana Bhat, Tae Yoon Lee, Chloe C. Su, Timothy John Ellis-Caleo, A. Solomon Henry, Manisha Desai, Leah M. Backhus, Natalie S. Lui, Ann Leung, Joel W. Neal, Allison W. Kurian, Curtis P. Langlotz, Heather A. Wakelee, Su-Ying Liang, Aparajita Khan, Summer S. Han