Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance - Report - MDSpire

By Ingrid Luo, Anna Graber-Naidich, Mengrui Zhang, Rakshit Kaushik, Grant M. Nieda, Tony Chen, Bo Gu, Eunji Choi, Victoria Y. Ding, Fatma Gunturkun, Mina Satoyoshi, Archana Bhat, Tae Yoon Lee, Chloe C. Su, Timothy John Ellis-Caleo, A. Solomon Henry, Manisha Desai, Leah M. Backhus, Natalie S. Lui, Ann Leung, Joel W. Neal, Allison W. Kurian, Curtis P. Langlotz, Heather A. Wakelee, Su-Ying Liang, Aparajita Khan, Summer S. Han

November 28, 2025

Advanced LLMs Enhance Smoking History Extraction for Lung Cancer Monitoring

Overview

This study shows that generative large language models (LLMs), combined with rule-based longitudinal smoothing, extract detailed smoking histories from clinical notes more accurately than BERT-based models. The approach achieved over 96% accuracy across seven smoking variables and 97.5–98.8% accuracy in external validation across healthcare systems, supporting improved lung cancer surveillance.

Background

Smoking is a major risk factor for lung cancer and other malignancies, and smoking history is critical for risk assessment and screening eligibility. Electronic health records (EHRs) often lack comprehensive and accurate smoking data, especially detailed variables such as pack-years and quit-years, which are frequently documented only in unstructured clinical notes. Prior natural language processing methods have been limited by reliance on single-institution data and a lack of longitudinal consistency. Large language models offer a promising way to extract and harmonize smoking histories across diverse healthcare settings.
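To make the role of pack-years and quit-years concrete, here is a minimal Python sketch. It uses the standard definition of pack-years (packs per day multiplied by years smoked) and, purely as an illustration, the 2021 USPSTF lung cancer screening criteria (age 50 to 80, at least 20 pack-years, currently smoking or quit within the past 15 years). The study summarized here does not state which eligibility rule it applied, and the `SmokingHistory` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmokingHistory:
    # Hypothetical container for illustration; not the study's schema.
    status: str                          # "current", "former", or "never"
    packs_per_day: float = 0.0
    years_smoked: float = 0.0
    quit_years: Optional[float] = None   # years since quitting (former smokers)


def pack_years(h: SmokingHistory) -> float:
    """Pack-years = packs smoked per day x years smoked."""
    return h.packs_per_day * h.years_smoked


def meets_uspstf_2021(age: int, h: SmokingHistory) -> bool:
    """Illustrative check against the 2021 USPSTF screening criteria."""
    if not 50 <= age <= 80:
        return False
    if pack_years(h) < 20:
        return False
    if h.status == "current":
        return True
    # Former smokers qualify only if they quit within the past 15 years.
    return h.status == "former" and h.quit_years is not None and h.quit_years <= 15


patient = SmokingHistory(status="former", packs_per_day=1.5, years_smoked=20, quit_years=10)
print(pack_years(patient), meets_uspstf_2021(62, patient))
```

Because both thresholds depend on numeric values, a record that stores only "former smoker" cannot answer the eligibility question, which is why missing pack-years and quit-years matter.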

Data Highlights

Model | Accuracy (%) | Variables Evaluated | Notes Annotated | Patients
Generative LLMs (Gemini-1.5-Flash, PaLM-2-Text-Bison, GPT-4) | >96 | 7 smoking variables | 1683 | 518
External Validation | 97.5–98.8 | 7 smoking variables | Not specified | Not specified
Deployment Cohort (Gemini-1.5-Flash) | Not specified | Longitudinal smoking data | 79,408 | 4792 lung cancer patients
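As a rough illustration of how per-variable accuracy figures like those above could be computed, the sketch below compares model output with manual annotations note by note. It assumes exact-match scoring, which the summary does not confirm (numeric variables may have been scored with a tolerance), and the function name and record layout are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, object]


def per_variable_accuracy(
    gold: List[Record], pred: List[Record], variables: List[str]
) -> Dict[str, float]:
    """Fraction of notes where the predicted value exactly matches the annotation."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    return {
        v: sum(g.get(v) == p.get(v) for g, p in zip(gold, pred)) / len(gold)
        for v in variables
    }


# Toy data: four annotated notes and the matching model extractions.
gold = [
    {"status": "former", "pack_years": 30},
    {"status": "never", "pack_years": 0},
    {"status": "current", "pack_years": 12},
    {"status": "former", "pack_years": 40},
]
pred = [
    {"status": "former", "pack_years": 30},
    {"status": "never", "pack_years": 0},
    {"status": "current", "pack_years": 10},  # pack-years extracted incorrectly
    {"status": "former", "pack_years": 40},
]
print(per_variable_accuracy(gold, pred, ["status", "pack_years"]))
```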

Key Findings

  • Generative LLMs outperformed BERT-based models, achieving over 96% accuracy in extracting seven smoking-related variables from clinical notes.
  • Longitudinal rule-based smoothing techniques effectively resolved inconsistencies in smoking histories across multiple time points.
  • External validation across academic and community healthcare systems demonstrated robust generalizability with 97.5–98.8% accuracy.
  • Deployment on a large lung cancer cohort (4792 patients) showed that risk model-based surveillance incorporating extracted smoking data outperformed NCCN Guidelines in identifying second malignancies.
  • Smoking history variables extracted included smoking status, pack-years, quit-years, and duration, providing comprehensive longitudinal profiles.
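The summary does not list the specific smoothing rules, so the following Python sketch only illustrates the general idea of rule-based longitudinal smoothing with two assumed rules: a patient previously documented as a current or former smoker cannot later be a never-smoker, and cumulative pack-years cannot decrease over time. The `NoteExtraction` class, the `smooth` function, and both rules are hypothetical, not the authors' implementation.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class NoteExtraction:
    # Hypothetical per-note output of the extraction model.
    note_date: date
    status: str                        # "never", "former", or "current"
    pack_years: Optional[float] = None


def smooth(history: List[NoteExtraction]) -> List[NoteExtraction]:
    """Apply two assumed consistency rules across a patient's notes, in date order."""
    ever_smoked = False
    max_pack_years: Optional[float] = None
    smoothed = []
    for rec in sorted(history, key=lambda r: r.note_date):
        status = rec.status
        if status in ("current", "former"):
            ever_smoked = True
        elif status == "never" and ever_smoked:
            # Assumed rule 1: "never" after documented smoking becomes "former".
            status = "former"
        if rec.pack_years is not None and (
            max_pack_years is None or rec.pack_years > max_pack_years
        ):
            max_pack_years = rec.pack_years
        # Assumed rule 2: carry the running maximum forward, which also
        # fills in notes where pack-years were not documented.
        smoothed.append(replace(rec, status=status, pack_years=max_pack_years))
    return smoothed


notes = [
    NoteExtraction(date(2020, 1, 1), "former", 20.0),
    NoteExtraction(date(2021, 1, 1), "never"),         # inconsistent with earlier notes
    NoteExtraction(date(2019, 1, 1), "current", 15.0),
]
for rec in smooth(notes):
    print(rec.note_date, rec.status, rec.pack_years)
```

A real pipeline would likely need more rules (for example, reconciling quit-years with note dates), but the pattern is the same: sort each patient's notes by date and correct values that contradict earlier documentation.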

Clinical Implications

Incorporating generative LLMs with longitudinal smoothing into clinical workflows can substantially improve the quality and completeness of smoking history documentation in EHRs. Enhanced smoking data enables more accurate lung cancer risk stratification and surveillance, potentially leading to earlier detection of second malignancies and better patient outcomes. This approach supports broader clinical applications requiring reliable longitudinal smoking information.

Conclusion

The integration of advanced generative LLMs with rule-based longitudinal data harmonization markedly improves extraction of detailed smoking histories from clinical notes, enhancing lung cancer monitoring and risk assessment. This methodology demonstrates strong accuracy and generalizability across healthcare systems, offering a scalable solution to address smoking data quality challenges in EHRs.
