Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance - Scorecard - MDSpire

By Ingrid Luo, Anna Graber-Naidich, Mengrui Zhang, Rakshit Kaushik, Grant M. Nieda, Tony Chen, Bo Gu, Eunji Choi, Victoria Y. Ding, Fatma Gunturkun, Mina Satoyoshi, Archana Bhat, Tae Yoon Lee, Chloe C. Su, Timothy John Ellis-Caleo, A. Solomon Henry, Manisha Desai, Leah M. Backhus, Natalie S. Lui, Ann Leung, Joel W. Neal, Allison W. Kurian, Curtis P. Langlotz, Heather A. Wakelee, Su-Ying Liang, Aparajita Khan, and Summer S. Han

November 28, 2025

Clinical Scorecard: Utilizing advanced language models to retrieve smoking history from clinical documentation for lung cancer monitoring

At a Glance

Category | Detail
Condition | Lung cancer and smoking-related health outcomes
Key Mechanisms | Extraction of comprehensive smoking history from clinical notes using large language models (LLMs) combined with rule-based longitudinal smoothing
Target Population | Lung cancer patients and individuals at risk due to smoking history
Care Setting | Academic and community-based healthcare systems with electronic health records (EHRs)

Key Highlights

  • Generative LLMs (Gemini-1.5-Flash, PaLM-2-Text-Bison, GPT-4) achieved >96% accuracy extracting seven smoking variables from clinical notes.
  • Longitudinal smoothing techniques resolve inconsistencies in smoking history across multiple time points.
  • Risk-model-based surveillance that incorporated the enhanced smoking data identified second malignancies better than surveillance based on NCCN Guidelines.

Guideline-Based Recommendations

Diagnosis

  • Use detailed smoking history including pack-years, duration, and quit-years for lung cancer risk assessment.
  • Incorporate longitudinal smoking data rather than single time-point status for accurate risk stratification.
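As a worked illustration of the variables named above, the sketch below computes pack-years with the standard definition (average packs per day multiplied by years smoked) and quit-years as the time elapsed since a recorded quit date. The function names and the 365.25-day year are assumptions made for this example, not details taken from the study.

```python
from datetime import date
from typing import Optional


def pack_years(packs_per_day: float, years_smoked: float) -> float:
    """Pack-years = average packs smoked per day x years of smoking."""
    if packs_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    return packs_per_day * years_smoked


def quit_years(quit_date: Optional[date], as_of: date) -> Optional[float]:
    """Years since quitting as of a reference date; None if no quit date is recorded."""
    if quit_date is None:
        return None
    if quit_date > as_of:
        raise ValueError("quit_date is after as_of")
    # Assumption for this sketch: an average year length of 365.25 days.
    return (as_of - quit_date).days / 365.25


# Example: 1.5 packs per day for 20 years gives 30 pack-years.
print(pack_years(1.5, 20))
```

Because quit-years depends on the reference date, the same patient yields a different value at each visit, which is one reason a single time-point status is less informative than the longitudinal record.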

Management

  • Apply automated LLM-based extraction methods to improve smoking documentation quality in EHRs.
  • Use enhanced smoking history data to guide lung cancer screening eligibility and surveillance strategies.
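One way to picture LLM-based extraction is as a prompt that asks for a fixed set of fields and a parser that tolerates malformed replies. The sketch below is hypothetical: the field names, the prompt wording, and the `llm` callable (any function that takes a prompt string and returns the model's text) are assumptions for illustration, and it does not reproduce the study's prompts or any vendor API.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical field names, based on the variables mentioned in this scorecard.
FIELDS = ("status", "pack_years", "duration_years", "quit_years")

PROMPT_TEMPLATE = (
    "Extract the patient's smoking history from the clinical note below. "
    "Reply with a JSON object with exactly these keys: {fields}. "
    "Use null for anything the note does not state.\n\nNote:\n{note}"
)


def extract_smoking_history(
    note: str, llm: Callable[[str], str]
) -> Dict[str, Optional[object]]:
    """Ask the model for structured fields and parse its reply defensively."""
    prompt = PROMPT_TEMPLATE.format(fields=", ".join(FIELDS), note=note)
    raw = llm(prompt)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}  # An unparseable reply is treated as "nothing extracted".
    if not isinstance(parsed, dict):
        parsed = {}
    # Keep only the expected keys; anything missing becomes None.
    return {field: parsed.get(field) for field in FIELDS}
```

Passing the model as a plain callable keeps the sketch independent of any particular provider and makes it easy to test with a stub that returns a canned reply.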

Monitoring & Follow-up

  • Implement longitudinal data smoothing to identify and correct inconsistencies in smoking history over time.
  • Regularly update smoking status and related variables to inform ongoing risk assessment.
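To make longitudinal smoothing concrete, the sketch below applies a single illustrative rule: once a patient has been recorded as a current or former smoker, any later "never" record is treated as inconsistent and relabeled "former". This rule, the `SmokingRecord` type, and the status labels are assumptions for the example; the study's actual rule set is not described in this scorecard.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List


@dataclass(frozen=True)
class SmokingRecord:
    note_date: date
    status: str  # "never", "former", or "current"


def smooth_status(records: List[SmokingRecord]) -> List[SmokingRecord]:
    """Relabel a 'never' that follows any 'current' or 'former' record as 'former'."""
    ordered = sorted(records, key=lambda r: r.note_date)
    ever_smoked = False
    smoothed = []
    for rec in ordered:
        if rec.status in ("current", "former"):
            ever_smoked = True
            smoothed.append(rec)
        elif rec.status == "never" and ever_smoked:
            # Inconsistent with the earlier history: correct rather than drop.
            smoothed.append(replace(rec, status="former"))
        else:
            smoothed.append(rec)
    return smoothed
```

Records are sorted by note date first, so the correction depends on the clinical timeline rather than on the order in which notes were processed.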

Risks

  • Be aware of potential inaccuracies and hallucinations in LLM-extracted data; apply rule-based corrections.
  • Recognize that incomplete or inconsistent smoking documentation can impair risk assessment and surveillance.
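Rule-based checks of this kind can be as simple as plausibility tests on each extracted value. The sketch below flags values for review instead of silently correcting them; the specific rules, labels, and parameter names are hypothetical and are meant only to show the shape of such a check.

```python
from typing import List, Optional

ALLOWED_STATUS = {"never", "former", "current"}


def flag_implausible(
    status: Optional[str],
    pack_years: Optional[float],
    duration_years: Optional[float],
    age_years: float,
) -> List[str]:
    """Return human-readable flags for values that fail simple plausibility rules."""
    flags = []
    if status not in ALLOWED_STATUS:
        flags.append("status not in allowed set")
    if pack_years is not None and pack_years < 0:
        flags.append("negative pack-years")
    if status == "never" and pack_years:
        flags.append("never-smoker with non-zero pack-years")
    if duration_years is not None and duration_years > age_years:
        flags.append("smoking duration exceeds patient age")
    return flags
```

An empty list means no rule fired, not that the extraction is verified; flagged records would still need review against the source note.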

Patient & Prescribing Data

Lung cancer patients with documented smoking histories across multiple healthcare systems

Enhanced smoking history extraction enables improved identification of patients eligible for lung cancer screening and surveillance, potentially reducing missed second malignancies.

Clinical Best Practices

  • Combine generative LLMs with rule-based longitudinal smoothing for robust smoking history extraction.
  • Validate extraction models across diverse healthcare settings to ensure generalizability.
  • Use comprehensive smoking variables beyond status (e.g., pack-years, quit-years) for clinical decision-making.
  • Incorporate smoking history data into risk models to optimize lung cancer monitoring and follow-up.
