Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance

By Ingrid Luo, Anna Graber-Naidich, Mengrui Zhang, Rakshit Kaushik, Grant M. Nieda, Tony Chen, Bo Gu, Eunji Choi, Victoria Y. Ding, Fatma Gunturkun, Mina Satoyoshi, Archana Bhat, Tae Yoon Lee, Chloe C. Su, Timothy John Ellis-Caleo, A. Solomon Henry, Manisha Desai, Leah M. Backhus, Natalie S. Lui, Ann Leung, Joel W. Neal, Allison W. Kurian, Curtis P. Langlotz, Heather A. Wakelee, Su-Ying Liang, Aparajita Khan, Summer S. Han
November 28, 2025

npj Digital Medicine

Objective:

To improve the accuracy and completeness of smoking history documentation in electronic health records (EHRs) using large language models (LLMs), in support of lung cancer surveillance.

Key Findings:

Generative LLMs achieved >96% accuracy across seven key smoking-related variables, including smoking status and history.
External validation showed robust generalizability with 97.5–98.8% accuracy across diverse patient populations.
Risk model-based surveillance incorporating smoking factors outperformed NCCN Guidelines in identifying second malignancies.
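
The per-variable accuracy figures above can be read as the share of records where the extracted value matches a manually reviewed reference value. The following is a minimal sketch of that calculation, not the authors' evaluation code. The variable names (`smoking_status`, `quit_year`) and the toy records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def per_variable_accuracy(gold, predicted):
    """Return {variable: fraction of records where predicted == gold}.

    `gold` and `predicted` are parallel lists of dicts, one dict per note.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold_rec, pred_rec in zip(gold, predicted):
        for variable, gold_value in gold_rec.items():
            total[variable] += 1
            # A missing prediction counts as an error unless gold is also None.
            if pred_rec.get(variable) == gold_value:
                correct[variable] += 1
    return {v: correct[v] / total[v] for v in total}


# Hypothetical toy data: reference labels vs. LLM-extracted values.
gold = [
    {"smoking_status": "former", "quit_year": 2010},
    {"smoking_status": "never", "quit_year": None},
    {"smoking_status": "current", "quit_year": None},
    {"smoking_status": "former", "quit_year": 2015},
]
predicted = [
    {"smoking_status": "former", "quit_year": 2010},
    {"smoking_status": "never", "quit_year": None},
    {"smoking_status": "former", "quit_year": None},  # status mislabeled
    {"smoking_status": "former", "quit_year": 2015},
]

accuracy = per_variable_accuracy(gold, predicted)
for variable, value in accuracy.items():
    print(f"{variable}: {value:.1%}")
```

Reporting accuracy per variable rather than as one pooled number makes it visible when a single field, such as a quit date, is extracted less reliably than the others.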

Interpretation:

The study demonstrates that generative LLMs can significantly improve the accuracy and completeness of smoking history documentation, which is critical for lung cancer surveillance and patient monitoring.

Limitations:

The findings come from the specific healthcare systems involved and may not generalize to other settings.
Potential LLM hallucinations were not systematically addressed in longitudinal contexts, raising concerns about reliability.
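
One simple way to surface possible errors across a patient's timeline is a rule-based plausibility check on the extracted values. The sketch below is an illustrative assumption, not a method described in the study: it flags any note labeled "never" that is dated after a note labeled "current" or "former". The status labels and the `flag_implausible_transitions` helper are hypothetical.

```python
def flag_implausible_transitions(history):
    """Flag 'never' labels that follow evidence of smoking.

    `history` is a list of (iso_date, status) tuples for one patient.
    Returns the flagged (iso_date, status) tuples in chronological order.
    """
    flags = []
    ever_smoked = False
    # ISO 8601 date strings sort chronologically as plain strings.
    for date, status in sorted(history):
        if status in ("current", "former"):
            ever_smoked = True
        elif status == "never" and ever_smoked:
            flags.append((date, status))
    return flags


# Hypothetical extracted statuses for one patient.
history = [
    ("2021-01-10", "former"),
    ("2019-03-01", "never"),
    ("2022-09-05", "never"),  # contradicts the earlier notes
    ("2020-06-15", "current"),
]

flagged = flag_implausible_transitions(history)
print(flagged)
```

A flagged note is not necessarily an extraction error, since the clinical documentation itself may be inconsistent. A check like this only narrows down which records merit manual review.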

Conclusion:

Generative LLMs represent a promising advancement in extracting and harmonizing smoking histories from clinical documentation, which is crucial for enhancing lung cancer monitoring and improving patient outcomes.
