Semi-supervised Transformer Model Enhances Rare Pulmonary Disease Diagnosis from EHRs
Overview
The WEST (WEakly Supervised Transformer) model combines a small set of expert-labeled records with a much larger set of probabilistic labels derived from electronic health records (EHRs) to improve diagnosis and subphenotyping of rare pulmonary diseases. Evaluated on Boston Children’s Hospital data, WEST outperforms existing methods in phenotype classification, subphenotype identification, and disease progression prediction.
Background
Rare diseases affect hundreds of millions globally but remain underdiagnosed due to low prevalence and limited clinician familiarity. Computational phenotyping using electronic health records (EHRs) offers scalable detection but is hindered by scarce high-quality labeled data. Expert-validated labels are accurate but limited, while EHR-derived labels are broader but noisy. Integrating these data sources efficiently is critical to improving rare disease diagnosis and characterization.
Data Highlights
Metric | WEST Model Performance | Existing Methods
Phenotype Classification Accuracy | Higher | Lower
Subphenotype Identification | Improved | Less Accurate
Disease Progression Prediction | Enhanced | Inferior
Key Findings
WEST is a weakly supervised transformer trained on a small set of expert (gold-standard) labels and a large set of silver-standard probabilistic labels derived from EHR data.
Iterative refinement of probabilistic labels during training improves model calibration and accuracy.
Applied to two rare pulmonary diseases, WEST outperforms existing computational phenotyping methods in classification and subphenotyping.
WEST enables prediction of disease progression, providing deeper clinical insights from routine EHR data.
The approach reduces reliance on manual annotation, enhancing label efficiency in rare disease representation learning.
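The interplay between expert labels and iteratively refined probabilistic labels can be illustrated with a toy sketch. This is not the WEST implementation: a one-feature logistic regression stands in for the transformer, and the function names (fit_soft, refine), the blend factor, the gold_weight parameter, and the synthetic data are all assumptions made for illustration. The source does not specify the actual update rule.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    # w[0] is the bias; w[1:] are the feature weights.
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def fit_soft(X, targets, weights, epochs=200, lr=0.5):
    """Logistic regression on soft targets in [0, 1] with per-example weights."""
    w = [0.0] * (len(X[0]) + 1)
    total = sum(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t, s in zip(X, targets, weights):
            err = s * (predict(w, x) - t)  # gradient of weighted soft cross-entropy
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / total for wi, g in zip(w, grad)]
    return w

def refine(X_gold, y_gold, X_silver, p_silver, rounds=3, blend=0.5, gold_weight=5.0):
    """Alternate between fitting on gold + silver labels and updating silver labels.

    Hypothetical scheme: gold labels stay fixed and are up-weighted; each silver
    label moves part of the way toward the current model's prediction.
    """
    p = list(p_silver)
    w = None
    for _ in range(rounds):
        X = X_gold + X_silver
        targets = [float(y) for y in y_gold] + p
        weights = [gold_weight] * len(X_gold) + [1.0] * len(X_silver)
        w = fit_soft(X, targets, weights)
        p = [(1 - blend) * pi + blend * predict(w, x) for pi, x in zip(p, X_silver)]
    return w, p

# Synthetic data: the true phenotype is positive when the single feature is > 0.
X_gold = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
y_gold = [0, 0, 0, 1, 1, 1]

X_silver = [[-2.9 + 0.2 * i] for i in range(30)]
p_silver = []
for i, (x,) in enumerate(X_silver):
    label = 1 if x > 0 else 0
    if i % 5 == 0:          # every fifth silver label is deliberately wrong
        label = 1 - label
    p_silver.append(0.7 if label else 0.3)

w, refined = refine(X_gold, y_gold, X_silver, p_silver)
print("silver label at x=-2.9 before:", p_silver[0], "after:", round(refined[0], 3))
```

In this toy setup the first silver example (x = -2.9) starts with a wrong label of 0.7, and refinement pulls it toward the model's prediction. How much weight to give the gold labels and how quickly to move the silver labels are design choices this sketch fixes arbitrarily.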
Clinical Implications
WEST offers a practical tool for clinicians to improve detection and characterization of rare pulmonary diseases using routinely collected EHR data. By combining limited expert annotations with broader EHR-derived labels, it facilitates earlier and more accurate diagnosis and supports personalized disease management through subphenotyping and progression prediction.
Conclusion
The WEST framework demonstrates that weakly supervised transformer models can combine scarce expert annotations with noisy EHR-derived labels to advance rare disease diagnosis and subphenotyping, addressing key challenges in clinical phenotyping for low-prevalence conditions.