A weakly supervised transformer for rare disease diagnosis and subphenotyping from EHRs with pulmonary case studies - Report - MDSpire

  • By Kimberly F. Greco, Zongxin Yang, Mengyan Li, Han Tong, Sara Morini Sweet, Alon Geva, Kenneth D. Mandl, Benjamin A. Raby, Tianxi Cai

  • February 6, 2026

Weakly Supervised Transformer Model Enhances Rare Pulmonary Disease Diagnosis from EHRs

Overview

The WEST (WEakly Supervised Transformer) model leverages limited expert-labeled data combined with extensive probabilistic labels from EHRs to improve diagnosis and subphenotyping of rare pulmonary diseases. Evaluated on Boston Children’s Hospital data, WEST outperforms existing methods in phenotype classification, subphenotype identification, and disease progression prediction.

Background

Rare diseases affect hundreds of millions globally but remain underdiagnosed due to low prevalence and limited clinician familiarity. Computational phenotyping using electronic health records (EHRs) offers scalable detection but is hindered by scarce high-quality labeled data. Expert-validated labels are accurate but limited, while EHR-derived labels are broader but noisy. Integrating these data sources efficiently is critical to improving rare disease diagnosis and characterization.

Data Highlights

Metric                             WEST Model Performance   Existing Methods
Phenotype Classification Accuracy  Higher                   Lower
Subphenotype Identification        Improved                 Less Accurate
Disease Progression Prediction     Enhanced                 Inferior

Key Findings

  • WEST uses a weakly supervised transformer trained on limited expert labels and extensive silver-standard probabilistic labels from EHR data.
  • Iterative refinement of probabilistic labels during training improves model calibration and accuracy.
  • Applied to two rare pulmonary diseases, WEST outperforms existing computational phenotyping methods in classification and subphenotyping.
  • WEST enables prediction of disease progression, providing deeper clinical insights from routine EHR data.
  • The approach reduces reliance on manual annotation, enhancing label efficiency in rare disease representation learning.
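The first two findings (training on a few expert labels plus many probabilistic labels, then refining the probabilistic labels over iterations) can be illustrated with a minimal sketch. This is not the WEST transformer: a small logistic regression stands in for the model, the data are synthetic, and every name here (`weakly_supervised_fit`, `make_toy_cohort`, `gold_weight`, `mix`) is hypothetical. The blend rule used to refine silver labels and the up-weighting of gold labels are assumptions chosen for illustration; the summary does not say how WEST performs these steps.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, b, x):
    """Predicted disease probability for one feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_soft(X, targets, weights, epochs=200, lr=0.5):
    """Fit logistic regression to soft targets in [0, 1] by gradient
    descent on a sample-weighted cross-entropy loss."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    total = sum(weights)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t, s in zip(X, targets, weights):
            err = s * (predict(w, b, x) - t)  # d(loss)/d(logit)
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / total for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b


def weakly_supervised_fit(X_gold, y_gold, X_silver, p_silver,
                          rounds=3, mix=0.5, gold_weight=5.0):
    """Train on gold (expert) labels plus silver (probabilistic) labels,
    refining the silver labels after each round.

    The blend rule and the gold up-weighting are illustrative assumptions.
    """
    X = list(X_gold) + list(X_silver)
    weights = [gold_weight] * len(X_gold) + [1.0] * len(X_silver)
    p = list(p_silver)  # copy so the caller's labels are not mutated
    for _ in range(rounds):
        w, b = fit_soft(X, list(y_gold) + p, weights)
        # Move each silver label part of the way toward the model's
        # prediction; gold labels stay fixed.
        p = [(1 - mix) * pi + mix * predict(w, b, x)
             for pi, x in zip(p, X_silver)]
    return w, b, p


def make_toy_cohort(n_gold=10, n_silver=200, seed=0):
    """Synthetic stand-in for EHR features; not real patient data."""
    rng = random.Random(seed)

    def patient(label):
        center = 1.5 if label else -1.5
        return [rng.gauss(center, 1.0), rng.gauss(0.0, 1.0)]

    y_gold = [i % 2 for i in range(n_gold)]
    X_gold = [patient(y) for y in y_gold]
    y_true = [i % 2 for i in range(n_silver)]
    X_silver = [patient(y) for y in y_true]
    # Noisy probabilistic labels: weakly informative about the true status.
    p_silver = [min(1.0, max(0.0, (0.7 if y else 0.3) + rng.gauss(0.0, 0.15)))
                for y in y_true]
    return X_gold, y_gold, X_silver, p_silver, y_true


X_gold, y_gold, X_silver, p_silver, y_true = make_toy_cohort()
w, b, p_refined = weakly_supervised_fit(X_gold, y_gold, X_silver, p_silver)
```

In this sketch the gold labels are never altered and only the silver labels move, a design choice that keeps expert annotations as fixed anchors while the noisier labels are gradually pulled toward what the model has learned.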

Clinical Implications

WEST offers a practical tool for clinicians to improve detection and characterization of rare pulmonary diseases using routinely collected EHR data. By combining limited expert annotations with broader EHR-derived labels, it facilitates earlier and more accurate diagnosis and supports personalized disease management through subphenotyping and progression prediction.

Conclusion

The WEST framework demonstrates that weakly supervised transformer models can combine a small set of expert labels with abundant, noisier EHR-derived labels to advance rare disease diagnosis and subphenotyping, addressing key challenges in clinical phenotyping for low-prevalence conditions.

References

  1. Greco et al. 2024 -- A semi-supervised transformer model for diagnosing rare diseases and subphenotyping using electronic health records
  2. The Lancet Global Health 2024 -- The landscape for rare diseases in 2024
  3. Wang, C. M. et al. 2024 -- Operational description of rare diseases: a reference to improve the recognition and visibility of rare diseases
  4. Marwaha, S., Knowles, J. W. & Ashley, E. A. 2022 -- A guide for the diagnosis of rare and undiagnosed disease: beyond the exome
