Objective:
To develop and validate monolingual and bilingual SBERT models for detecting cancer recurrence within Thai-English electronic medical records from Thai cancer hospitals.
Approach:
Model Development: Used a multicentre dataset of 32,436 documents from 1,250 patients.
External Validation: Used an independent dataset of 9,244 documents from 384 patients across two Thai cancer hospitals.
Performance Benchmarking: The SBERT models were compared against a fine-tuned PubMedBERT model (MetBERT).
Key Findings:
MetBERT achieved the highest AUPRC for locoregional versus no recurrence (11.1%) and locoregional versus distant recurrence (91.7%).
Monolingual-SBERT performed best for distant versus no recurrence (AUPRC 32.0%).
Bilingual-SBERT performed best for distant versus no recurrence in external validation (AUPRC 17.55%–24.39%).
MetBERT led in distinguishing locoregional versus distant recurrence (AUPRC 88.30%–94.70%).
The low AUPRC values (9%–32%) on the recurrence versus no recurrence tasks reflect the extreme class imbalance in real-world data (~1% recurrence prevalence). Despite this, fine-tuned MetBERT achieved the highest performance, while bilingual-SBERT showed superior robustness during external validation.
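To put the 9%–32% range in context: a classifier that ranks documents at random has an expected AUPRC roughly equal to the positive-class prevalence, so at about 1% prevalence the chance-level baseline is about 1%. The sketch below illustrates this with synthetic labels and scores only; it does not use or reproduce the study data. The function names (`average_precision`, `simulate`) and the score distributions are assumptions made for illustration, and AUPRC is computed as step-wise average precision, which may differ from the estimator used in the study.

```python
import random


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: list of 0/1 values; scores: list of floats, higher = more positive.
    Ties in scores are broken by input order (no special tie handling).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank documents from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of this positive
    return total / n_pos


def simulate(n=20000, prevalence=0.01, seed=0):
    """Compare a random scorer with an informative one on imbalanced labels.

    All values here are synthetic; the Gaussian shift of 1.5 is an arbitrary
    illustrative choice, not a property of any model in the study.
    """
    rng = random.Random(seed)
    labels = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    random_scores = [rng.random() for _ in range(n)]
    informative_scores = [rng.gauss(1.5 if y else 0.0, 1.0) for y in labels]
    return {
        "prevalence": sum(labels) / n,
        "random": average_precision(labels, random_scores),
        "informative": average_precision(labels, informative_scores),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the random scorer lands near the prevalence, while the informative scorer scores well above it yet still far below 100%. This is why AUPRC values on rare-event tasks are best read relative to prevalence rather than on an absolute scale.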
Limitations:
Low AUPRC values reflect the extreme class imbalance in the dataset.
Text-length constraints may affect model performance.
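One common way to work around a fixed input length in sentence-embedding models is to split a long document into overlapping windows, embed each window, and average the resulting vectors. The source does not say how text length was handled, so the sketch below is an assumption for illustration only. The names `chunk_text` and `mean_pool` are hypothetical, and windows are measured in characters as a simple, language-agnostic stand-in for the model tokenizer's token limit (Thai text is not whitespace-delimited, so splitting on spaces would be unreliable).

```python
def chunk_text(text, max_chars=512, overlap=64):
    """Split text into overlapping character windows.

    max_chars and overlap are illustrative defaults; a real pipeline would
    count tokens using the embedding model's own tokenizer.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors into one vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


if __name__ == "__main__":
    windows = chunk_text("abcdefghij", max_chars=4, overlap=1)
    print(windows)
    # Stand-in "embeddings": in practice each window would be encoded by the model.
    fake_embeddings = [[float(len(w)), 1.0] for w in windows]
    print(mean_pool(fake_embeddings))
```

Averaging window embeddings keeps every part of the document in play but can dilute a short, decisive passage, so the window size and the pooling method are design choices that would need validating against the task.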
Conclusion:
Sentence embedding frameworks provide a practical, generalisable solution for detecting cancer recurrence within multilingual EMRs, suitable for clinical integration as a screening tool for cancer registry workflows.