Detection of cancer recurrence from Thai-English electronic medical records using sentence embeddings - Summary - MDSpire

By Ekapob Sangariyavanich, Wanchana Ponthongmak, Nawanan Theera-Ampornpunt, Nat Tangchitnob, Gareth J McKay, Ammarin Thakkinstian

July 2, 2026


Objective:

To develop and validate monolingual and bilingual Sentence-BERT (SBERT) models for detecting cancer recurrence in Thai-English electronic medical records (EMRs) from Thai cancer hospitals.

Approach:
  • Model Development: Used a multicentre dataset of 32,436 documents from 1,250 patients.
  • External Validation: Validated the models on an independent dataset of 9,244 documents from 384 patients across two Thai cancer hospitals.
  • Performance Benchmarking: Compared the SBERT models against MetBERT, a fine-tuned PubMedBERT.
Key Findings:
  • MetBERT achieved the highest AUPRC for locoregional versus no recurrence (11.1%) and locoregional versus distant recurrence (91.7%).
  • Monolingual-SBERT excelled at distant versus no recurrence (AUPRC 32.0%).
  • Bilingual-SBERT performed best for distant versus no recurrence in external validation (AUPRC 17.55%–24.39%).
  • MetBERT led in distinguishing locoregional versus distant recurrence (88.30%–94.70%).
  • Bilingual-SBERT demonstrated robust external validation performance (AUPRC 85.25%–91.80%).
Interpretation:

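The low AUPRC values (9%–32%) reflect the extreme class imbalance in real-world data (~1% recurrence prevalence). Because a classifier with no skill has an expected AUPRC roughly equal to the positive-class prevalence, these values are still well above chance. Fine-tuned MetBERT achieved the highest performance, while bilingual-SBERT proved more robust in external validation.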
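To make the link between class imbalance and AUPRC concrete, the following is a minimal sketch in plain Python. It is not the study's evaluation code: the function names (`average_precision`, `prevalence`, `mean_random_ap`) and the synthetic data are assumptions for illustration only. It computes AUPRC as step-wise average precision and estimates what a scorer with no skill would achieve at roughly 1% prevalence.

```python
import random


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: sequence of 0/1 values; scores: sequence of floats, where a
    higher score means "more likely positive". Tied scores are ranked in
    input order, which is adequate for a sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(label for _, label in ranked)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of this positive
    return total / n_pos


def prevalence(labels):
    """Fraction of positive labels; the approximate no-skill AUPRC."""
    return sum(labels) / len(labels)


def mean_random_ap(n_docs, n_pos, trials, seed=0):
    """Mean average precision of purely random scores (hypothetical helper)."""
    rng = random.Random(seed)
    labels = [1] * n_pos + [0] * (n_docs - n_pos)
    total = 0.0
    for _ in range(trials):
        scores = [rng.random() for _ in labels]
        total += average_precision(labels, scores)
    return total / trials


if __name__ == "__main__":
    # Synthetic setting: 10,000 documents, 100 positives (1% prevalence).
    labels = [1] * 100 + [0] * 9900
    print("prevalence:", prevalence(labels))
    print("mean AUPRC of random scores:", mean_random_ap(10000, 100, 20))
```

Under these assumptions, the random-score average should sit close to the prevalence, which is why an AUPRC in the 9%–32% range can still represent a large gain over chance when only about 1% of documents are positive.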

Limitations:
  • Low AUPRC values reflect the extreme class imbalance in the dataset.
  • Text-length constraints may affect model performance.
Conclusion:

Sentence embedding frameworks provide a practical, generalisable solution for detecting cancer recurrence within multilingual EMRs, suitable for clinical integration as a screening tool for cancer registry workflows.
