Detection of cancer recurrence from Thai-English electronic medical records using sentence embeddings

By
Ekapob Sangariyavanich
Wanchana Ponthongmak
Nawanan Theera-Ampornpunt
Nat Tangchitnob
Gareth J McKay
Ammarin Thakkinstian
July 2, 2026

BMJ Health & Care Informatics

Objective:

To develop and validate monolingual and bilingual SBERT models for detecting cancer recurrence within Thai-English electronic medical records from Thai cancer hospitals.

Approach:

Model Development: A multicentre dataset of 32,436 documents from 1,250 patients was used to develop the models.
External Validation: The models were validated on an independent dataset of 9,244 documents from 384 patients across two Thai cancer hospitals.
Performance Benchmarking: The SBERT models were benchmarked against a fine-tuned PubMedBERT (MetBERT).

Key Findings:

MetBERT achieved the highest AUPRC for locoregional versus no recurrence (11.1%) and locoregional versus distant recurrence (91.7%).
Monolingual-SBERT excelled at distant versus no recurrence (32.0%).
Bilingual-SBERT performed best for distant versus no recurrence in external validation (AUPRC 17.55%–24.39%).
MetBERT led in distinguishing locoregional versus distant recurrence (88.30%–94.70%).
Bilingual-SBERT demonstrated robust external validation performance (AUPRC 85.25%–91.80%).

Interpretation:

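The low AUPRC values (9%–32%) reflect the extreme class imbalance of real-world data (~1% recurrence prevalence) rather than uninformative models, because the AUPRC expected from chance alone is roughly equal to the prevalence. Within this setting, fine-tuned MetBERT achieved the highest performance, while bilingual-SBERT showed superior robustness during external validation.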
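To make the link between class imbalance and AUPRC concrete, recall that a model ranking documents at random has an expected AUPRC close to the positive-class prevalence. The sketch below is an illustration only, not the authors' evaluation code: the function names `average_precision` and `simulate`, the sample size, and the simulated labels and scores are all assumptions. Only the ~1% prevalence and the 32% AUPRC figure come from the summary above.

```python
import random


def average_precision(labels, scores):
    """Step-wise AUPRC: mean of the precision measured at each positive's rank."""
    # Rank documents from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / hits if hits else 0.0


def simulate(prevalence, n=20000, seed=0):
    """Return (observed prevalence, AUPRC of a random ranker) on simulated data."""
    rng = random.Random(seed)
    # Hypothetical labels: 1 = recurrence, 0 = no recurrence.
    labels = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    # A random ranker assigns scores that carry no information about the label.
    random_scores = [rng.random() for _ in range(n)]
    return sum(labels) / n, average_precision(labels, random_scores)


if __name__ == "__main__":
    observed_prevalence, chance_auprc = simulate(prevalence=0.01)
    print(f"observed prevalence: {observed_prevalence:.4f}")
    print(f"AUPRC of random ranker: {chance_auprc:.4f}")
    # Ratio of a reported AUPRC (32%) to the approximate chance level (~1%).
    print(f"lift of a 32% AUPRC over ~1% chance: {0.32 / 0.01:.0f}x")
```

Under these assumptions the random ranker's AUPRC stays on the order of the prevalence, so an AUPRC of 32% at ~1% prevalence sits roughly thirtyfold above chance. This is why such values are better read against the prevalence than against the 0% to 100% scale.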

Limitations:

Low AUPRC values reflect the extreme class imbalance in the dataset.
Text-length constraints may affect model performance.

Conclusion:

Sentence embedding frameworks provide a practical, generalisable solution for detecting cancer recurrence within multilingual EMRs, suitable for clinical integration as a screening tool for cancer registry workflows.


