To develop an interpretable machine learning framework for objective severity stratification of thyroid eye disease (TED) into mild, moderate-to-severe, and sight-threatening categories, and to evaluate how data handling strategies influence model generalizability.
Approach:
Key Findings:
Random Forest with class weighting achieved the highest AUC (0.811).
Random Forest with SMOTE achieved the highest recall (0.669), F1-score (0.648), and specificity (0.815).
Correlations among longitudinal scans from the same patient, if left unaccounted for, can inflate performance metrics, which underscores the need for temporal deduplication.
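As an illustration of the temporal deduplication mentioned above, the following is a minimal sketch rather than the study's actual procedure. It assumes that deduplication means keeping only the earliest scan per patient; the record layout `(patient_id, scan_date, severity_label)`, the example data, and the function name `deduplicate_by_patient` are all hypothetical.

```python
from datetime import date

# Hypothetical scan records: (patient_id, scan_date, severity_label).
scans = [
    ("P01", date(2021, 3, 1), "mild"),
    ("P01", date(2021, 9, 15), "moderate-to-severe"),
    ("P02", date(2020, 6, 10), "mild"),
    ("P03", date(2022, 1, 5), "sight-threatening"),
    ("P03", date(2022, 1, 20), "sight-threatening"),
]


def deduplicate_by_patient(records):
    """Keep only the earliest scan for each patient (an assumed rule)."""
    earliest = {}
    for patient_id, scan_date, label in records:
        kept = earliest.get(patient_id)
        if kept is None or scan_date < kept[1]:
            earliest[patient_id] = (patient_id, scan_date, label)
    # Sort by patient ID so the output order is deterministic.
    return [earliest[pid] for pid in sorted(earliest)]


deduplicated = deduplicate_by_patient(scans)
for record in deduplicated:
    print(record)
```

Other rules (for example, keeping the most recent scan, or keeping scans separated by a minimum time gap) would fit the same structure; which rule is appropriate depends on the clinical question.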
Interpretation:
Controlling for longitudinal redundancy and intra-patient correlations materially changes measured model performance and therefore the conclusions drawn about generalizability.
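One common way to control for intra-patient correlations during evaluation is to split data at the patient level, so that no patient contributes scans to both the training and test sets. The sketch below illustrates the idea under stated assumptions; it is not taken from the study. The function `split_by_patient`, its parameters, and the example records are hypothetical.

```python
import random


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Assign whole patients to train or test.

    Each record is a tuple whose first element is the patient ID.
    """
    patient_ids = sorted({record[0] for record in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    # Hold out at least one patient for testing.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r[0] not in test_ids]
    test = [r for r in records if r[0] in test_ids]
    return train, test


# Hypothetical records: (patient_id, scan_index).
example_records = [
    ("P01", 0), ("P01", 1),
    ("P02", 0),
    ("P03", 0), ("P03", 1),
    ("P04", 0),
    ("P05", 0),
]

train_records, test_records = split_by_patient(
    example_records, test_fraction=0.4, seed=0
)
train_patients = {r[0] for r in train_records}
test_patients = {r[0] for r in test_records}
print("Patients in both sets:", train_patients & test_patients)
```

In practice, scikit-learn's `GroupShuffleSplit` and `GroupKFold` provide the same guarantee when patient IDs are passed as the `groups` argument.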
Limitations:
The study is retrospective and may not capture all variables that influence TED severity.
Dataset construction and model evaluation may each have introduced bias, and the results should be interpreted accordingly.
Conclusion:
Random Forest with class weighting demonstrated the best discriminative performance on temporally deduplicated scans.