Development and Validation of a Machine Learning–Based Screening Algorithm to Predict High-Risk Hepatitis C Infection - Report - MDSpire

By Suk-Chan Jang, Wei-Hsuan Lo-Ciganic, Pilar Hernandez-Con, Chanakan Jenjai, James Huang, Ashley Stultz, Shunhua Yan, Debbie L Wilson, Ashley Norse, Faheem W Guirgis, Robert L Cook, Christine Gage, Khoa A Nguyen, Patrick Hornes, Yonghui Wu, David R Nelson, Haesuk Park

August 15, 2025

Machine Learning Algorithm Accurately Identifies High-Risk Hepatitis C Patients

Overview

A machine learning-based algorithm was developed and validated using a large electronic health record database to predict individuals at high risk for hepatitis C virus (HCV) infection. The gradient boosting machine (GBM) model performed best, with a C statistic of 0.916. At the threshold selected by the Youden index it reached 79.39% sensitivity and 89.08% specificity, and 75.63% of HCV-positive patients fell in its top risk decile.

Background

Hepatitis C virus infection is a leading cause of liver-related morbidity and mortality in the United States, with rising incidence linked to the opioid epidemic and injection drug use. Despite effective antiviral treatments, many individuals remain undiagnosed due to the asymptomatic nature of HCV. Universal screening is recommended but presents practical challenges, highlighting the need for targeted, efficient screening methods. Machine learning applied to electronic health records offers a promising approach to identify high-risk individuals for focused testing.

Data Highlights

Model                              C Statistic (95% CI)    Sensitivity (%)    Specificity (%)
Gradient Boosting Machine (GBM)    0.916 (0.911–0.921)     79.39              89.08
Elastic Net (EN)                   0.885 (0.879–0.891)     Not reported       Not reported
Random Forest (RF)                 0.854 (0.847–0.861)     Not reported       Not reported
Deep Neural Network (DNN)          0.908 (0.903–0.913)     Not reported       Not reported
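
For readers less familiar with the C statistic reported in the table, the following is a minimal sketch of what the number measures: the share of (HCV-positive, HCV-negative) pairs in which the model gives the positive patient the higher risk score, with ties counted as half. The scores below are hypothetical toy values chosen for illustration, not study data, and the function name is an assumption of this sketch.

```python
from itertools import product


def c_statistic(scores_pos, scores_neg):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    concordant = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            concordant += 1.0
        elif p == n:
            concordant += 0.5
    return concordant / (len(scores_pos) * len(scores_neg))


# Hypothetical risk scores (illustrative only, not taken from the study).
toy_pos = [0.9, 0.7, 0.4]       # patients who tested HCV-positive
toy_neg = [0.8, 0.3, 0.2, 0.1]  # patients who tested HCV-negative

toy_c = c_statistic(toy_pos, toy_neg)
print(f"C statistic on toy data: {toy_c:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the GBM's 0.916 means it ranks a randomly chosen positive patient above a randomly chosen negative one in roughly 92% of pairs.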

Key Findings

  • Among 445,624 individuals tested, 2.65% (11,823) were positive for HCV.
  • The GBM model outperformed EN, RF, and DNN models with the highest C statistic of 0.916 (P < .0001).
  • At the threshold selected by the Youden index, GBM achieved 79.39% sensitivity and 89.08% specificity, identifying about one positive HCV case per six patients flagged for testing.
  • 75.63% of HCV-positive patients were captured in the top risk decile, and 90.25% within the top three deciles.
  • 275 sociodemographic and clinical features from the 6 months prior to testing were used as predictors.
  • ML algorithms can effectively stratify HCV infection risk to support targeted screening in clinical practice.
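
The "one positive case per six tests" figure above can be checked from the reported sensitivity, specificity, and positivity rate. The sketch below assumes the 2.65% positivity rate of the full tested cohort also applies to the population on which sensitivity and specificity were measured; the summary does not state this, so treat the result as an approximate consistency check rather than the authors' calculation. The function name is hypothetical.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a patient flagged as high risk is truly HCV-positive."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Figures reported in the summary.
prevalence = 11_823 / 445_624   # about 2.65% of tested individuals
sens, spec = 0.7939, 0.8908     # GBM at the Youden-index threshold

ppv = positive_predictive_value(sens, spec, prevalence)
tests_per_case = 1 / ppv        # flagged patients tested per positive found
youden_j = sens + spec - 1      # Youden index: J = sensitivity + specificity - 1

print(f"PPV: {ppv:.3f}")
print(f"Tests per positive case: {tests_per_case:.2f}")
print(f"Youden J: {youden_j:.4f}")
```

Under that assumption the positive predictive value comes out near 0.165, which is roughly one true positive for every six flagged patients and matches the reported figure.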

Clinical Implications

The GBM machine learning model can be integrated into clinical workflows to identify patients at high risk for HCV infection, enabling more efficient and targeted screening. This approach may reduce unnecessary testing and resource use associated with universal screening while improving early detection and linkage to curative treatment. Incorporating such predictive tools can help address the ongoing HCV epidemic by identifying undiagnosed cases.
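
To make the efficiency argument concrete, the sketch below compares the yield of testing only the highest-risk deciles with testing everyone. It uses the capture rates reported in the Key Findings and assumes, for illustration, that each decile holds exactly one tenth of the 445,624 tested individuals and that the capture rates apply to that whole cohort. Neither assumption is confirmed by the summary, and the helper name is hypothetical.

```python
TESTED = 445_624
POSITIVES = 11_823

# Number of top risk deciles screened -> share of HCV-positive patients captured.
CAPTURE = {1: 0.7563, 3: 0.9025}


def screening_yield(n_deciles, captured_share):
    """Return (tests run, positives found, positivity rate) for a targeted strategy."""
    n_tests = TESTED * n_deciles / 10
    n_found = POSITIVES * captured_share
    return n_tests, n_found, n_found / n_tests


baseline_rate = POSITIVES / TESTED  # positivity rate when everyone is tested

for k, share in CAPTURE.items():
    n_tests, n_found, rate = screening_yield(k, share)
    lift = rate / baseline_rate
    print(f"Top {k} decile(s): {n_tests:,.0f} tests, {n_found:,.0f} positives, "
          f"{rate:.1%} positive ({lift:.1f}x the rate of testing everyone)")
```

Under these assumptions, testing only the top decile finds about three quarters of positives with one tenth of the tests, at the cost of missing the remaining quarter. That trade-off is what the choice of risk cutoff controls.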

Conclusion

Machine learning algorithms, particularly gradient boosting machines, provide a robust method to predict and stratify hepatitis C infection risk using electronic health record data. This targeted screening tool holds promise for enhancing HCV detection and improving public health outcomes.

References

  1. OneFlorida+ Database Study 2016–2023 -- Creation and Assessment of a Machine Learning Algorithm for Identifying Individuals at High Risk for Hepatitis C Infection
