Development and Validation of a Machine Learning–Based Screening Algorithm to Predict High-Risk Hepatitis C Infection

By
Suk-Chan Jang
Wei-Hsuan Lo-Ciganic
Pilar Hernandez-Con
Chanakan Jenjai
James Huang
Ashley Stultz
Shunhua Yan
Debbie L Wilson
Ashley Norse
Faheem W Guirgis
Robert L Cook
Christine Gage
Khoa A Nguyen
Patrick Hornes
Yonghui Wu
David R Nelson
Haesuk Park
August 15, 2025

Open Forum Infectious Diseases

Overview

A machine learning-based algorithm was developed and validated using a large electronic health record database to predict individuals at high risk for hepatitis C virus (HCV) infection. Of the four models compared, the gradient boosting machine performed best, with a C statistic of 0.916, 79.39% sensitivity, and 89.08% specificity, and it stratified patients effectively by risk.

Background

Hepatitis C virus infection is a leading cause of liver-related morbidity and mortality in the United States, with rising incidence linked to the opioid epidemic and injection drug use. Despite effective antiviral treatments, many individuals remain undiagnosed due to the asymptomatic nature of HCV. Universal screening is recommended but presents practical challenges, highlighting the need for targeted, efficient screening methods. Machine learning applied to electronic health records offers a promising approach to identify high-risk individuals for focused testing.

Data Highlights

Model	C Statistic (95% CI)	Sensitivity (%)	Specificity (%)
Gradient Boosting Machine (GBM)	0.916 (0.911–0.921)	79.39	89.08
Elastic Net (EN)	0.885 (0.879–0.891)	Not reported	Not reported
Random Forest (RF)	0.854 (0.847–0.861)	Not reported	Not reported
Deep Neural Network (DNN)	0.908 (0.903–0.913)	Not reported	Not reported

Key Findings

Among 445,624 individuals tested, 2.65% (11,823) were positive for HCV.
The GBM model outperformed EN, RF, and DNN models with the highest C statistic of 0.916 (P < .0001).
At the threshold selected by the Youden index, GBM achieved 79.39% sensitivity and 89.08% specificity, identifying one positive HCV case per six tests.
The top risk decile captured 75.63% of HCV-positive patients, and the top three deciles captured 90.25%.
The predictors were 275 sociodemographic and clinical features drawn from the 6 months before testing.
ML algorithms can effectively stratify HCV infection risk to support targeted screening in clinical practice.
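
The "one positive case per six tests" figure can be checked against the other reported numbers. The sketch below is an illustration rather than the study's own calculation: it assumes the overall cohort prevalence (11,823 of 445,624) applies to the group being scored, and the function names are hypothetical.

```python
def youden_j(sensitivity: float, specificity: float) -> float:
    """Youden index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Share of model-flagged patients who are truly positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Figures reported in the summary above.
prevalence = 11_823 / 445_624      # about 2.65% tested positive
sens, spec = 0.7939, 0.8908        # GBM at the Youden-index threshold

j = youden_j(sens, spec)
ppv = positive_predictive_value(sens, spec, prevalence)
tests_per_positive = 1.0 / ppv     # flagged patients tested per positive found

print(f"Youden J = {j:.4f}")
print(f"PPV = {ppv:.3f}, tests per positive = {tests_per_positive:.1f}")
```

Under that prevalence assumption the positive predictive value comes out near 0.165, or roughly six flagged patients tested per positive case, which is consistent with the reported figure.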

Clinical Implications

The GBM machine learning model can be integrated into clinical workflows to identify patients at high risk for HCV infection, enabling more efficient and targeted screening. This approach may reduce unnecessary testing and resource use associated with universal screening while improving early detection and linkage to curative treatment. Incorporating such predictive tools can help address the ongoing HCV epidemic by identifying undiagnosed cases.
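
The decile-based risk stratification reported under Key Findings could be computed along the following lines once a model has assigned each patient a risk score. This is a minimal sketch, not the authors' code: the function name `cumulative_capture_by_decile` is hypothetical, and the scores and labels are toy values invented for illustration.

```python
def cumulative_capture_by_decile(scores, labels):
    """Return 10 values: the fraction of all positives found in the
    top 1, top 2, ..., top 10 risk deciles (highest scores first)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_pos = sum(label for _, label in ranked)
    if total_pos == 0:
        raise ValueError("no positive cases to capture")
    captured = []
    for k in range(1, 11):
        cutoff = (n * k) // 10          # number of patients in the top k deciles
        pos_in_top = sum(label for _, label in ranked[:cutoff])
        captured.append(pos_in_top / total_pos)
    return captured


# Toy data (illustrative only): 20 patients with distinct risk scores,
# of whom 4 are HCV-positive.
scores = [i / 20 for i in range(20)]
labels = [1 if i in {10, 16, 18, 19} else 0 for i in range(20)]

capture = cumulative_capture_by_decile(scores, labels)
print(f"top decile: {capture[0]:.0%}, top three deciles: {capture[2]:.0%}")
```

Applied to real model output, the first and third entries of the returned list correspond to the kind of figures reported above (75.63% and 90.25% for the GBM model).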

Conclusion

Machine learning algorithms, particularly gradient boosting machines, provide a robust method to predict and stratify hepatitis C infection risk using electronic health record data. This targeted screening tool holds promise for enhancing HCV detection and improving public health outcomes.

References

OneFlorida+ Database Study 2016–2023 -- Creation and Assessment of a Machine Learning Algorithm for Identifying Individuals at High Risk for Hepatitis C Infection


Original Source(s)

Development and Validation of a Machine Learning–Based Screening Algorithm to Predict High-Risk Hepatitis C Infection
