Cracks in the AI Crystal Ball: Why Clinical Prediction Tools Fall Short in the Real World

By David Gamble, Andrew Wong, Amiran Baduashvili
June 22, 2026

Journal of General Internal Medicine

Background

AI-driven predictive tools are increasingly built into electronic health records (EHRs) and used in routine clinical practice. However, their accuracy and reliability outside the setting in which they were developed remain uncertain. Clinicians who rely on these models for decision-making need to understand their limitations.

Data Highlights

Model	Vendor AUROC	Pooled AUROC
Sepsis Model	0.77	0.62
End-of-Life Care Index	0.89	0.76
Patient No-Show Model	0.77	0.62
Unplanned Readmission Model	0.74	0.70
Deterioration Index	0.80	0.79
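
To make the size of each gap concrete, the short sketch below subtracts the pooled AUROC from the vendor AUROC for every row of the table. The values are copied from the table; the names aurocs and auroc_gaps are illustrative and not part of the original analysis.

```python
# AUROC values copied from the table above: (vendor-reported, pooled).
aurocs = {
    "Sepsis Model": (0.77, 0.62),
    "End-of-Life Care Index": (0.89, 0.76),
    "Patient No-Show Model": (0.77, 0.62),
    "Unplanned Readmission Model": (0.74, 0.70),
    "Deterioration Index": (0.80, 0.79),
}


def auroc_gaps(table):
    """Return vendor minus pooled AUROC for each model, rounded to 2 decimals."""
    return {
        name: round(vendor - pooled, 2)
        for name, (vendor, pooled) in table.items()
    }


gaps = auroc_gaps(aurocs)

# List models from largest to smallest gap.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap:.2f}")
```

On these figures, the gap ranges from 0.01 (Deterioration Index) to 0.15 (Sepsis Model and Patient No-Show Model).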

Key Findings

For all five models, the pooled AUROC estimate was lower than the benchmark reported by the vendor, Epic.
For sepsis, readmission, and end-of-life models, the 95% confidence intervals around pooled estimates did not overlap with Epic's benchmarks.
Every model exhibited high heterogeneity, indicating performance variability across healthcare settings.
Data leakage and model drift are significant factors contributing to the degradation of model performance post-deployment.
Clinicians face ethical uncertainties regarding the reliance on AI outputs for patient care decisions.
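
As an illustration of how a pooled estimate, its 95% confidence interval, and a heterogeneity statistic can be derived from site-level results, here is a minimal sketch of DerSimonian-Laird random-effects pooling with the I² statistic. This is an assumption about method: the summary does not state which pooling approach the authors used. The site-level AUROCs and standard errors are hypothetical, and the pooling is done on the raw AUROC scale for simplicity (a logit transform is a common alternative). Only the 0.77 benchmark comes from the table.

```python
import math


def pool_random_effects(estimates, std_errors):
    """DerSimonian-Laird pooling. Returns (pooled, ci_low, ci_high, i_squared_pct)."""
    k = len(estimates)
    # Inverse-variance (fixed-effect) weights and estimate.
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum_w

    # Cochran's Q measures dispersion of site estimates around the fixed estimate.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))
    df = k - 1

    # Between-site variance (tau^2), truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to each site's variance.
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * y for wi, y in zip(w_re, estimates)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))

    # I^2: share of total variation attributable to between-site differences.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled, i2


# Hypothetical site-level AUROCs and standard errors (not from the study).
site_aurocs = [0.55, 0.63, 0.70, 0.60, 0.66]
site_ses = [0.02, 0.03, 0.02, 0.025, 0.03]
vendor_benchmark = 0.77  # sepsis model benchmark from the table

pooled, ci_low, ci_high, i2 = pool_random_effects(site_aurocs, site_ses)
benchmark_outside_ci = not (ci_low <= vendor_benchmark <= ci_high)
print(f"pooled={pooled:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f}), I2={i2:.0f}%")
print("benchmark outside CI:", benchmark_outside_ci)
```

When the benchmark falls outside the interval, as it does for these made-up inputs, the pooled real-world performance is statistically distinguishable from the vendor's figure. A high I² indicates that the sites disagree with each other more than sampling error alone would explain.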

Clinical Implications

Clinicians should be aware of the discrepancies between model performance in development and real-world application. Continuous evaluation and validation of these tools are necessary.
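
One way continuous evaluation could look in practice is to recompute the AUROC on each new batch of local outcomes and flag batches that fall well below the benchmark. The sketch below is a hypothetical illustration, not a procedure described in the article: the function names, the 0.05 tolerance, the 0.80 benchmark, and the tiny example windows are all assumptions.

```python
def auroc(labels, scores):
    """AUROC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def flag_degraded_windows(windows, benchmark, tolerance=0.05):
    """Return (name, auroc) for windows scoring more than `tolerance` below `benchmark`."""
    flagged = []
    for name, labels, scores in windows:
        value = auroc(labels, scores)
        if value < benchmark - tolerance:
            flagged.append((name, value))
    return flagged


# Hypothetical evaluation windows: (name, observed outcomes, model risk scores).
windows = [
    ("window_1", [1, 1, 0, 0, 1, 0], [0.9, 0.8, 0.3, 0.2, 0.7, 0.4]),
    ("window_2", [1, 0, 1, 0, 1, 0], [0.6, 0.7, 0.4, 0.5, 0.8, 0.3]),
]

flagged = flag_degraded_windows(windows, benchmark=0.80)
for name, value in flagged:
    print(f"{name}: AUROC {value:.2f} is below the tolerated range")
```

Real windows would need far more cases than this toy data for the AUROC to be stable, and the tolerance should reflect the sampling uncertainty at the local site.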

Conclusion

The findings show that AI prediction tools in clinical practice perform below their vendor-reported benchmarks and vary widely across healthcare settings. Further investigation is needed into their real-world effectiveness and the factors that drive model performance.