One Step Closer to Real-Time Detection of Missed Opportunities for Diagnosis in the ED Using LLMs

Objective: To evaluate the effectiveness of large language models (LLMs) in identifying missed opportunities for diagnosis in the emergency department (ED).

Study Design: Six commercially available LLMs were evaluated on their ability to identify missed opportunities for diagnosis in 288 emergency department encounters.

The overall prevalence of missed opportunities for diagnosis was 13.5%.
Areas under the receiver operating characteristic curve (AUCs) ranged from 0.65 to 0.73 in the 72-hour return cohort and from 0.57 to 0.82 in the floor-to-ICU cohort.
Models exhibited different sensitivity-specificity tradeoffs, with Claude Sonnet 4 favoring sensitivity and GPT-5 mini favoring specificity.
Physician interrater agreement was 81.9%, indicating that even expert reviewers disagreed on a substantial minority of cases.
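
To illustrate why the sensitivity-specificity tradeoff matters at this prevalence, the following minimal sketch applies Bayes' rule to convert an operating point into positive and negative predictive values. Only the 13.5% prevalence is taken from the results above. The two operating points and the names `predictive_values` and `OPERATING_POINTS` are hypothetical illustrations, not the measured performance of any model in the study.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) for a binary screen using Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


PREVALENCE = 0.135  # overall prevalence reported in the study

# ASSUMPTION: illustrative (sensitivity, specificity) pairs, not study results.
OPERATING_POINTS = {
    "sensitivity-favoring": (0.90, 0.50),
    "specificity-favoring": (0.50, 0.90),
}

if __name__ == "__main__":
    for name, (sens, spec) in OPERATING_POINTS.items():
        ppv, npv = predictive_values(sens, spec, PREVALENCE)
        print(f"{name}: PPV={ppv:.2f}, NPV={npv:.2f}")
```

Under these assumed values, the sensitivity-favoring screen flags more encounters for review, but a smaller share of its flags are true missed opportunities. This is the kind of workload tradeoff a screening deployment would need to weigh.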

The study indicates that LLMs can detect missed opportunities for diagnosis, a low-prevalence outcome, from unstructured clinical notes.

The study design was retrospective.
Aggregate discrimination metrics alone are insufficient to determine model appropriateness for clinical tasks.
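
The point about aggregate discrimination metrics can be sketched with a toy example. Two sets of scores that rank cases identically have the same AUC, yet behave very differently at a fixed decision threshold. All scores, labels, and the 0.5 threshold below are invented for illustration and are not data from the study. The AUC is computed as the fraction of positive-negative pairs ranked correctly, which is the standard rank-based definition.

```python
def auc(scores: list[float], labels: list[int]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(scores: list[float], labels: list[int], threshold: float) -> tuple[float, float]:
    """Sensitivity and specificity when cases with score >= threshold are flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return tp / labels.count(1), tn / labels.count(0)


# ASSUMPTION: toy data. 1 = missed opportunity present, 0 = absent.
LABELS = [1, 1, 0, 0, 0, 0]
MODEL_A = [0.90, 0.60, 0.70, 0.55, 0.30, 0.20]  # scores skew high
MODEL_B = [0.60, 0.30, 0.40, 0.25, 0.10, 0.05]  # same ranking, scores skew low

if __name__ == "__main__":
    for name, scores in (("A", MODEL_A), ("B", MODEL_B)):
        sens, spec = sens_spec(scores, LABELS, threshold=0.5)
        print(f"Model {name}: AUC={auc(scores, LABELS):.3f}, sens={sens:.2f}, spec={spec:.2f}")
```

In this toy case both score sets share one AUC, while the fixed threshold makes one behave as a sensitivity-favoring screen and the other as a specificity-favoring one. Operating-point metrics therefore need to be reported alongside AUC.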

The findings indicate progress towards the deployment of LLM-based screening tools for real-time diagnostic safety in emergency medicine.