AI Scribes Lag Clinicians on Note Quality

A VHA study across 11 vendors finds AI-generated primary care notes score lower than clinician-written notes, with the largest deficits in thoroughness, organization, and usefulness

By Kerri Miller
April 17, 2026

Conexiant

Objective:

To compare the quality of notes generated by AI scribe tools with that of notes written by clinicians in primary care scenarios.

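Approach:

Notes from AI scribe tools supplied by 11 vendors were compared with clinician-written notes across five simulated primary care cases, with raters scoring note quality using the PDQI-9.
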
Key Findings:

Human-generated notes scored higher than AI-generated notes across all five cases, with statistically significant differences noted in three scenarios.
The largest gap emerged in the acute low back pain scenario, where human notes averaged 43.8 points compared with 20.3 points for AI-generated notes.
AI-generated notes scored lower in all 10 quality domains, with the largest deficits in thoroughness, organization, and usefulness.

Interpretation:

The study indicates that although AI scribes may improve efficiency, they currently produce lower-quality documentation than human clinicians, a gap that could affect patient care.

Limitations:

Simulated cases may not reflect real-world clinical complexity.
Human notes were not produced in typical clinical workflows.
Rater blinding may have been imperfect, and the PDQI-9 may not fully capture AI-specific errors.
Vendors were not permitted to generate multiple iterations of notes, which may have affected the AI tools' scores.

Conclusion:

AI scribes should be used to generate draft documentation that requires thorough clinician review and editing, rather than replacing clinician-authored notes.