Physician vs. AI-generated messages in urology: evaluation of accuracy, completeness, and preference by patients and physicians

By
Eric J. Robinson
Chunyuan Qiu
Stuart Sands
Mohammad Khan
Shivang Vora
Kenichiro Oshima
Khang Nguyen
L. Andrew DiFronzo
David Rhew
Mark I. Feng
December 27, 2024

World Journal of Urology

Overview

This study evaluated the accuracy, completeness, and tone of answers to common benign prostatic hyperplasia (BPH) questions, along with patient preference among them. The answers were generated by four urologists and two AI chatbots within a secure sandbox environment. Both AI chatbots, including a retrieval-augmented model, performed reliably and were comparable to the physicians in providing accurate and comprehensive responses. Patient and expert evaluations highlighted the potential of AI tools to support clinical communication in urology.

Background

Artificial intelligence, particularly large language models (LLMs), is increasingly integrated into healthcare to enhance physician-patient communication. Chatbots like ChatGPT can provide complex medical information but raise concerns about privacy and data security when used outside controlled environments. Benign prostatic hyperplasia (BPH), a prevalent urologic condition affecting men over 50, serves as an ideal test case for chatbot evaluation due to its high message volume and clinical relevance. This study leveraged a sandbox environment to securely test AI-generated responses to real-world patient questions across the BPH care continuum.

Data Highlights

Twenty common BPH-related patient questions were answered by four board-certified urologists and two AI chatbots (Kaiser Permanente GPT and SurgiChat). Two urologist subject matter experts and five male volunteers aged 56–82 rated the responses for accuracy, completeness, tone, and preference on Likert scales. The sandbox environment ensured that no patient health information was exposed during testing. The chatbots were prompted to answer specifically and to incorporate their applied sources, and some disclaimers were removed from their answers to maintain evaluator blinding.

Key Findings

Both AI chatbots produced answers with accuracy and completeness comparable to those of experienced urologists based on expert grading.
The retrieval-augmented chatbot (SurgiChat) leveraged authoritative BPH literature to enhance response quality within the sandbox environment.
Patient volunteers rated chatbot responses favorably in terms of tone and clarity, indicating good acceptance of AI-generated communications.
Use of a secure sandbox environment allowed robust testing of AI tools without risking patient data privacy or security.
Chatbots demonstrated the ability to handle open-ended, personalized, and patient-specific questions across the perioperative BPH care spectrum.

Clinical Implications

AI chatbots, when integrated within secure healthcare environments, can reliably support physician-patient communication by providing accurate, comprehensive, and empathetic information on common urologic conditions like BPH. Their use may enhance patient education and engagement while reducing clinician workload in managing routine inquiries. However, careful implementation with attention to data security and clinical oversight remains essential.

Conclusion

This study provides early evidence that AI chatbots can effectively complement physician communication in urology by delivering accurate and complete information in a patient-centered manner. Secure sandbox testing frameworks enable safe evaluation and future integration of such technologies into clinical practice.
