Which current chatbot is more competent in urological theoretical knowledge? A comparative analysis by the European board of urology in-service assessment

By
Mehmet Fatih Şahin
Çağrı Doğan
Erdem Can Topkaç
Serkan Şeramet
Furkan Batuhan Tuncer
Cenk Murat Yazıcı
February 11, 2025

World Journal of Urology

Overview

This study evaluated five advanced chatbots on European Board of Urology In-Service Assessment questions. Copilot Pro was the top performer with a 71.6% overall success rate, followed by Gemini Advanced (68.5%) and GPT-4o (65.8%), with each chatbot showing different strengths across urological subtopics. The findings indicate what current AI systems can do on a standardized assessment of theoretical urological knowledge.

Background

Chatbots powered by large language models are increasingly used by patients and clinicians for medical information retrieval. While these AI systems can access vast data, clinical success also depends on interpreting complex patient information, a skill traditionally unique to humans. The European Board of Urology (EBU) In-Service Assessment (ISA) provides a standardized, high-level test of urological knowledge, making it an ideal benchmark to evaluate chatbot proficiency. This study aimed to compare the performance of five licensed chatbots on EBU ISA questions to assess their theoretical knowledge and interpretative abilities in urology.

Data Highlights

Chatbot	Overall Success Rate (%)	Exam 1 (%)	Exam 2 (%)	Exam 3 (%)	Top Subtopic Performance (%)
Copilot Pro	71.6	Not specified	Not specified	Not specified	Transplantation/Nephrology 77.8 (100 in Exam 2), Pediatrics/Congenital 75.0, Andrology/Infertility 72.7
Gemini Advanced	68.5	Not specified	Not specified	Not specified	Miscellaneous 81.1
GPT-4o	65.8	71.4	Not specified	56.5	Lithiasis/Infections 73.7, Miscellaneous 73.0

Key Findings

Copilot Pro achieved the highest overall success rate (71.6%), passed all three exams, and scored 100% on the transplantation/nephrology questions in Exam 2.
GPT-4o passed all three exams with a 65.8% overall success rate, performing best in lithiasis/infections and miscellaneous but showing lower accuracy in trauma/emergency and transplantation/nephrology.
Gemini Advanced ranked second overall with 68.5% and achieved the highest score in the miscellaneous subtopic (81.1%).
Performance varied considerably across urological subtopics; trauma/emergency and transplantation/nephrology were challenging for most chatbots.
The study utilized 596 multiple-choice questions from three EBU ISA exams, ensuring a comprehensive assessment of theoretical knowledge aligned with current EAU guidelines.
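
To make the reported percentages more concrete, the short Python sketch below converts each overall success rate into an approximate number of correct answers. It assumes that every chatbot was scored on all 596 questions, which the summary does not state explicitly, so the counts are back-calculated estimates and not figures reported by the study. The function name `implied_correct` is illustrative.

```python
# Total number of multiple-choice questions reported in the summary.
TOTAL_QUESTIONS = 596

# Overall success rates (%) as reported in the summary.
reported_rates = {
    "Copilot Pro": 71.6,
    "Gemini Advanced": 68.5,
    "GPT-4o": 65.8,
}


def implied_correct(rate_percent, total=TOTAL_QUESTIONS):
    """Estimate the number of correct answers behind a reported rate.

    Assumption: the rate was computed over all `total` questions.
    """
    return round(total * rate_percent / 100)


# Back-calculated (not reported) counts of correct answers per chatbot.
implied = {name: implied_correct(rate) for name, rate in reported_rates.items()}

for name, correct in implied.items():
    # Recompute the percentage from the estimated count as a sanity check.
    recomputed = 100 * correct / TOTAL_QUESTIONS
    print(f"{name}: ~{correct}/{TOTAL_QUESTIONS} correct ({recomputed:.1f}%)")
```

Under this assumption, the 5.8-percentage-point gap between Copilot Pro and GPT-4o corresponds to roughly 35 of the 596 questions.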

Clinical Implications

These findings suggest that advanced chatbots, particularly Copilot Pro, can reliably assist clinicians and trainees in accessing and reviewing urological knowledge based on standardized European guidelines. However, variability in performance across subtopics indicates that AI support should complement, not replace, expert clinical judgment, especially in complex or interpretative scenarios. Continuous updates and training of AI models are essential to improve their utility in clinical education and decision support.

Conclusion

Current state-of-the-art chatbots demonstrate promising proficiency in answering urological board exam questions, with Copilot Pro leading in overall accuracy. While AI can enhance knowledge acquisition, human expertise remains crucial for nuanced clinical interpretation.

Original Source(s)

Which current chatbot is more competent in urological theoretical knowledge? A comparative analysis by the European board of urology in-service assessment
