Using artificial intelligence to generate medical literature for urology patients: a comparison of three different large language models - Report - MDSpire
Evaluating Large Language Models for Urology Patient Information Leaflets
Overview
This study assessed the quality and readability of patient information leaflets (PILs) generated by three large language models (LLMs), ChatGPT-4, PaLM 2, and Llama 2, across four common urological topics. PaLM 2 produced the highest overall mean quality score (3.58), while Llama 2 scored highest on the leaflet for transurethral resection of the prostate (TURP). Readability levels varied and were measured as the average of seven validated formulas.
Background
Large language models (LLMs) have shown promise in generating human-like text and may assist in reducing clinical workloads by producing patient education materials. In urology, prior research has focused mainly on short answers or clinical vignettes, with limited exploration of extended patient literature. This study aims to fill that gap by comparing three popular LLMs in generating accurate, understandable patient information leaflets for common urological procedures and conditions. Readability is also a critical factor to ensure accessibility for patients with varying literacy levels.
Data Highlights
LLM          Overall Mean Quality Score    Highest Scoring Topic    Highest Topic Score
PaLM 2       3.58                          Circumcision             3.95
Llama 2      3.34                          TURP / Circumcision      3.5
ChatGPT-4    3.08                          Circumcision             3.55
Key Findings
PaLM 2 generated the highest overall quality PILs (mean score 3.58), outperforming Llama 2 and ChatGPT-4 in most topics.
Llama 2 achieved the highest quality score for TURP PILs (3.5), surpassing PaLM 2 and ChatGPT-4 in this specific procedure.
Circumcision PILs received the highest quality scores overall, particularly from PaLM 2 (3.95) and ChatGPT-4 (3.55).
Quality was scored by a blinded panel of clinicians using a 20-item checklist, with each item rated on a 5-point Likert scale.
Readability was reported as the average of seven validated readability formulas, to gauge whether the leaflets suit patients across literacy levels.
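The two scoring steps described above, averaging Likert ratings into a mean quality score and averaging several readability formulas, can be sketched in Python. This is only an illustration under stated assumptions: the report does not say how the panel's ratings were aggregated or which seven formulas were used, so the function names, the aggregation order, and the choice of three well-known grade-level formulas (Flesch-Kincaid Grade Level, Automated Readability Index, Coleman-Liau Index) are hypothetical stand-ins, and the syllable and sentence counts are rough heuristics.

```python
import re
from statistics import mean


def mean_quality_score(ratings):
    """Average Likert scores (1-5) over checklist items, then over raters.

    `ratings` is a list with one list of item scores per rater.
    The aggregation order is an assumption, not the study's stated method.
    """
    for rater_scores in ratings:
        if not rater_scores:
            raise ValueError("each rater needs at least one item score")
        if any(not 1 <= s <= 5 for s in rater_scores):
            raise ValueError("Likert scores must be between 1 and 5")
    return mean(mean(rater_scores) for rater_scores in ratings)


def _counts(text):
    """Return (sentences, words, letters, syllables) using rough heuristics."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    letters = sum(ch.isalpha() for w in words for ch in w)
    # Syllables are approximated as runs of vowels, minimum one per word.
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return len(sentences), len(words), letters, syllables


def flesch_kincaid_grade(text):
    s, w, _, syl = _counts(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59


def automated_readability_index(text):
    # Counts letters only; digits are ignored in this simplified version.
    s, w, letters, _ = _counts(text)
    return 4.71 * (letters / w) + 0.5 * (w / s) - 21.43


def coleman_liau_index(text):
    s, w, letters, _ = _counts(text)
    letters_per_100 = letters / w * 100
    sentences_per_100 = s / w * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


# Stand-in set of grade-level formulas; the study's seven are not named here.
GRADE_FORMULAS = (
    flesch_kincaid_grade,
    automated_readability_index,
    coleman_liau_index,
)


def mean_grade_level(text, formulas=GRADE_FORMULAS):
    """Average several grade-level estimates for one leaflet."""
    return mean(f(text) for f in formulas)
```

Only formulas that share a scale (here, approximate US school grade) are averaged together; a score on a different scale, such as Flesch Reading Ease, would need to be reported separately or converted first.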
Clinical Implications
LLMs such as PaLM 2 and Llama 2 show potential to generate high-quality, understandable patient education materials in urology, which may help clinicians improve patient communication and reduce workload. However, quality varied between models and topics, so clinician review is needed before clinical use. Readability assessment helps confirm that materials are accessible to diverse patient populations, supporting informed decision-making.
Conclusion
This study suggests that mainstream LLMs can produce medically accurate and comprehensible urology patient information leaflets, with PaLM 2 generally providing the highest-quality outputs. These findings support further integration of AI-generated educational content into clinical practice, with appropriate clinician oversight.
References
University College Cork Medical School Social Research Ethics Committee 2023 -- Ethical approval for LLM urology PIL study