Using artificial intelligence to generate medical literature for urology patients: a comparison of three different large language models

By
David Pompili
Yasmina Richa
Patrick Collins
Helen Richards
Derek B Hennessey
July 29, 2024

World Journal Of Urology

Overview

This study assessed the quality and readability of patient information leaflets (PILs) generated by three large language models (LLMs), ChatGPT-4, PaLM 2, and Llama 2, across four common urological topics. PaLM 2 produced the highest overall mean quality score (3.58), while Llama 2 scored highest on the leaflet for transurethral resection of the prostate (TURP) (3.5). Readability was evaluated using the average of seven readability formulas.

Background

Large language models (LLMs) have shown promise in generating human-like text and may assist in reducing clinical workloads by producing patient education materials. In urology, prior research has focused mainly on short answers or clinical vignettes, with limited exploration of extended patient literature. This study aims to fill that gap by comparing three popular LLMs in generating accurate, understandable patient information leaflets for common urological procedures and conditions. Readability is also a critical factor to ensure accessibility for patients with varying literacy levels.

Data Highlights

LLM	Overall Mean Quality Score	Highest Scoring Topic	Highest Topic Score
PaLM 2	3.58	Circumcision	3.95
Llama 2	3.34	TURP / Circumcision	3.5
ChatGPT-4	3.08	Circumcision	3.55
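
As a rough illustration of how a mean quality score like those in the table might be derived from a 20-item checklist rated on a 5-point Likert scale, the Python sketch below averages the ratings for each rater and then averages across raters. The aggregation order, the function name `leaflet_score`, and the example ratings are assumptions made for illustration; the source does not describe the exact calculation or publish the raw ratings.

```python
from statistics import mean

N_ITEMS = 20               # checklist length reported in the study
LIKERT_MIN, LIKERT_MAX = 1, 5  # 5-point Likert scale


def leaflet_score(ratings_by_rater):
    """Return the mean Likert rating for one leaflet.

    ratings_by_rater: one list of N_ITEMS integer ratings per rater.
    Averaging per rater first, then across raters, is an assumption.
    """
    per_rater_means = []
    for ratings in ratings_by_rater:
        if len(ratings) != N_ITEMS:
            raise ValueError(f"expected {N_ITEMS} ratings, got {len(ratings)}")
        if any(not LIKERT_MIN <= r <= LIKERT_MAX for r in ratings):
            raise ValueError("ratings must lie on the 1-5 Likert scale")
        per_rater_means.append(mean(ratings))
    return mean(per_rater_means)


# Hypothetical ratings from two raters for a single leaflet.
example_ratings = [
    [4] * 10 + [3] * 10,  # rater A: mean 3.5
    [4] * 20,             # rater B: mean 4.0
]
print(round(leaflet_score(example_ratings), 2))  # 3.75
```

An overall score per model could then be taken as the mean of its leaflet scores across the four topics, again as an assumption about the study's method.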

Key Findings

PaLM 2 generated the highest overall quality PILs (mean score 3.58), outperforming Llama 2 and ChatGPT-4 in most topics.
Llama 2 achieved the highest quality score for TURP PILs (3.5), surpassing PaLM 2 and ChatGPT-4 in this specific procedure.
Circumcision PILs received the highest quality scores overall, particularly from PaLM 2 (3.95) and ChatGPT-4 (3.55).
Quality was scored by a blinded panel of clinicians using a 20-item checklist with a 5-point Likert scale.
Readability was assessed as the average of seven validated readability formulas, giving an estimate of how accessible each leaflet is to patients with different literacy levels.
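
To make the readability step more concrete, the sketch below computes two widely used measures, the Flesch-Kincaid Grade Level and the Automated Readability Index, and averages them. This is only an illustration: the source does not name the seven formulas used, so these two are stand-ins, and the syllable counter is a crude vowel-group heuristic rather than a validated tool. All function names are hypothetical.

```python
import re
from statistics import mean


def count_syllables(word):
    """Crude heuristic: one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text):
    """Return (sentences, words, syllables, letters) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and word")
    syllables = sum(count_syllables(w) for w in words)
    letters = sum(len(w) for w in words)  # approximates character count
    return len(sentences), len(words), syllables, letters


def flesch_kincaid_grade(text):
    s, w, syl, _ = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59


def automated_readability_index(text):
    s, w, _, letters = text_stats(text)
    return 4.71 * (letters / w) + 0.5 * (w / s) - 21.43


def mean_grade_level(text):
    """Average of the grade-level formulas above (two stand-ins, not seven)."""
    return mean([flesch_kincaid_grade(text), automated_readability_index(text)])


simple = "The cat sat. The dog ran. We had fun."
dense = ("Transurethral resection of the prostate necessitates "
         "comprehensive preoperative anaesthetic evaluation.")
print(mean_grade_level(simple) < mean_grade_level(dense))  # True
```

Only formulas that report a grade level are averaged here, because mixing them with a 0 to 100 score such as Flesch Reading Ease would not give a meaningful mean.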

Clinical Implications

LLMs such as PaLM 2 and Llama 2 show potential to generate high-quality, understandable patient education materials in urology, which may help clinicians improve patient communication and reduce workload. However, quality varied between models and topics, so clinician review is needed before clinical use. Readability assessment helps check that materials are accessible to patients with differing literacy levels, supporting informed decision-making.

Conclusion

This study demonstrates that mainstream LLMs can produce medically accurate and comprehensible urology patient information leaflets, with PaLM 2 generally providing the highest quality outputs. These findings support further integration of AI-generated educational content into clinical practice, with appropriate oversight.

References

University College Cork Medical School Social Research Ethics Committee 2023 -- Ethical approval for LLM urology PIL study

Using artificial intelligence to generate medical literature for urology patients: a comparison of three different large language models

