Using artificial intelligence to generate medical literature for urology patients: a comparison of three different large language models - Report - MDSpire

By David Pompili, Yasmina Richa, Patrick Collins, Helen Richards, Derek B Hennessey

July 29, 2024

Evaluating Large Language Models for Urology Patient Information Leaflets

Overview

This study assessed the quality and readability of patient information leaflets (PILs) generated by three large language models (LLMs), ChatGPT-4, PaLM 2, and Llama 2, across four common urological topics. PaLM 2 produced the highest overall mean quality score (3.58), while Llama 2 scored highest on the leaflet for transurethral resection of the prostate (TURP). Readability levels varied and were measured as the average of seven validated formulas.

Background

Large language models (LLMs) have shown promise in generating human-like text and may assist in reducing clinical workloads by producing patient education materials. In urology, prior research has focused mainly on short answers or clinical vignettes, with limited exploration of extended patient literature. This study aims to fill that gap by comparing three popular LLMs in generating accurate, understandable patient information leaflets for common urological procedures and conditions. Readability is also a critical factor to ensure accessibility for patients with varying literacy levels.

Data Highlights

LLM       | Overall Mean Quality Score | Highest Scoring Topic | Highest Topic Score
PaLM 2    | 3.58                       | Circumcision          | 3.95
Llama 2   | 3.34                       | TURP / Circumcision   | 3.5
ChatGPT-4 | 3.08                       | Circumcision          | 3.55
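
The mean quality scores above are averages of Likert ratings: as Key Findings notes, a blinded clinician panel scored each leaflet on a 20-item checklist with a 5-point scale. The sketch below shows one plausible way such a mean could be computed. The function name `mean_quality_score` and the sample ratings are hypothetical, and the study's exact aggregation method is not stated in this summary.

```python
from statistics import mean

def mean_quality_score(ratings):
    """Average Likert ratings (1-5) over all raters and checklist items.

    `ratings` is a list of per-rater lists, each holding one score per item.
    Pooling every rating into a single mean is an assumption, not the
    study's documented method.
    """
    flat = [score for rater in ratings for score in rater]
    if any(not 1 <= s <= 5 for s in flat):
        raise ValueError("Likert scores must lie between 1 and 5")
    return round(mean(flat), 2)

# Hypothetical ratings: two raters, four checklist items each (not study data).
example = [[4, 3, 5, 4], [3, 3, 4, 4]]
print(mean_quality_score(example))  # 3.75
```

A real panel would rate all 20 items per leaflet; four items are used here only to keep the example short.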

Key Findings

  • PaLM 2 generated the highest overall quality PILs (mean score 3.58), outperforming Llama 2 and ChatGPT-4 in most topics.
  • Llama 2 achieved the highest quality score for TURP PILs (3.5), surpassing PaLM 2 and ChatGPT-4 in this specific procedure.
  • Circumcision PILs received the highest quality scores overall, particularly from PaLM 2 (3.95) and ChatGPT-4 (3.55).
  • Quality was scored by a blinded panel of clinicians using a 20-item checklist rated on a 5-point Likert scale.
  • Readability was reported as the average of seven validated readability formulas, so that leaflets could be judged against a range of patient literacy levels.
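
To illustrate what averaging readability formulas involves, here is a minimal sketch that computes two widely used grade-level formulas, the Flesch-Kincaid Grade Level and the Automated Readability Index, and takes their mean. This summary does not name the seven formulas the study used, so the choice of these two is an assumption, as are the function names and the crude vowel-run syllable counter.

```python
import re

def count_syllables(word):
    # Crude heuristic (an assumption): one syllable per run of vowels, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def text_stats(text):
    # Returns (sentence count, word count, syllable count, letter count).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    letters = sum(len(w) for w in words)
    return len(sentences), len(words), syllables, letters

def flesch_kincaid_grade(text):
    s, w, syl, _ = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59

def automated_readability_index(text):
    s, w, _, letters = text_stats(text)
    return 4.71 * (letters / w) + 0.5 * (w / s) - 21.43

def mean_grade_level(text):
    # Average of the grade-level formulas; the study averaged seven,
    # only two are sketched here.
    formulas = (flesch_kincaid_grade, automated_readability_index)
    return sum(f(text) for f in formulas) / len(formulas)

# Hypothetical leaflet text, not taken from the study.
sample = "You may go home the same day. Drink plenty of water after the procedure."
print(round(mean_grade_level(sample), 1))
```

Both formulas return an approximate US school grade, which is why they can be averaged; a score on a different scale, such as Flesch Reading Ease, would need converting first.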

Clinical Implications

LLMs such as PaLM 2 and Llama 2 show potential to generate high-quality, understandable patient education materials in urology, which may help clinicians improve patient communication and reduce workload. However, variability between models and topics suggests that clinician review is needed before clinical use. Readability assessment helps ensure that materials are accessible to diverse patient populations, supporting informed decision-making.

Conclusion

This study demonstrates that mainstream LLMs can produce medically accurate and comprehensible urology patient information leaflets, with PaLM 2 generally providing the highest quality outputs. These findings support further integration of AI-generated educational content into clinical practice, with appropriate oversight.

References

  1. University College Cork Medical School Social Research Ethics Committee, 2023. Ethical approval for LLM urology PIL study.