Clinical Report: Safety Evaluation of Large Language Models for Patient Medical Advice
Overview
This physician-led study assessed the safety of four publicly available large language model chatbots (Claude, Gemini, GPT-4o, and Llama) using 222 patient medical questions. Problematic response rates ranged from 21.6% to 43.2% and unsafe response rates from 5% to 13%, indicating substantial variability between chatbots and potential risks in chatbot-provided medical advice.
Background
Millions of patients now use large language model (LLM) chatbots for medical advice, which raises concerns about patient safety. These models are used across primary care topics, including internal medicine, women’s health, and pediatrics. Despite their accessibility, the clinical reliability and safety of these tools remain uncertain. This study evaluates, both quantitatively and qualitatively, the safety of leading LLM chatbots in responding to patient inquiries.
Data Highlights
Chatbot | Problematic Response Rate (%) | Unsafe Response Rate (%)
Claude (Anthropic) | 21.6 | 5
Gemini (Google) | Not specified | Not specified
GPT-4o (OpenAI) | Not specified | 13
Llama-3.0/3.1-70B (Meta) | 43.2 | 13
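To make the percentages above more concrete, the following minimal sketch back-calculates approximate response counts from the reported rates. It assumes each chatbot answered all 222 questions and that the published percentages are rounded, so the counts are estimates rather than figures taken from the study. The names N_QUESTIONS, REPORTED_RATES, and approx_count are illustrative only.

```python
# Number of patient questions used in the study (from the report).
N_QUESTIONS = 222

# Reported rates in percent; None marks values the report does not specify.
REPORTED_RATES = {
    "Claude": {"problematic": 21.6, "unsafe": 5.0},
    "Gemini": {"problematic": None, "unsafe": None},
    "GPT-4o": {"problematic": None, "unsafe": 13.0},
    "Llama": {"problematic": 43.2, "unsafe": 13.0},
}


def approx_count(rate_percent, n=N_QUESTIONS):
    """Estimate a response count from a rounded percentage (assumption: n answers per chatbot)."""
    if rate_percent is None:
        return None
    return round(rate_percent / 100 * n)


# Estimated counts per chatbot; these are approximations, not study data.
approx_counts = {
    name: {kind: approx_count(rate) for kind, rate in rates.items()}
    for name, rates in REPORTED_RATES.items()
}

for name, counts in approx_counts.items():
    print(name, counts)
```

Under these assumptions, a 21.6% problematic rate corresponds to roughly 48 of 222 responses and a 43.2% rate to roughly 96 of 222.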
Key Findings
Claude demonstrated the lowest rate of problematic (21.6%) and unsafe (5%) responses among evaluated chatbots.
Llama exhibited the highest problematic response rate at 43.2% and an unsafe response rate of 13%.
GPT-4o and Llama each produced unsafe responses at a rate of 13%, indicating potential for serious patient harm.
Qualitative analysis revealed that some chatbot answers could lead to delayed or incorrect diagnoses, posing significant clinical risks.
The study underscores statistically significant differences in safety profiles across publicly available LLM chatbots.
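As an illustration of the uncertainty around rates like those listed above, the sketch below computes a 95% Wilson score interval for a proportion. It is not the study's own statistical method, which this report does not describe. The counts used (48 and 96 out of 222) are assumptions back-calculated from the reported 21.6% and 43.2% problematic rates, and wilson_interval is a hypothetical helper name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the (low, high) Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed counts, back-calculated from the reported percentages (not study data).
claude_low, claude_high = wilson_interval(48, 222)
llama_low, llama_high = wilson_interval(96, 222)

print(f"Claude problematic rate: {claude_low:.3f} to {claude_high:.3f}")
print(f"Llama problematic rate:  {llama_low:.3f} to {llama_high:.3f}")
```

Because every chatbot answered the same questions, the responses are paired. A formal comparison between chatbots would therefore need a paired method rather than a simple comparison of these intervals.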
Clinical Implications
Clinicians should be cautious about patients relying on publicly available LLM chatbots for medical advice due to variable safety and accuracy. There is a critical need for ongoing development and rigorous validation to improve the clinical reliability of these tools before widespread patient use. Healthcare providers should educate patients on the limitations and risks associated with AI-generated medical information.
Conclusion
This study highlights that millions of patients may be exposed to unsafe medical advice from current large language model chatbots. Enhanced safety measures and further research are essential to mitigate risks and improve patient outcomes when using AI-driven medical advice platforms.