Large language models provide unsafe answers to patient-posed medical questions

By
Rachel L. Draelos
Samina Afreen
Barbara Blasko
Tiffany L. Brazile
Natasha Chase
Dimple Patel Desai
Jessica Evert
Heather L. Gardner
Lauren Herrmann
Aswathy Vaikom House
Stephanie Kass
Marianne Kavan
Kirshma Khemani
Amanda Koire
Lauren M. McDonald
Zahraa Rabeeah
Amy Shah
February 13, 2026

npj Digital Medicine

Overview

This physician-led study assessed the safety of four publicly available large language model chatbots (Claude, Gemini, GPT-4o, and Llama) using 222 patient-posed medical questions. Problematic response rates ranged from 21.6% (Claude) to 43.2% (Llama), and unsafe response rates ranged from 5% (Claude) to 13% (GPT-4o and Llama), showing that the safety of chatbot-provided medical advice varies substantially between models.

Background

Millions of patients use large language model (LLM) chatbots for medical advice, which raises concerns about patient safety. Patients consult these models on primary care topics including internal medicine, women’s health, and pediatrics. Although the tools are easy to access, their clinical reliability and safety remain uncertain. This study evaluates, both quantitatively and qualitatively, how safely leading LLM chatbots respond to patient questions.

Data Highlights

Chatbot	Problematic Response Rate (%)	Unsafe Response Rate (%)
Claude (Anthropic)	21.6	5
Gemini (Google)	Not specified	Not specified
GPT-4o (OpenAI)	Not specified	13
Llama-3.0/3.1-70B (Meta)	43.2	13
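
To give a sense of scale, the reported percentages can be converted into approximate numbers of responses. The sketch below assumes that every chatbot answered all 222 questions and that each rate uses 222 as its denominator; the summary does not state per-chatbot denominators, so the counts are illustrative estimates, not figures reported by the study. The helper name approx_count is hypothetical.

```python
# Illustrative arithmetic only. Assumption: each rate is computed over
# all 222 questions (the summary does not give per-chatbot denominators).
N_QUESTIONS = 222

# (chatbot, metric, reported rate in percent), taken from the table above.
reported_rates = [
    ("Claude", "problematic", 21.6),
    ("Claude", "unsafe", 5.0),
    ("GPT-4o", "unsafe", 13.0),
    ("Llama", "problematic", 43.2),
    ("Llama", "unsafe", 13.0),
]


def approx_count(rate_percent, n_questions=N_QUESTIONS):
    """Estimate how many responses a percentage corresponds to."""
    return round(rate_percent / 100 * n_questions)


for chatbot, metric, rate in reported_rates:
    count = approx_count(rate)
    print(f"{chatbot}: about {count} of {N_QUESTIONS} responses {metric} ({rate}%)")
```

Under this assumption, Claude's 21.6% corresponds to roughly 48 problematic responses and Llama's 43.2% to roughly 96, so the gap between the best and worst chatbot is on the order of 48 responses out of 222.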

Key Findings

Claude demonstrated the lowest rate of problematic (21.6%) and unsafe (5%) responses among evaluated chatbots.
Llama exhibited the highest problematic response rate at 43.2% and an unsafe response rate of 13%.
Unsafe response rates for GPT-4o and Llama were both 13%, the highest among the evaluated chatbots, indicating potential for serious patient harm.
Qualitative analysis revealed that some chatbot answers could lead to delayed or incorrect diagnoses, posing significant clinical risks.
The study reports statistically significant differences in safety profiles across the publicly available LLM chatbots.

Clinical Implications

Clinicians should be cautious about patients relying on publicly available LLM chatbots for medical advice due to variable safety and accuracy. There is a critical need for ongoing development and rigorous validation to improve the clinical reliability of these tools before widespread patient use. Healthcare providers should educate patients on the limitations and risks associated with AI-generated medical information.

Conclusion

This study shows that millions of patients may be exposed to unsafe medical advice from current large language model chatbots. Enhanced safety measures and further research are essential to mitigate these risks and improve outcomes for patients who use AI-driven medical advice platforms.