Large language models provide unsafe answers to patient-posed medical questions

1. The study evaluates the safety of four large language model chatbots in providing medical advice to patients.

2. A total of 888 responses to 222 medical questions were analyzed, revealing significant differences in safety among the chatbots.

3. The rate of problematic responses ranged from 21.6% for Claude to 43.2% for Llama.

4. Unsafe responses were found in 5% of Claude's answers, while GPT-4o and Llama each had an unsafe-response rate of 13%.

5. The findings suggest that many patients may receive unsafe medical advice from these chatbots, and that further safety improvements are needed.

npj Digital Medicine

by Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany L. Brazile, Natasha Chase, Dimple Patel Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah
February 13, 2026
