Large language models provide unsafe answers to patient-posed medical questions - Report - MDSpire

By Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany L. Brazile, Natasha Chase, Dimple Patel Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah

February 13, 2026

Clinical Report: Safety Evaluation of Large Language Models for Patient Medical Advice

Overview

This physician-led study assessed the safety of four publicly available large language model chatbots (Claude, Gemini, GPT-4o, and Llama) using 222 patient-posed medical questions. Problematic response rates ranged from 21.6% (Claude) to 43.2% (Llama), and unsafe response rates ranged from 5% (Claude) to 13% (GPT-4o and Llama), showing substantial variability between chatbots and a real potential for harm from chatbot-provided medical advice.

Background

Millions of patients use large language model (LLM) chatbots for medical advice, and that use is growing, which raises concerns about patient safety. Patients ask these models about primary care topics including internal medicine, women’s health, and pediatrics. Despite their accessibility, the clinical reliability and safety of these tools remain uncertain. This study evaluates, both quantitatively and qualitatively, the safety of leading LLM chatbots in responding to patient inquiries.

Data Highlights

Chatbot                     Problematic response rate (%)   Unsafe response rate (%)
Claude (Anthropic)          21.6                            5
Gemini (Google)             Not specified                   Not specified
GPT-4o (OpenAI)             Not specified                   13
Llama-3.0/3.1-70B (Meta)    43.2                            13
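To give the percentages above a sense of scale, the following minimal sketch converts each reported rate into an approximate number of responses. It assumes that every chatbot answered all 222 questions, so that 222 is the denominator for every rate; the report does not state this explicitly. The names `reported_rates` and `approx_count` are illustrative only. The resulting counts are back-calculated estimates, not figures from the study, and the unsafe rates (5% and 13%) are themselves rounded, so those counts are rougher still.

```python
# Back-of-the-envelope conversion of reported rates into approximate counts.
# ASSUMPTION: each chatbot answered all 222 questions (not stated in the report).
TOTAL_QUESTIONS = 222

# Rates as listed in the table; None marks values the report does not specify.
reported_rates = {
    "Claude": {"problematic": 21.6, "unsafe": 5.0},
    "Gemini": {"problematic": None, "unsafe": None},
    "GPT-4o": {"problematic": None, "unsafe": 13.0},
    "Llama": {"problematic": 43.2, "unsafe": 13.0},
}


def approx_count(rate_percent, total=TOTAL_QUESTIONS):
    """Return the nearest whole number of responses for a percentage rate."""
    if rate_percent is None:
        return None  # keep unspecified values unspecified rather than guessing
    return round(rate_percent / 100 * total)


# Estimated counts per chatbot and category (estimates, not study data).
approx_counts = {
    model: {kind: approx_count(rate) for kind, rate in rates.items()}
    for model, rates in reported_rates.items()
}

for model, counts in approx_counts.items():
    print(model, counts)
```

Under that assumption, Claude's 21.6% corresponds to roughly 48 problematic responses and Llama's 43.2% to roughly 96.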

Key Findings

  • Claude demonstrated the lowest rate of problematic (21.6%) and unsafe (5%) responses among evaluated chatbots.
  • Llama exhibited the highest problematic response rate at 43.2% and an unsafe response rate of 13%.
  • GPT-4o and Llama had the highest unsafe response rates, both 13%, indicating potential for serious patient harm.
  • Qualitative analysis revealed that some chatbot answers could lead to delayed or incorrect diagnoses, posing significant clinical risks.
  • The study reports statistically significant differences in safety profiles across publicly available LLM chatbots.
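The report does not say which statistical test the authors used. As a rough illustration of how a difference such as 21.6% versus 43.2% could be checked, the sketch below runs a two-sided two-proportion z-test on counts back-calculated from the reported rates (about 48 and 96 problematic responses out of 222). Both the counts and the choice of test are assumptions made for this example. Because every chatbot answered the same questions, the responses are paired, and a paired method such as McNemar's test would likely suit the design better; this sketch ignores the pairing.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# HYPOTHETICAL counts back-calculated from 21.6% and 43.2% of 222 questions;
# they are estimates for illustration, not data taken from the study.
z, p_value = two_proportion_z_test(48, 222, 96, 222)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

With these estimated counts the p-value falls well below 0.001, which is consistent with the report's statement that the differences are statistically significant but does not reproduce the authors' analysis.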

Clinical Implications

Clinicians should be cautious about patients relying on publicly available LLM chatbots for medical advice due to variable safety and accuracy. There is a critical need for ongoing development and rigorous validation to improve the clinical reliability of these tools before widespread patient use. Healthcare providers should educate patients on the limitations and risks associated with AI-generated medical information.

Conclusion

This study highlights that millions of patients may be exposed to unsafe medical advice from current large language model chatbots. Enhanced safety measures and further research are essential to mitigate risks and improve patient outcomes when using AI-driven medical advice platforms.
