Large language models provide unsafe answers to patient-posed medical questions - Summary - MDSpire

By Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany L. Brazile, Natasha Chase, Dimple Patel Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah

February 13, 2026

Objective:

To assess the safety of four publicly available chatbots in providing medical advice to patients, focusing on their reliability and potential risks.

Key Findings:
  • Safety differed to a statistically significant degree among the four chatbots, with implications for patient care.
  • Problematic response rates ranged from 21.6% (Claude) to 43.2% (Llama), indicating varying levels of reliability.
  • Unsafe responses varied from 5% (Claude) to 13% (GPT-4o, Llama), raising concerns about patient safety.
  • Qualitative analysis revealed responses that could potentially lead to serious patient harm, underscoring the need for caution.
Interpretation:

The findings indicate that millions of patients may be receiving unsafe medical advice from chatbots, highlighting an urgent need for improvements in clinical safety protocols.

Limitations:
  • The study only evaluated four specific chatbots, which may not represent the broader landscape of available models.
  • Responses were limited to primary care topics, so the results may not generalize to other kinds of medical inquiries.
  • The selection of chatbots may introduce bias, as the chosen models may not reflect the full range of capabilities and safety profiles.
Conclusion:

Further work is necessary to enhance the clinical safety of large language models used in medical advice.
