Large language models provide unsafe answers to patient-posed medical questions

By Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany L. Brazile, Natasha Chase, Dimple Patel Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah

February 13, 2026

Clinical Scorecard: Evaluating the Safety of Large Language Models in Responding to Patient Medical Inquiries

At a Glance

Category | Detail
Condition | Patient medical inquiries seeking primary care advice
Key Mechanisms | Use of large language model (LLM) chatbots to provide medical advice
Target Population | Patients posing medical questions on internal medicine, women’s health, and pediatrics
Care Setting | Primary care and patient-facing digital health environments

Key Highlights

  • Four publicly available LLM chatbots (Claude, Gemini, GPT-4o, Llama) were evaluated on 222 patient medical questions.
  • Problematic response rates ranged from 21.6% (Claude) to 43.2% (Llama), and unsafe response rates reached as high as 13%.
  • Unsafe chatbot advice has the potential to cause serious patient harm, highlighting the urgent need for improved clinical safety.
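
To make the reported percentages concrete, the short sketch below converts them into approximate response counts. It assumes each chatbot answered all 222 questions, so that 222 is the denominator for every rate; the scorecard does not state this explicitly. The names `N_QUESTIONS`, `reported_rates`, and `implied_count` are illustrative only.

```python
# Assumption: each chatbot answered all 222 questions, so 222 is the
# denominator for every reported rate.
N_QUESTIONS = 222

# Rates quoted in the highlights above (percent of responses).
reported_rates = {
    "Claude, problematic": 21.6,
    "Llama, problematic": 43.2,
    "Highest unsafe rate": 13.0,
}


def implied_count(rate_percent, n=N_QUESTIONS):
    """Nearest whole number of responses consistent with a reported rate."""
    return round(rate_percent / 100 * n)


for label, rate in reported_rates.items():
    count = implied_count(rate)
    print(f"{label}: about {count} of {N_QUESTIONS} responses")
```

Under that assumption, the lowest problematic rate corresponds to roughly 48 responses, the highest to roughly 96, and the highest unsafe rate to roughly 29.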

Guideline-Based Recommendations

Diagnosis

  • Do not rely solely on LLM chatbot responses for medical diagnosis due to variable safety and accuracy.

Management

  • Use LLM chatbots cautiously as adjunct tools; verify advice with qualified healthcare professionals.
  • Avoid using publicly available LLM chatbots as primary sources for medical decision-making.

Monitoring & Follow-up

  • Continuously evaluate chatbot responses for safety and accuracy using physician-led frameworks.
  • Monitor patient outcomes when LLM chatbots are used for medical advice to identify potential harms.

Risks

  • High rates of problematic and unsafe responses can lead to delayed or incorrect diagnosis and treatment.
  • Potential for serious patient harm exists from inaccurate or misleading chatbot advice.

Patient & Prescribing Data

Adult patients and caregivers seeking primary care medical advice via chatbots

Millions of patients may receive unsafe or problematic medical advice from publicly available LLM chatbots, necessitating caution and further safety improvements.

Clinical Best Practices

  • Incorporate physician oversight when deploying LLM chatbots for patient medical inquiries.
  • Educate patients on limitations and risks of chatbot-provided medical advice.
  • Prioritize development and validation of LLMs with enhanced clinical safety features before widespread use.
  • Use structured evaluation frameworks to benchmark chatbot safety regularly.
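
As a rough illustration of what regular benchmarking could look like in practice, the sketch below tallies physician ratings into per-chatbot problematic and unsafe rates. The record layout, the chatbot names, and the ratings themselves are invented for illustration; this is not the study's evaluation framework or its data.

```python
from collections import defaultdict

# Hypothetical physician ratings: (chatbot, problematic?, unsafe?).
# Invented for illustration; not data from the study.
ratings = [
    ("chatbot_a", False, False),
    ("chatbot_a", True, False),
    ("chatbot_a", True, True),
    ("chatbot_a", False, False),
    ("chatbot_b", True, True),
    ("chatbot_b", False, False),
]


def safety_rates(rows):
    """Return the share of problematic and unsafe responses per chatbot."""
    totals = defaultdict(lambda: {"n": 0, "problematic": 0, "unsafe": 0})
    for chatbot, problematic, unsafe in rows:
        entry = totals[chatbot]
        entry["n"] += 1
        entry["problematic"] += int(problematic)
        entry["unsafe"] += int(unsafe)
    return {
        name: {
            "problematic_rate": entry["problematic"] / entry["n"],
            "unsafe_rate": entry["unsafe"] / entry["n"],
        }
        for name, entry in totals.items()
    }


for name, rates in safety_rates(ratings).items():
    print(f"{name}: problematic {rates['problematic_rate']:.1%}, "
          f"unsafe {rates['unsafe_rate']:.1%}")
```

Rerunning the same tally on a fixed question set after each model update would give a simple way to track whether safety is improving or regressing over time.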
