Large language models provide unsafe answers to patient-posed medical questions

By Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany L. Brazile, Natasha Chase, Dimple Patel Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah
February 13, 2026

npj Digital Medicine

Clinical Scorecard: Evaluating the Safety of Large Language Models in Responding to Patient Medical Inquiries

At a Glance

Category	Detail
Condition	Patient medical inquiries seeking primary care advice
Key Mechanisms	Use of large language model (LLM) chatbots to provide medical advice
Target Population	Patients posing medical questions on internal medicine, women’s health, and pediatrics
Care Setting	Primary care and patient-facing digital health environments

Key Highlights

Four publicly available LLM chatbots (Claude, Gemini, GPT-4o, Llama) were evaluated on 222 patient medical questions.
Problematic response rates ranged from 21.6% (Claude) to 43.2% (Llama); unsafe response rates reached as high as 13%.
Unsafe chatbot advice has the potential to cause serious patient harm, highlighting an urgent need for improved clinical safety.
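
The reported percentages can be translated back into approximate question counts. The following is a minimal sketch, assuming each rate is a simple fraction of the 222 questions; the scorecard reports only the percentages, so the counts are rounded estimates and the function name `approx_count` is illustrative.

```python
TOTAL_QUESTIONS = 222  # number of patient questions reported in the scorecard


def approx_count(rate_percent: float, total: int = TOTAL_QUESTIONS) -> int:
    """Return the whole number of questions closest to rate_percent of total."""
    return round(rate_percent / 100 * total)


# Problematic response rates as reported in the highlights above.
reported_problematic = {"Claude": 21.6, "Llama": 43.2}

for chatbot, rate in reported_problematic.items():
    print(f"{chatbot}: about {approx_count(rate)} of {TOTAL_QUESTIONS} questions")
```

Under that assumption, the lowest and highest problematic rates correspond to roughly 48 and 96 of the 222 questions, respectively.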

Guideline-Based Recommendations

Diagnosis

Do not rely solely on LLM chatbot responses for medical diagnosis due to variable safety and accuracy.

Management

Use LLM chatbots cautiously as adjunct tools; verify advice with qualified healthcare professionals.
Avoid using publicly available LLM chatbots as primary sources for medical decision-making.

Monitoring & Follow-up

Continuously evaluate chatbot responses for safety and accuracy using physician-led frameworks.
Monitor patient outcomes when LLM chatbots are used for medical advice to identify potential harms.

Risks

High rates of problematic and unsafe responses can lead to delayed or incorrect diagnosis and treatment.
Potential for serious patient harm exists from inaccurate or misleading chatbot advice.

Patient & Prescribing Data

Adult patients and caregivers seeking primary care medical advice via chatbots

Millions of patients may receive unsafe or problematic medical advice from publicly available LLM chatbots, necessitating caution and further safety improvements.

Clinical Best Practices

Incorporate physician oversight when deploying LLM chatbots for patient medical inquiries.
Educate patients on limitations and risks of chatbot-provided medical advice.
Prioritize development and validation of LLMs with enhanced clinical safety features before widespread use.
Use structured evaluation frameworks to benchmark chatbot safety regularly.
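
One way regular benchmarking might be tallied is sketched below. It assumes each chatbot response has already been given a single physician-assigned label; the label names, the chatbot names, and the function `rates_by_chatbot` are hypothetical and are not taken from the study's framework.

```python
from collections import Counter


def rates_by_chatbot(ratings):
    """Convert (chatbot, label) pairs into per-chatbot percentages for each label."""
    counts = {}
    for chatbot, label in ratings:
        counts.setdefault(chatbot, Counter())[label] += 1
    return {
        bot: {label: 100 * n / sum(c.values()) for label, n in c.items()}
        for bot, c in counts.items()
    }


# Made-up ratings for illustration only; not data from the study.
example_ratings = [
    ("bot_a", "acceptable"), ("bot_a", "acceptable"),
    ("bot_a", "problematic"), ("bot_a", "unsafe"),
    ("bot_b", "acceptable"), ("bot_b", "problematic"),
]

for bot, rates in rates_by_chatbot(example_ratings).items():
    print(bot, rates)
```

Re-running the same tally on a fixed question set after each chatbot update would make changes in problematic or unsafe rates visible over time.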

