Clinical Scorecard: Evaluating the Safety of Large Language Models in Responding to Patient Medical Inquiries
At a Glance
Category | Detail
Condition | Patient medical inquiries seeking primary care advice
Key Mechanisms | Use of large language model (LLM) chatbots to provide medical advice
Target Population | Patients posing medical questions on internal medicine, women’s health, and pediatrics
Care Setting | Primary care and patient-facing digital health environments
Key Highlights
Four publicly available LLM chatbots (Claude, Gemini, GPT-4o, Llama) were evaluated on 222 patient medical questions.
Problematic response rates ranged from 21.6% (Claude) to 43.2% (Llama), and up to 13% of responses were rated unsafe.
Unsafe chatbot advice has the potential to cause serious patient harm, highlighting an urgent need for improved clinical safety.
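To make the reported percentages concrete, the short sketch below converts the lowest and highest problematic response rates into approximate response counts out of the 222 questions. The counts are derived here by rounding and are not figures stated in the scorecard; the function name is illustrative only.

```python
# Assumption: each chatbot answered all 222 questions, so a rate maps
# directly onto a count of responses. Counts are rounded estimates.
TOTAL_QUESTIONS = 222

reported_problematic_rates = {
    "Claude": 0.216,  # lowest reported problematic rate
    "Llama": 0.432,   # highest reported problematic rate
}

def approximate_count(rate: float, total: int = TOTAL_QUESTIONS) -> int:
    """Convert a proportion into the nearest whole number of responses."""
    return round(rate * total)

approximate_counts = {
    chatbot: approximate_count(rate)
    for chatbot, rate in reported_problematic_rates.items()
}

for chatbot, count in approximate_counts.items():
    print(f"{chatbot}: about {count} of {TOTAL_QUESTIONS} responses rated problematic")
```

Under these assumptions, the range corresponds to roughly 48 to 96 problematic responses per chatbot.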
Guideline-Based Recommendations
Diagnosis
Do not rely solely on LLM chatbot responses for medical diagnosis, because their safety and accuracy vary across models.
Management
Use LLM chatbots cautiously as adjunct tools; verify advice with qualified healthcare professionals.
Avoid using publicly available LLM chatbots as primary sources for medical decision-making.
Monitoring & Follow-up
Continuously evaluate chatbot responses for safety and accuracy using physician-led frameworks.
Monitor patient outcomes when LLM chatbots are used for medical advice to identify potential harms.
Risks
High rates of problematic and unsafe responses can lead to delayed or incorrect diagnosis and treatment.
Potential for serious patient harm exists from inaccurate or misleading chatbot advice.
Patient & Prescribing Data
Adult patients and caregivers seeking primary care medical advice via chatbots
Millions of patients may receive unsafe or problematic medical advice from publicly available LLM chatbots, necessitating caution and further safety improvements.
Clinical Best Practices
Incorporate physician oversight when deploying LLM chatbots for patient medical inquiries.
Educate patients on limitations and risks of chatbot-provided medical advice.
Prioritize development and validation of LLMs with enhanced clinical safety features before widespread use.
Use structured evaluation frameworks to benchmark chatbot safety regularly.
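As one way to picture regular benchmarking, the sketch below tallies physician ratings per chatbot and reports problematic and unsafe rates. It is a minimal illustration, not the evaluation framework used in the underlying study: the `Rating` record, the three labels, and the choice to count unsafe responses as a subset of problematic ones are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label set; the scorecard does not define the rating scale.
PROBLEMATIC_LABELS = {"problematic", "unsafe"}

@dataclass(frozen=True)
class Rating:
    chatbot: str
    question_id: int
    label: str  # "acceptable", "problematic", or "unsafe" (assumed labels)

def summarize(ratings):
    """Return per-chatbot totals with problematic and unsafe rates."""
    totals = Counter()
    problematic = Counter()
    unsafe = Counter()
    for rating in ratings:
        totals[rating.chatbot] += 1
        # Assumption: an unsafe response also counts as problematic.
        if rating.label in PROBLEMATIC_LABELS:
            problematic[rating.chatbot] += 1
        if rating.label == "unsafe":
            unsafe[rating.chatbot] += 1
    return {
        chatbot: {
            "n": n,
            "problematic_rate": problematic[chatbot] / n,
            "unsafe_rate": unsafe[chatbot] / n,
        }
        for chatbot, n in totals.items()
    }

# Toy data for illustration only; these are not study results.
example_ratings = [
    Rating("ChatbotA", 1, "acceptable"),
    Rating("ChatbotA", 2, "acceptable"),
    Rating("ChatbotA", 3, "problematic"),
    Rating("ChatbotA", 4, "unsafe"),
]
print(summarize(example_ratings))
```

Rerunning a summary like this on a fixed question set after each model update would show whether safety is improving or regressing over time.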