Patient perspective on large-language model responses to questions about Moyamoya - Report - MDSpire


By Marcella R. Ruppert-Gomez, Joon Hyeok Choi, Steven J. Staffa, Katherine Holste, Jordan Xu, Catherine Stratton, Sophia D. Kocher, Edward R. Smith, and Alfred Pokmeng See

February 26, 2026



Clinical Report: Patient and Clinician Perspectives on LLM Responses for Moyamoya Disease

Overview

This study evaluated two large language models (LLMs), ChatGPT-4o and Gemini 1.5 Flash, for accuracy, safety, and helpfulness in answering common patient questions about moyamoya disease. Although patient respondents rated LLM responses comparably to physician answers, clinicians identified significant omissions concerning risks, urgent symptoms, and recent research.

Background

Moyamoya disease is a rare cerebrovascular disorder that requires specialized knowledge for diagnosis and management. Patients often seek accessible information online, including from artificial intelligence chatbots. Large language models have become widely used sources of health information, but their reliability and safety for complex neurological conditions remain uncertain. Evaluating LLM responses against clinical standards is therefore essential for understanding their role in patient education and care.

Data Highlights

Metric | ChatGPT | Gemini | p-value
--- | --- | --- | ---
Number of response sets evaluated by community | 27 | 20 |
Responses reported as "short" | 1.2% | 20.8% | <0.001
Failure to address risks of procedures/medications | 38% | 28.6% |
Omission of when to consult a medical professional | 27.2% | 40.8% |
Community rating of responses as similar to or better than physician | 72.2% (47.8% similar + 24.4% better) | 71.4% (49% similar + 22.4% better) |
Clinician rating: failure to address recent advances | 57.5% | 62.5% |
Clinician rating: failure to address urgent symptoms | 70.0% | 70.0% |

Key Findings

  • ChatGPT responses were significantly less likely than Gemini responses to be "short" (1.2% vs 20.8%, p < 0.001).
  • Both LLMs frequently failed to discuss potential risks associated with procedures and medications they mentioned (ChatGPT 38%, Gemini 28.6%).
  • Omission of guidance on when self-care is insufficient and medical consultation is needed was common (ChatGPT 27.2%, Gemini 40.8%).
  • Community respondents rated LLM answers as similar to or somewhat better than physician-provided information in over 70% of cases.
  • Clinicians noted that LLM responses often lacked coverage of recent research advances (ChatGPT 57.5%, Gemini 62.5%) and failed to highlight urgent symptoms requiring referral (both 70%).
  • Overall, LLMs provide accessible information but have important safety and completeness limitations.

Clinical Implications

Clinicians should be aware that, although patients may perceive LLMs as helpful and comparable to physician advice, these models frequently omit critical safety information and often fail to emphasize urgent clinical signs. Healthcare providers should encourage patients to use LLM-generated information cautiously and reinforce the importance of professional evaluation for symptom changes or treatment decisions. Further refinement of LLMs is needed to improve accuracy and safety in complex neurological diseases such as moyamoya.

Conclusion

Large language models offer accessible moyamoya disease information that patients find comparable to physician responses; however, significant gaps in safety, risk communication, and up-to-date clinical guidance remain. Careful integration with clinical oversight is essential to optimize patient education and safety.

