Leveraging simulation to provide a practical framework for estimating the novel scope of risk of large language models in healthcare

By
Mark Kalinich
James Luccarelli
John Santa Maria, Jr
Frank Moss
John Torous
June 24, 2026

BMJ Mental Health

Overview

This study demonstrates a simulation-based methodology for assessing the risks of large language models deployed as software as a medical device (LLM-SaMD). It highlights how much model performance varies across different safety tasks.

Background

Large language models (LLMs) are increasingly integrated into healthcare, but their probabilistic outputs can lead to significant patient safety concerns. Existing medical device risk management frameworks may not fully address the unique risks posed by LLMs.

Data Highlights

Task	P1 Range	P2 Range
Suicidal Ideation	1.1×10⁻⁸ to 1.6×10⁻⁴	4.9×10⁻⁵ to 5.1×10⁻³
Therapy Request	Varied	Varied
Therapy-like Interaction Detection	Varied	Varied

Key Findings

Fourteen open-source LLMs were evaluated on three safety-classification tasks.
Model performance improved with size, particularly in generating neutral and non-therapeutic content.
Frequent errors were noted in detecting suicidal ideation and therapy-like interactions.
Estimated hazard probabilities (P1 and P2) varied widely across tasks; for suicidal ideation alone, P1 ranged from 1.1×10⁻⁸ to 1.6×10⁻⁴ and P2 from 4.9×10⁻⁵ to 5.1×10⁻³.
Simulation can link model failure modes to pathways of harm, aiding in risk assessment.
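
As a minimal sketch of how simulation might link a model failure mode to a pathway of harm, the Python below assumes P1 and P2 follow the usual medical device risk management reading (P1: probability that a hazardous situation occurs; P2: probability that the hazardous situation leads to harm), which this summary does not spell out. The function name simulate_p1 and every input rate are hypothetical placeholders, not values or methods taken from the study.

```python
import random


def simulate_p1(n_trials, prevalence, miss_rate, seed=0):
    """Monte Carlo estimate of P1, the chance a hazardous situation arises.

    Assumed pathway (hypothetical): a message expresses suicidal ideation
    AND the safety classifier fails to flag it.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    hazards = 0
    for _ in range(n_trials):
        at_risk = rng.random() < prevalence  # message contains ideation
        missed = rng.random() < miss_rate    # classifier misses it
        if at_risk and missed:
            hazards += 1
    return hazards / n_trials


# Hypothetical inputs, chosen only to make the sketch runnable.
PREVALENCE = 0.05  # share of messages expressing ideation (assumption)
MISS_RATE = 0.10   # classifier false-negative rate (assumption)
P2 = 0.01          # P(harm | hazardous situation) (assumption)

p1_hat = simulate_p1(200_000, PREVALENCE, MISS_RATE)
risk_hat = p1_hat * P2  # probability of harm = P1 x P2

print(f"estimated P1: {p1_hat:.5f}")
print(f"estimated probability of harm: {risk_hat:.2e}")
```

With independent events the simulated P1 should land close to the analytic product PREVALENCE × MISS_RATE; the simulation becomes more useful once the pathway includes dependencies that are hard to multiply out by hand.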

Clinical Implications

Simulation-based risk estimation offers a method for evaluating the safety of LLM-SaMD in various clinical contexts.

Conclusion

Simulation offers a practical way to estimate the risks that LLMs pose in healthcare, where existing medical device risk management frameworks may fall short.
