Clinical Report: Assessing the Risk of Advanced Language Models in Healthcare
Overview
This study evaluates racial and gender bias in clinical vignettes generated by two reasoning large language models (LLMs), o3-mini and DeepSeek-R1. Both models significantly misrepresent demographic distributions in the content they generate.
Background
Integrating LLMs into healthcare could enhance clinical decision-making, but it also risks perpetuating bias. Previous studies have shown that LLMs can generate content that overrepresents certain racial and gender groups in medical conditions stereotypically associated with them.
Data Highlights
A total of 36,000 unique clinical vignettes were generated. Their demographic distributions misrepresented both race and gender across a range of medical conditions.
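The report does not spell out how misrepresentation is computed. One plausible reading, offered here only as an assumption, is the signed relative difference between a group's share in the generated vignettes and its share in the epidemiological baseline, summarized across conditions by the median and by the fraction of conditions with a positive value. A minimal Python sketch of that reading follows; the function names, condition labels, and all numbers are hypothetical and are not taken from the study.

```python
from statistics import median


def relative_misrepresentation(generated_share, baseline_share):
    """Signed relative difference, in percent, between a group's share in
    generated vignettes and its baseline share (assumed definition)."""
    if baseline_share <= 0:
        raise ValueError("baseline_share must be positive")
    return 100.0 * (generated_share - baseline_share) / baseline_share


def summarize(shares_by_condition):
    """Return (median misrepresentation in percent,
    percent of conditions where the group is overrepresented).

    shares_by_condition maps a condition name to a
    (generated_share, baseline_share) pair.
    """
    values = [
        relative_misrepresentation(generated, baseline)
        for generated, baseline in shares_by_condition.values()
    ]
    overrepresented = sum(1 for v in values if v > 0)
    return median(values), 100.0 * overrepresented / len(values)


# Hypothetical shares for one demographic group across three conditions.
example = {
    "condition_a": (0.30, 0.20),
    "condition_b": (0.15, 0.20),
    "condition_c": (0.24, 0.20),
}
median_pct, pct_overrepresented = summarize(example)
print(f"median misrepresentation: {median_pct:.1f}%")
print(f"conditions with overrepresentation: {pct_overrepresented:.1f}%")
```

Under this convention a positive value means the group appears more often in generated vignettes than the baseline predicts, and a negative value means it appears less often.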
Key Findings
For o3-mini, median misrepresentation for Black patients was 44%, with overrepresentation in 78% of conditions.
For DeepSeek-R1, median misrepresentation for Black patients was 31%, with overrepresentation in 89% of conditions.
Both models also misrepresented gender: misrepresentation for female patients was -27% for o3-mini and -23% for DeepSeek-R1, where the negative sign indicates underrepresentation.
χ² goodness-of-fit tests confirmed that generated demographic distributions differed significantly from epidemiological baselines for all conditions.
Both models overrepresented Black populations in conditions stereotypically associated with them.
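The goodness-of-fit comparison mentioned above can be sketched as follows. This is an illustrative pure-Python version, not the study's analysis code: it computes the Pearson χ² statistic for observed category counts against baseline proportions and derives the p-value from the closed-form upper-tail series for integer degrees of freedom. The function names, category counts, and baseline proportions are hypothetical.

```python
import math


def chi_square_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with
    integer degrees of freedom, using the closed-form finite series."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series.
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.erfc(chi / math.sqrt(2.0))
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * total


def chi_square_gof(observed_counts, baseline_proportions):
    """Return (statistic, degrees of freedom, p-value) for a goodness-of-fit
    test of observed counts against baseline proportions."""
    if len(observed_counts) != len(baseline_proportions):
        raise ValueError("counts and proportions must have the same length")
    if len(observed_counts) < 2:
        raise ValueError("need at least two categories")
    if abs(sum(baseline_proportions) - 1.0) > 1e-9:
        raise ValueError("baseline proportions must sum to 1")
    if any(p <= 0 for p in baseline_proportions):
        raise ValueError("baseline proportions must be positive")
    n = sum(observed_counts)
    expected = [n * p for p in baseline_proportions]
    statistic = sum(
        (obs - exp) ** 2 / exp for obs, exp in zip(observed_counts, expected)
    )
    df = len(observed_counts) - 1
    return statistic, df, chi_square_sf(statistic, df)


# Hypothetical counts of four demographic categories in 1,000 vignettes
# for one condition, compared with a hypothetical baseline.
observed = [400, 300, 200, 100]
baseline = [0.60, 0.13, 0.19, 0.08]
statistic, df, p_value = chi_square_gof(observed, baseline)
print(f"chi-square = {statistic:.2f}, df = {df}, p = {p_value:.3g}")
```

In practice a statistics library would normally be used for the p-value; the series is written out here only to keep the sketch free of dependencies. A test of this kind is run once per condition, so a multiple-comparison correction may be appropriate; the report does not say whether one was applied.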
Clinical Implications
The findings highlight the need to evaluate LLM outputs carefully in clinical settings so that they do not reinforce racial and gender stereotypes. Ongoing monitoring and bias assessment are essential to keep these technologies from exacerbating health disparities.
Conclusion
The study highlights demographic misrepresentation in clinical content generated by reasoning LLMs.