Evaluating the Potential of Reasoning Large Language Models to Perpetuate Racial and Gender Disease Stereotypes in Health Care

By
Joshua J Docking
Lee X Li
Bradley D Menz
Stephen Bacchi
Ashley M Hopkins
Michael J Sorich
May 28, 2026

Journal of Medical Internet Research (JMIR)

Overview

This study evaluates the racial and gender biases present in clinical vignettes generated by reasoning large language models (LLMs), specifically o3-mini and DeepSeek-R1. Findings indicate significant misrepresentation of demographic distributions in generated content.

Background

Integrating large language models (LLMs) into health care could enhance clinical decision-making, but it also risks perpetuating biases. Previous studies have shown that LLMs can generate content that overrepresents certain racial and gender groups in conditions stereotypically associated with them.

Data Highlights

A total of 36,000 unique clinical vignettes were generated, revealing misrepresentation in demographic distributions for both race and gender across various medical conditions.

Key Findings

Median misrepresentation for o3-mini was 44% for Black patients, with overrepresentation in 78% of conditions.
DeepSeek-R1 showed a median misrepresentation of 31% for Black patients, with 89% of conditions exhibiting overrepresentation.
Both models also misrepresented gender, underrepresenting female patients: -27% for o3-mini and -23% for DeepSeek-R1.
χ² goodness-of-fit tests confirmed that generated demographic distributions differed significantly from epidemiological baselines for all conditions.
Both models overrepresented Black populations in conditions stereotypically associated with them.
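
The figures above rest on two calculations: a per-condition misrepresentation percentage and a χ² goodness-of-fit test against an epidemiological baseline. The sketch below illustrates both in Python. It is not the authors' code. It assumes that misrepresentation means the relative difference between the generated share and the baseline share of a group, which this summary does not confirm. The function names and all numbers are hypothetical and chosen only for illustration.

```python
import math
from statistics import median


def relative_misrepresentation(generated_share, baseline_share):
    """Percent difference of the generated share relative to the baseline share.

    ASSUMPTION: this definition is illustrative; the summary does not state
    how the study computed misrepresentation.
    """
    return 100.0 * (generated_share - baseline_share) / baseline_share


def chi_square_gof(observed_counts, expected_shares):
    """Pearson chi-square statistic for observed counts vs. expected proportions."""
    total = sum(observed_counts)
    stat = 0.0
    for observed, share in zip(observed_counts, expected_shares):
        expected = total * share
        stat += (observed - expected) ** 2 / expected
    return stat


def p_value_one_df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    Valid only for two categories (for example female/male); more categories
    need a general chi-square distribution function.
    """
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical condition: 1,000 vignettes, 350 female, baseline 50% female.
observed = [350, 650]       # female, male (made-up counts)
baseline = [0.5, 0.5]       # made-up epidemiological shares

female_bias = relative_misrepresentation(observed[0] / sum(observed), baseline[0])
stat = chi_square_gof(observed, baseline)
p_value = p_value_one_df(stat)

# A median across conditions summarizes per-condition values (made-up list).
median_bias = median([female_bias, 10.0, 50.0])

print(f"female misrepresentation: {female_bias:.1f}%")
print(f"chi-square statistic: {stat:.1f}, p-value: {p_value:.3g}")
print(f"median across conditions: {median_bias:.1f}%")
```

Under this assumed definition, a negative value means the group appears less often in the generated vignettes than the baseline predicts, and a positive value means it appears more often.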

Clinical Implications

The findings highlight the need for careful evaluation of LLM outputs in clinical settings to prevent the reinforcement of racial and gender stereotypes. Ongoing monitoring and bias assessment are essential to mitigate potential health disparities exacerbated by these technologies.

Conclusion

The study highlights demographic misrepresentation in clinical content generated by reasoning LLMs.
