Evaluating the Potential of Reasoning Large Language Models to Perpetuate Racial and Gender Disease Stereotypes in Health Care - Report - MDSpire

By Joshua J Docking, Lee X Li, Bradley D Menz, Stephen Bacchi, Ashley M Hopkins, Michael J Sorich

May 28, 2026

Clinical Report: Assessing the Risk of Advanced Language Models in Healthcare

Overview

This study evaluates racial and gender biases in clinical vignettes generated by two reasoning large language models (LLMs), o3-mini and DeepSeek-R1. The generated vignettes significantly misrepresented demographic distributions relative to epidemiological baselines.

Background

Integrating large language models (LLMs) into healthcare could enhance clinical decision-making, but it also risks perpetuating biases. Previous studies have shown that LLMs can generate content that overrepresents certain racial and gender groups in conditions stereotypically associated with them.

Data Highlights

A total of 36,000 unique clinical vignettes were generated, revealing misrepresentation in demographic distributions for both race and gender across various medical conditions.

Key Findings

  • Median misrepresentation for o3-mini was 44% for Black patients, with overrepresentation in 78% of conditions.
  • DeepSeek-R1 showed a median misrepresentation of 31% for Black patients, with 89% of conditions exhibiting overrepresentation.
  • Both models demonstrated gender misrepresentation, with o3-mini showing -27% for female patients and DeepSeek-R1 showing -23%.
  • χ² goodness-of-fit tests confirmed that generated demographic distributions differed significantly from epidemiological baselines for all conditions.
  • Both models overrepresented Black populations in conditions stereotypically associated with them.
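
The two quantities reported above, a percentage misrepresentation and a χ² goodness-of-fit test against an epidemiological baseline, can be sketched as follows. This is a minimal illustration, not the study's code. It assumes misrepresentation means the relative difference between the generated share and the baseline share of a group; the report does not state the exact formula. The function names and all counts and shares are hypothetical and are not taken from the study.

```python
import math
from statistics import median


def misrepresentation_pct(generated_share, baseline_share):
    """Relative difference (%) between generated and baseline share.

    ASSUMPTION: this definition is illustrative; the report does not
    give the exact formula used by the authors.
    """
    return 100.0 * (generated_share - baseline_share) / baseline_share


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite correction series
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(math.sqrt(half)) + total


def chi_square_gof(observed_counts, baseline_shares):
    """Chi-square goodness-of-fit of observed counts against baseline shares."""
    n = sum(observed_counts)
    expected = [n * share for share in baseline_shares]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected))
    df = len(observed_counts) - 1
    return stat, chi2_sf(stat, df)


if __name__ == "__main__":
    # Hypothetical condition: 1,000 vignettes, baseline 50% female,
    # but only 400 generated vignettes describe a female patient.
    observed = [400, 600]        # [female, male], made-up counts
    baseline = [0.5, 0.5]        # made-up epidemiological shares
    stat, p_value = chi_square_gof(observed, baseline)
    print("misrepresentation (%):", misrepresentation_pct(400 / 1000, 0.5))
    print("chi-square statistic:", stat, "p-value:", p_value)

    # Median across several hypothetical per-condition values
    print("median (%):", median([-35.0, -20.0, 10.0]))
```

Under this assumed definition, a negative value means a group appears less often in the generated vignettes than the baseline predicts, and a small p-value means the generated distribution is unlikely to have been drawn from the baseline distribution.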

Clinical Implications

The findings highlight the need for careful evaluation of LLM outputs in clinical settings to prevent the reinforcement of racial and gender stereotypes. Ongoing monitoring and bias assessment are essential to mitigate potential health disparities exacerbated by these technologies.

Conclusion

The study highlights demographic misrepresentation in clinical content generated by reasoning LLMs.
