Evaluating the Potential of Reasoning Large Language Models to Perpetuate Racial and Gender Disease Stereotypes in Health Care - Summary - MDSpire

By Joshua J Docking, Lee X Li, Bradley D Menz, Stephen Bacchi, Ashley M Hopkins, Michael J Sorich

May 28, 2026


Objective:

To evaluate whether reasoning large language models (LLMs) exhibit racial and gender biases in generated clinical content.

Key Findings:
  • Both LLMs (o3-mini and DeepSeek-R1) frequently misrepresented the racial and gender distributions of medical conditions.
  • Racial misrepresentation exceeded 20% in 78% of conditions for o3-mini and in 89% of conditions for DeepSeek-R1.
  • Median misrepresentation for Black patients was 44% for o3-mini and 31% for DeepSeek-R1.
  • Gender misrepresentation exceeded 20% in 56% of conditions for o3-mini and in 67% of conditions for DeepSeek-R1.
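
The findings above combine two summary statistics: the share of conditions whose misrepresentation exceeds a 20% threshold, and a median misrepresentation across conditions. The summary does not state how misrepresentation was calculated, so the sketch below assumes it is the absolute gap, in percentage points, between a group's share in model-generated vignettes and its share in reference prevalence data. The function names, condition labels, and all numbers are hypothetical placeholders for illustration, not study data.

```python
from statistics import median


def gap(generated_share: float, reference_share: float) -> float:
    """Absolute gap in percentage points (assumed definition, not the study's)."""
    return abs(generated_share - reference_share)


def share_over_threshold(generated: dict, reference: dict, threshold: float) -> float:
    """Percentage of conditions whose gap exceeds the threshold."""
    over = [c for c in generated if gap(generated[c], reference[c]) > threshold]
    return 100 * len(over) / len(generated)


def median_gap(generated: dict, reference: dict) -> float:
    """Median gap across all conditions."""
    return median(gap(generated[c], reference[c]) for c in generated)


# Placeholder values: a group's percentage share per condition.
generated = {"condition_a": 80.0, "condition_b": 15.0, "condition_c": 40.0, "condition_d": 55.0}
reference = {"condition_a": 50.0, "condition_b": 12.0, "condition_c": 65.0, "condition_d": 50.0}

print(share_over_threshold(generated, reference, 20.0))  # 50.0
print(median_gap(generated, reference))                  # 15.0
```

With these placeholder inputs, two of the four conditions have gaps above 20 points (30 and 25), so the share is 50%. A relative rather than absolute gap would be an equally plausible reading of the summary and would give different values.
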
Interpretation:

The results indicate that reasoning LLMs do not improve upon previous models in terms of demographic representation, often reinforcing stereotypes.

Limitations:
  • The study focused on a US context, which may not generalize globally.
  • The demographic categories used do not capture all racial and ethnic groups.
Conclusion:

Both models consistently overrepresented certain demographic groups in the clinical vignettes they generated.
