Evaluating the Potential of Reasoning Large Language Models to Perpetuate Racial and Gender Disease Stereotypes in Health Care
By Joshua J Docking, Lee X Li, Bradley D Menz, Stephen Bacchi, Ashley M Hopkins, Michael J Sorich
May 28, 2026
Objective:
To evaluate whether reasoning large language models (LLMs) exhibit racial and gender biases in generated clinical content.
Key Findings:
- Both LLMs (o3-mini and DeepSeek-R1) frequently misrepresented the racial and gender distributions of medical conditions.
- For o3-mini, 78% of conditions showed over 20% racial misrepresentation; for DeepSeek-R1, 89%.
- Median misrepresentation for Black patients was 44% for o3-mini and 31% for DeepSeek-R1.
- Gender misrepresentation was also common: 56% of conditions for o3-mini and 67% for DeepSeek-R1 exceeded 20%.
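The percentages above can be read as per-condition gaps between the demographic mix in generated cases and a reference distribution. This summary does not state how misrepresentation was calculated, so the sketch below assumes the absolute difference in percentage points. The function names, the 20-point default threshold, and the example numbers are all hypothetical illustrations, not study data.

```python
from statistics import median

# Hypothetical example values (NOT study data): for one demographic group,
# the percentage of generated cases vs. the percentage in a reference source.
EXAMPLE = {
    "condition_a": {"generated": 70.0, "reference": 30.0},
    "condition_b": {"generated": 15.0, "reference": 10.0},
    "condition_c": {"generated": 5.0, "reference": 40.0},
}


def misrepresentation(generated_pct, reference_pct):
    """Absolute gap in percentage points (an assumed definition)."""
    return abs(generated_pct - reference_pct)


def summarize(data, threshold=20.0):
    """Return the share of conditions whose gap exceeds the threshold,
    the median gap, and the names of the conditions that exceed it."""
    gaps = {
        name: misrepresentation(v["generated"], v["reference"])
        for name, v in data.items()
    }
    exceeding = [name for name, gap in gaps.items() if gap > threshold]
    return {
        "share_exceeding_pct": 100.0 * len(exceeding) / len(gaps),
        "median_gap": median(gaps.values()),
        "exceeding": exceeding,
    }


summary = summarize(EXAMPLE)
```

With these made-up inputs, two of the three conditions exceed the threshold. A relative (ratio-based) definition would give different figures, so the choice of definition matters when comparing models.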
Interpretation:
The results indicate that reasoning LLMs do not improve on previous models in demographic representation and often reinforce stereotypes.
Limitations:
- The study focused on a US context, so the findings may not generalize globally.
- The demographic categories used do not capture all racial and ethnic groups.
Conclusion:
Both reasoning LLMs consistently overrepresented certain demographic groups in the clinical vignettes they generated.