HealthContradict: Evaluating biomedical knowledge conflicts in language models

By
Boya Zhang
Alban Bornet
Rui Yang
Nan Liu
Douglas Teodoro
January 21, 2026

npj Digital Medicine

Overview

Language models often generate plausible but incorrect biomedical information due to conflicts between their parametric knowledge and contextual evidence. The HealthContradict dataset was developed to systematically assess how models handle contradictory biomedical information and identify factual answers supported by scientific evidence.

Background

Language models are increasingly used to provide medical advice but are prone to hallucinations and misinformation, especially when faced with conflicting biomedical knowledge. Existing mitigation strategies focus on retrieval relevance or static fact-checking but often overlook contradictions arising from multiple sources or between model knowledge and context. Biomedical knowledge conflicts are complex due to specialized terminology and nuanced syntax, complicating the models' ability to generate accurate answers. To address this, HealthContradict offers a benchmark with contradictory document pairs and verified factual answers to evaluate model performance in realistic biomedical scenarios.

Data Highlights

The HealthContradict dataset comprises 920 unique instances, each containing a health-related question, two contradictory documents, and a scientifically supported factual answer. Models ranging from 1 billion to 8 billion parameters, including general-domain and biomedical-specific variants, were evaluated on their ability to resolve these conflicts and provide accurate responses.
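As a rough illustration of the instance layout described above, the sketch below models one record and builds prompts under several context conditions. The class name `ConflictInstance`, its field names, the four condition labels, and the prompt wording are all assumptions made for illustration; they are not taken from the dataset's actual schema or from the authors' evaluation code. The example strings are placeholders, not dataset content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConflictInstance:
    """One benchmark record (hypothetical field names, not the real schema)."""
    question: str
    supporting_doc: str     # document that agrees with the factual answer
    contradicting_doc: str  # document that contradicts the factual answer
    factual_answer: str     # scientifically supported answer


def build_prompts(inst: ConflictInstance) -> dict[str, str]:
    """Return one prompt per (assumed) context condition."""
    contexts = {
        "no_context": [],
        "supporting_only": [inst.supporting_doc],
        "contradicting_only": [inst.contradicting_doc],
        "both": [inst.supporting_doc, inst.contradicting_doc],
    }
    prompts = {}
    for name, docs in contexts.items():
        # Number the documents so the model can tell them apart.
        parts = [f"Document {i}: {d}" for i, d in enumerate(docs, start=1)]
        parts.append(f"Question: {inst.question}")
        parts.append("Answer:")
        prompts[name] = "\n".join(parts)
    return prompts


# Placeholder values only; real instances would come from the dataset.
example = ConflictInstance(
    question="<health question>",
    supporting_doc="<document A>",
    contradicting_doc="<document B>",
    factual_answer="<answer>",
)
prompts = build_prompts(example)
```

Keeping the no-context condition alongside the document conditions is one way to separate what a model answers from parametric knowledge alone from what it answers once contextual evidence is supplied.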

Key Findings

Language models exhibit confusion when presented with conflicting biomedical information from parametric and contextual sources.
Models tend to rely on parametric knowledge when contextual information is self-contradictory but can be influenced by coherent contextual knowledge.
Current mitigation methods often address either parametric or contextual conflicts but rarely both simultaneously.
Biomedical knowledge conflicts are more complex than general-domain conflicts due to domain-specific language and longer, nuanced texts.
HealthContradict enables evaluation of model robustness in handling real-world biomedical contradictions with verified factual answers.
Smaller models (1B parameters) and general-domain models resolve knowledge conflicts less reliably than larger and biomedical-specific models.
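One way to quantify the behaviour summarized in these findings is to compare a model's accuracy across context conditions. The sketch below is a minimal example under stated assumptions: the function names, the condition labels, and the exact-match scoring rule are hypothetical, and `toy_model` is a stand-in rule rather than a real language model. The records are made-up toy data, not results from the study.

```python
from collections import defaultdict
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def accuracy_by_condition(
    records: Iterable[tuple[str, str, str]],
    answer_fn: Callable[[str], str],
) -> dict[str, float]:
    """Exact-match accuracy per condition.

    Each record is (condition, prompt, gold_answer). Exact match is an
    assumed scoring rule, not necessarily the one used in the paper.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, prompt, gold in records:
        total[condition] += 1
        if normalize(answer_fn(prompt)) == normalize(gold):
            correct[condition] += 1
    return {c: correct[c] / total[c] for c in total}


def toy_model(prompt: str) -> str:
    """Toy stand-in for a model: says "no" if the prompt contains "not"."""
    return "no" if "not" in prompt else "yes"


# Made-up records for illustration only.
records = [
    ("no_context", "Question: Q1", "yes"),
    ("no_context", "Question: Q2", "no"),
    ("contradicting_only", "Document 1: it is not so. Question: Q1", "yes"),
    ("supporting_only", "Document 1: it is so. Question: Q1", "yes"),
]
scores = accuracy_by_condition(records, toy_model)
```

A gap between the no-context score and the contradicting-document score would indicate how strongly misleading context pulls a model away from its parametric answer.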

Clinical Implications

Clinicians and healthcare professionals should be cautious when relying on language model outputs for medical advice, especially when contradictory information exists. Incorporating context-aware and truthfulness-focused strategies is essential to improve model reliability in biomedical applications. The HealthContradict benchmark provides a valuable tool for developing and validating models that better handle conflicting biomedical knowledge, ultimately enhancing patient safety.

Conclusion

Addressing knowledge conflicts in biomedical language models is critical to ensuring accurate and trustworthy medical information. The HealthContradict dataset offers a novel framework to evaluate and improve model performance in the presence of contradictory evidence, advancing the safe deployment of AI in healthcare.

Original Source(s)

HealthContradict: Evaluating biomedical knowledge conflicts in language models
