Low-energy Small Language Models with Retrieval-Augmented Generation can Surpass Large-Model Performance in Rheumatology

By
Felde, Sabine
Buchkremer, Rüdiger
Chehab, Gamal
Thielscher, Christian
Distler, Jörg HW
Schneider, Matthias
Richter, Jutta G
April 23, 2026

Frontiers in Medicine

Overview

This study evaluates small language models (SLMs) enhanced with retrieval-augmented generation (RAG) on diagnostic and therapeutic tasks in rheumatology. An SLM with RAG matched or slightly exceeded the best larger models run without RAG (diagnostic F1 72% vs. 71%; therapeutic F1 73% vs. 72%) while requiring fewer computational resources.

Background

Artificial intelligence is increasingly used for clinical decision support, particularly in complex fields such as rheumatology. Large language models (LLMs) are computationally demanding and can produce inaccurate output, which makes smaller models with RAG a promising alternative. Establishing how well these smaller models perform matters for both clinical outcomes and resource efficiency.
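
As a rough illustration of what "with RAG" means here, the sketch below ranks reference passages against a question and prepends the best match to the prompt before it would be sent to a model. It is a toy under stated assumptions: the passages are placeholders, the bag-of-words cosine ranking stands in for whatever retriever the study used, and the names `retrieve` and `build_prompt` are hypothetical.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:()?!"

def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [t for t in tokens if t]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        passages,
        key=lambda p: cosine(query_vec, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, passages: list[str], k: int = 1) -> str:
    """Prepend the retrieved passages to the question as context."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above."
    )

# Placeholder corpus (assumption): stands in for real guideline text.
PASSAGES = [
    "Placeholder passage on gout management.",
    "Placeholder passage on lupus diagnosis.",
    "Placeholder passage on vasculitis classification.",
]

QUERY = "What is the recommended management for gout?"
prompt = build_prompt(QUERY, PASSAGES, k=1)
print(prompt)
```

The language model only sees the question together with the retrieved context, which is why a smaller model can draw on domain knowledge it was never trained to memorize.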

Data Highlights

Model	Diagnostic F1 Score	Therapeutic F1 Score	RAGAS Score
Mixtral-8x7b-32768 with RAG	72%	73%	81%
Nemotron-70b without RAG	71%	N/A	N/A
Qwen-Turbo without RAG	N/A	72%	N/A
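
For readers unfamiliar with the metric, an F1 score is the harmonic mean of precision and recall. The sketch below computes it for one case by comparing a set of model-suggested diagnoses against a reference set. The example labels are hypothetical, and the study's own scoring procedure (how answers were matched and averaged) is not described in this summary, so this shows only the general formula.

```python
def f1_score(predicted: set[str], reference: set[str]) -> float:
    """Set-based F1: harmonic mean of precision and recall."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case.
        return 0.0
    precision = true_positives / len(predicted)  # share of suggestions that are correct
    recall = true_positives / len(reference)     # share of reference items that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical case: labels are illustrative, not taken from the study.
predicted = {"rheumatoid arthritis", "psoriatic arthritis", "gout"}
reference = {"rheumatoid arthritis", "gout", "osteoarthritis", "lupus"}

score = f1_score(predicted, reference)
print(f"F1 = {score:.2f}")  # precision 2/3, recall 2/4 -> F1 = 0.57
```

Because F1 penalizes both missed items and spurious suggestions, a score of 72% does not mean that 72% of cases were answered correctly.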

Key Findings

Mixtral-8x7b-32768 with RAG achieved the highest diagnostic (72%) and therapeutic (73%) F1 scores.
Nemotron-70b reached a diagnostic F1 score of 71% without RAG.
Qwen-Turbo reached a therapeutic F1 score of 72% without RAG.
Mixtral-8x7b-32768 with RAG also recorded the highest RAGAS score (81%).
Performance varied significantly across models and configurations.
Clinically relevant errors were noted across all models, necessitating expert oversight.

Clinical Implications

The findings suggest that smaller language models with RAG can serve as effective tools for clinical decision support in rheumatology, potentially reducing computational costs. However, the presence of clinically relevant errors underscores the importance of expert validation in their application.

Conclusion

SLMs enhanced with RAG represent a viable alternative to larger models in clinical settings, offering comparable performance with reduced resource demands. Continued evaluation and oversight are essential for safe implementation.