Using a fine-tuned large language model for symptom-based depression evaluation

By Samantha Weber, Nicolas Deperrois, Robert Heun, Laura Frühschütz, Anna Monn, Stephanie Homan, Andrea Häfliger, Erich Seifritz, Tobias Kowatsch, Birgit Kleim, Sebastian Olbrich
October 7, 2025

npj Digital Medicine

Overview

A fine-tuned German BERT-based large language model (MADRS-BERT) predicts individual Montgomery-Åsberg Depression Rating Scale (MADRS) item scores from clinical interview transcripts. Across items, the model achieved mean absolute errors between 0.7 and 1.0 points and accuracies of 79–88% under a ±1 point tolerance, closely matching clinician ratings and significantly outperforming the base model.

Background

Major depressive disorder is a leading global health concern, and symptom assessment often relies on structured clinical interviews such as the MADRS. Traditional AI methods are limited in how well they capture the semantic context of language-based data, which is critical for nuanced evaluation of depressive symptoms. Large language models (LLMs) have transformed natural language processing and show promise in mental health applications, but their ability to replicate specialized clinical reasoning remains limited. This study explores fine-tuning a German BERT-based LLM to predict continuous depressive symptom severity scores from patient interview transcripts.

Data Highlights

Model	Mean Absolute Error (MAE)	Accuracy Range (%)
Fine-tuned MADRS-BERT	0.7–1.0 across items	79–88 (flexible criteria)
Base BERT model	Not specified; predicted all zeros	0 (no specificity)
Baseline mean regression	~0.9 points higher than MADRS-BERT	Not specified
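
To make the two comparison rows concrete, here is a minimal Python sketch of how an all-zeros predictor and a mean-regression baseline could be scored with mean absolute error. It assumes item ratings on the MADRS 0 to 6 scale. The ratings are invented for illustration and are not study data, the function names are hypothetical, and the study's own evaluation code may differ.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def zero_baseline(n):
    """Mimics a model that predicts a score of zero for every transcript."""
    return [0.0] * n


def mean_baseline(train_scores, n):
    """Predicts the mean training score for every transcript."""
    return [mean(train_scores)] * n


# Hypothetical clinician ratings for one item (0-6 scale); not study data.
train_scores = [0, 1, 2, 3, 4, 5, 6]
test_scores = [1, 3, 5, 2]

mae_zero = mean_absolute_error(test_scores, zero_baseline(len(test_scores)))
mae_mean = mean_absolute_error(
    test_scores, mean_baseline(train_scores, len(test_scores))
)
print(f"all-zeros MAE: {mae_zero:.2f}, mean-baseline MAE: {mae_mean:.2f}")
```

A fine-tuned model is only informative if its error falls clearly below both of these reference values on the same held-out ratings.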

Key Findings

Fine-tuning the German BERT model reduced prediction errors by approximately 75% compared to the base model without fine-tuning.
The MADRS-BERT model predicted symptom severity scores with mean absolute errors ranging from 0.7 (inner tension) to 1.0 (emotional numbness).
Accuracies under a flexible evaluation criterion (±1 point tolerance) ranged from 79% to 88% across nine depressive symptom items.
The base BERT model without fine-tuning failed to differentiate symptom severity, predicting zero scores exclusively.
Combining real patient interviews with synthetically generated data ensured balanced score distributions for training and validation.
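
The flexible accuracy criterion and the relative error reduction reported above can be sketched in a few lines of Python. This is an illustration under assumptions: predictions are compared to clinician ratings without rounding (the study may round first), the function names are hypothetical, and every number in the example is invented rather than taken from the study.

```python
def tolerance_accuracy(y_true, y_pred, tolerance=1.0):
    """Share of predictions within `tolerance` points of the reference rating."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / len(y_true)


def relative_error_reduction(base_error, new_error):
    """Fraction by which an error shrinks relative to a reference error."""
    if base_error <= 0:
        raise ValueError("base_error must be positive")
    return (base_error - new_error) / base_error


# Hypothetical ratings for one item (0-6 scale); not study data.
clinician = [0, 2, 4, 6, 3]
predicted = [1, 2, 2, 5, 3]

flexible_acc = tolerance_accuracy(clinician, predicted, tolerance=1.0)
strict_acc = tolerance_accuracy(clinician, predicted, tolerance=0.0)

# Hypothetical error values, chosen only to show how a 75% reduction arises.
reduction = relative_error_reduction(base_error=2.0, new_error=0.5)
print(f"flexible: {flexible_acc:.0%}, strict: {strict_acc:.0%}, "
      f"reduction: {reduction:.0%}")
```

Setting the tolerance to zero gives the strict criterion, which makes it easy to see how much of the reported accuracy depends on allowing a one-point deviation.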

Clinical Implications

The fine-tuned MADRS-BERT model offers a scalable and accurate tool for automated assessment of depressive symptom severity from natural language clinical interviews. Its performance closely matches clinician ratings, supporting its potential use in clinical decision-making and treatment monitoring, especially in low-resource settings where expert evaluation may be limited. This approach may enhance routine depression screening and longitudinal symptom tracking.

Conclusion

Fine-tuning a large language model on structured clinical interview data enables precise automated assessment of depressive symptoms, demonstrating the promise of LLMs as adjunctive tools in mental health care. This method bridges the gap between natural language processing advances and clinical applicability for depression evaluation.

