Using a fine-tuned large language model for symptom-based depression evaluation

By Samantha Weber, Nicolas Deperrois, Robert Heun, Laura Frühschütz, Anna Monn, Stephanie Homan, Andrea Häfliger, Erich Seifritz, Tobias Kowatsch, Birgit Kleim, and Sebastian Olbrich

October 7, 2025

Refined Large Language Model Accurately Assesses Depression via Symptom Analysis

Overview

A fine-tuned German BERT-based large language model (MADRS-BERT) accurately predicts individual Montgomery-Åsberg Depression Rating Scale (MADRS) item scores from clinical interview transcripts. Across items, the model achieved mean absolute errors between 0.7 and 1.0 points and accuracies of 79–88% within a ±1 point tolerance, closely matching clinician ratings and significantly outperforming the base model without fine-tuning.

Background

Major depressive disorder is a leading global health concern, with symptom assessment often relying on structured clinical interviews such as the MADRS. Traditional AI methods have limitations in capturing the semantic context of language-based data, which is critical for nuanced depressive symptom evaluation. Large language models (LLMs) have revolutionized natural language processing and show promise in mental health applications, but their ability to replicate specialized clinical reasoning remains limited. This study explores fine-tuning a German BERT-based LLM to predict continuous depressive symptom severity scores from patient interview transcripts.

Data Highlights

Model                    | Mean Absolute Error (MAE)                  | Accuracy Range (%)
Fine-tuned MADRS-BERT    | 0.7–1.0 across items                       | 79–88 (flexible criteria)
Base BERT model          | Not specified; predicted all zeros         | 0 (no specificity)
Baseline mean regression | MAE higher by ~0.9 points than MADRS-BERT  | Not specified

Key Findings

  • Fine-tuning the German BERT model reduced prediction errors by approximately 75% compared to the base model without fine-tuning.
  • The MADRS-BERT model predicted symptom severity scores with mean absolute errors ranging from 0.7 (inner tension) to 1.0 (emotional numbness).
  • Accuracies under a flexible evaluation criterion (±1 point tolerance) ranged from 79% to 88% across nine depressive symptom items.
  • The base BERT model without fine-tuning failed to differentiate symptom severity, predicting zero scores exclusively.
  • Combining real patient interviews with synthetically generated data ensured balanced score distributions for training and validation.
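The metrics named in these findings can be sketched in a few lines of Python. This is a minimal illustration, not the study's evaluation code: the function names, the choice to compare unrounded predictions against the ±1 point tolerance, and the example ratings are all assumptions made here for demonstration.

```python
from typing import Sequence


def mean_absolute_error(true: Sequence[float], pred: Sequence[float]) -> float:
    """Average absolute difference between clinician and model scores."""
    if len(true) != len(pred) or len(true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def tolerance_accuracy(
    true: Sequence[float], pred: Sequence[float], tolerance: float = 1.0
) -> float:
    """Share of predictions within `tolerance` points of the clinician score.

    Assumption: the flexible criterion is applied to the raw prediction,
    without rounding it to an integer first.
    """
    if len(true) != len(pred) or len(true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(true, pred) if abs(t - p) <= tolerance)
    return hits / len(true)


def relative_error_reduction(reference_mae: float, model_mae: float) -> float:
    """Fraction by which the model's MAE is lower than a reference MAE."""
    if reference_mae <= 0:
        raise ValueError("reference_mae must be positive")
    return 1.0 - model_mae / reference_mae


# Hypothetical ratings for one symptom item (invented for illustration).
clinician = [0, 2, 4, 3, 6, 1, 5, 2]
model = [1, 2, 2, 3, 5, 1, 3, 2]
all_zeros = [0] * len(clinician)  # mimics a model that always predicts zero

model_mae = mean_absolute_error(clinician, model)
zeros_mae = mean_absolute_error(clinician, all_zeros)
flexible_accuracy = tolerance_accuracy(clinician, model, tolerance=1.0)
reduction = relative_error_reduction(zeros_mae, model_mae)

print(f"MAE: {model_mae:.2f}")
print(f"Accuracy within ±1 point: {flexible_accuracy:.0%}")
print(f"Error reduction vs. all-zero predictions: {reduction:.0%}")
```

Reporting MAE alongside a tolerance-based accuracy is useful because the two answer different questions: MAE summarizes the typical size of the error in scale points, while the ±1 criterion counts how often a prediction lands close enough to the clinician rating to be treated as agreement.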

Clinical Implications

The fine-tuned MADRS-BERT model offers a scalable and accurate tool for automated assessment of depressive symptom severity from natural language clinical interviews. Its performance closely matches clinician ratings, supporting its potential use in clinical decision-making and treatment monitoring, especially in low-resource settings where expert evaluation may be limited. This approach may enhance routine depression screening and longitudinal symptom tracking.

Conclusion

Fine-tuning a large language model on structured clinical interview data enables precise automated assessment of depressive symptoms, demonstrating the promise of LLMs as adjunctive tools in mental health care. This method bridges the gap between natural language processing advances and clinical applicability for depression evaluation.
