Using a fine-tuned large language model for symptom-based depression evaluation

By Samantha Weber, Nicolas Deperrois, Robert Heun, Laura Frühschütz, Anna Monn, Stephanie Homan, Andrea Häfliger, Erich Seifritz, Tobias Kowatsch, Birgit Kleim, Sebastian Olbrich
October 7, 2025

npj Digital Medicine

Clinical Scorecard: Employing a refined large language model for assessing depression through symptom analysis

At a Glance

Category	Detail
Condition	Major depressive disorder
Key Mechanisms	Fine-tuned German BERT-based large language model predicts Montgomery-Åsberg Depression Rating Scale (MADRS) scores from patient interview transcripts using regression
Target Population	Transdiagnostic patients with depressive symptoms
Care Setting	Clinical and low-resource mental health settings

Key Highlights

The fine-tuned MADRS-BERT model predicts individual MADRS symptom severity scores with a mean absolute error between 0.7 and 1.0 points on the 0 to 6 item scale
Model accuracy ranges from 79% to 88% across nine depressive symptom items under flexible evaluation criteria (predictions within ±1 point of the clinician rating)
Fine-tuning reduces prediction errors by approximately 75% compared with the base model without task-specific fine-tuning
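
The three figures above (mean absolute error, accuracy under a ±1 point tolerance, and relative error reduction) can be illustrated with a minimal Python sketch. The function names and the rounding of predictions to whole points are assumptions made for illustration and are not taken from the paper; the example scores are made up and are not study data.

```python
def mean_absolute_error(true_scores, predicted_scores):
    """Average absolute gap between clinician ratings and model predictions."""
    pairs = list(zip(true_scores, predicted_scores))
    return sum(abs(t - p) for t, p in pairs) / len(pairs)


def tolerance_accuracy(true_scores, predicted_scores, tolerance=1):
    """Share of predictions within `tolerance` points of the clinician rating.

    Assumption: continuous regression outputs are rounded to whole points
    before comparison; the source does not describe this step.
    """
    pairs = list(zip(true_scores, predicted_scores))
    hits = sum(1 for t, p in pairs if abs(t - round(p)) <= tolerance)
    return hits / len(pairs)


def relative_error_reduction(baseline_error, fine_tuned_error):
    """Fraction by which the error shrinks relative to a baseline model."""
    return (baseline_error - fine_tuned_error) / baseline_error


# Made-up ratings for one MADRS item (0 to 6 scale), for illustration only.
clinician = [0, 2, 4, 6]
model = [0.4, 3.1, 1.8, 5.6]

mae = mean_absolute_error(clinician, model)
accuracy = tolerance_accuracy(clinician, model)
reduction = relative_error_reduction(baseline_error=4.0, fine_tuned_error=1.0)
```

Under this reading, an error reduction of about 75% means the fine-tuned model's error is roughly one quarter of the baseline model's error.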

Guideline-Based Recommendations

Diagnosis

Use structured clinical interviews such as MADRS for standardized depressive symptom assessment
Incorporate natural language processing tools like fine-tuned LLMs to assist in symptom severity quantification

Management

Employ automated LLM-based assessments to support clinical decision-making and monitor treatment progress
Utilize combined real and synthetic interview data to improve model robustness

Monitoring & Follow-up

Apply LLM predictions longitudinally to track changes in depressive symptom severity
Consider ±1 point tolerance in symptom rating discrepancies for clinical relevance
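
One way to combine the two points above is to treat visit-to-visit changes that fall within the ±1 point tolerance as rating noise and to surface only larger shifts. The sketch below is an illustrative assumption, not a procedure described in the source; `meaningful_changes` is a hypothetical helper name and the scores are made up.

```python
def meaningful_changes(scores_by_visit, tolerance=1):
    """Return (visit_index, delta) for changes larger than `tolerance`.

    `scores_by_visit` holds one predicted item score per visit, in time order.
    A negative delta means the predicted severity decreased.
    """
    changes = []
    for i in range(1, len(scores_by_visit)):
        delta = scores_by_visit[i] - scores_by_visit[i - 1]
        if abs(delta) > tolerance:
            changes.append((i, delta))
    return changes


# Made-up predicted scores for one symptom item across five visits.
flagged = meaningful_changes([4, 4, 3, 1, 2])
```

Any change flagged this way would still need to be interpreted by a clinician.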

Risks

Base LLMs without task-specific fine-tuning may lack specificity and fail to differentiate symptom severity
Non-verbal cues (e.g., Apparent Sadness) are not captured by language-based models and require clinician assessment

Patient & Prescribing Data

Patients undergoing structured clinical interviews for depression

Automated symptom severity scoring via LLMs can complement clinician ratings and potentially enhance monitoring in resource-limited settings

Clinical Best Practices

Fine-tune language models on domain-specific clinical data to improve prediction accuracy
Combine real patient data with synthetic data to balance symptom severity distributions during model training
Use regression approaches to capture continuous symptom severity rather than categorical classification
Interpret LLM outputs within clinical context, acknowledging limitations in non-verbal symptom detection
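
The idea of combining real and synthetic data to balance severity distributions can be sketched as follows. This is a minimal illustration under stated assumptions: the scorecard does not describe how the authors balanced their data, so the top-up rule, the function name `balance_with_synthetic`, and the (text, score) record layout are all hypothetical.

```python
from collections import Counter


def balance_with_synthetic(real, synthetic, target_per_score=None):
    """Top up under-represented severity scores with synthetic examples.

    `real` and `synthetic` are lists of (transcript_text, item_score) pairs.
    All real examples are kept. Synthetic examples are added in order until
    each score level reaches `target_per_score`, which defaults to the size
    of the largest score level in the real data.
    """
    counts = Counter(score for _, score in real)
    if target_per_score is None:
        target_per_score = max(counts.values(), default=0)

    combined = list(real)
    for text, score in synthetic:
        if counts[score] < target_per_score:
            combined.append((text, score))
            counts[score] += 1
    return combined


# Made-up placeholder records: mild scores dominate the real data.
real_data = [("a", 0), ("b", 0), ("c", 0), ("d", 5)]
synthetic_data = [("s1", 5), ("s2", 5), ("s3", 5), ("s4", 0), ("s5", 3)]

training_set = balance_with_synthetic(real_data, synthetic_data)
score_counts = Counter(score for _, score in training_set)
```

Because the target is a continuous severity score, a training set built this way would feed a regression objective rather than a set of class labels, in line with the practice listed above.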
