Using a fine-tuned large language model for symptom-based depression evaluation

By Samantha Weber, Nicolas Deperrois, Robert Heun, Laura Frühschütz, Anna Monn, Stephanie Homan, Andrea Häfliger, Erich Seifritz, Tobias Kowatsch, Birgit Kleim, and Sebastian Olbrich

October 7, 2025


Clinical Scorecard: Employing a refined large language model for assessing depression through symptom analysis

At a Glance

Category | Detail
Condition | Major depressive disorder
Key Mechanisms | Fine-tuned German BERT-based large language model predicts Montgomery-Åsberg Depression Rating Scale (MADRS) scores from patient interview transcripts using regression
Target Population | Transdiagnostic patients with depressive symptoms
Care Setting | Clinical and low-resource mental health settings

Key Highlights

  • The fine-tuned MADRS-BERT model predicts individual MADRS symptom severity scores with a per-item mean absolute error between 0.7 and 1.0 points
  • Accuracy ranges from 79% to 88% across nine depressive symptom items under a flexible criterion that tolerates a ±1 point deviation from the clinician rating
  • Fine-tuning reduces prediction errors by approximately 75% compared to the base model without task-specific fine-tuning
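
The three figures above can be reproduced from paired clinician and model ratings. The following is a minimal sketch, not the authors' evaluation code. It assumes the flexible criterion counts a prediction as correct when it lies within ±1 point of the clinician rating. The function names and the example ratings are hypothetical.

```python
from typing import Sequence


def mean_absolute_error(true: Sequence[float], pred: Sequence[float]) -> float:
    """Average absolute difference between clinician and model ratings."""
    if len(true) != len(pred) or len(true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def flexible_accuracy(
    true: Sequence[float], pred: Sequence[float], tolerance: float = 1.0
) -> float:
    """Share of predictions within `tolerance` points of the clinician rating.

    Assumption: this is what the "flexible" criterion means.
    """
    if len(true) != len(pred) or len(true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(true, pred) if abs(t - p) <= tolerance)
    return hits / len(true)


def relative_error_reduction(baseline_mae: float, tuned_mae: float) -> float:
    """Fraction by which the tuned model's error is lower than the baseline's."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return (baseline_mae - tuned_mae) / baseline_mae


# Made-up ratings for one symptom item (0-6 scale), for illustration only.
clinician = [0, 2, 4, 6, 3]
model = [1, 2, 3, 4, 3]

print(mean_absolute_error(clinician, model))   # 0.8
print(flexible_accuracy(clinician, model))     # 0.8
print(relative_error_reduction(4.0, 1.0))      # 0.75
```

Here the fourth rating is off by two points, so it counts toward the mean absolute error but fails the ±1 criterion.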

Guideline-Based Recommendations

Diagnosis

  • Use structured clinical interviews such as MADRS for standardized depressive symptom assessment
  • Incorporate natural language processing tools like fine-tuned LLMs to assist in symptom severity quantification

Management

  • Employ automated LLM-based assessments to support clinical decision-making and monitor treatment progress
  • Utilize combined real and synthetic interview data to improve model robustness
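
One way to combine real and synthetic interview data is to top up under-represented severity levels with synthetic examples. The sketch below illustrates that idea only. The source does not describe the mixing procedure, so the function name, the `target_per_score` parameter, and the sample records are all assumptions.

```python
from collections import Counter, defaultdict
from typing import List, Tuple

Example = Tuple[str, int]  # (transcript text, item score on the 0-6 scale)


def balance_with_synthetic(
    real: List[Example], synthetic: List[Example], target_per_score: int
) -> List[Example]:
    """Keep all real examples and add synthetic ones only where a score
    level has fewer than `target_per_score` real examples."""
    real_counts = Counter(score for _, score in real)

    # Group the synthetic pool by score so each level can be topped up separately.
    pool = defaultdict(list)
    for text, score in synthetic:
        pool[score].append((text, score))

    combined = list(real)
    for score, candidates in pool.items():
        needed = max(0, target_per_score - real_counts[score])
        combined.extend(candidates[:needed])
    return combined


# Hypothetical data: real interviews cluster at low severity.
real = [("r1", 0), ("r2", 0), ("r3", 1)]
synthetic = [("s1", 0), ("s2", 1), ("s3", 1), ("s4", 5), ("s5", 5), ("s6", 5)]

balanced = balance_with_synthetic(real, synthetic, target_per_score=2)
print(Counter(score for _, score in balanced))
```

Real examples are never dropped in this sketch. Synthetic ones only fill gaps, which keeps the real data dominant wherever it is available.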

Monitoring & Follow-up

  • Apply LLM predictions longitudinally to track changes in depressive symptom severity
  • Consider ±1 point tolerance in symptom rating discrepancies for clinical relevance
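
Longitudinal tracking with a ±1 point tolerance can be sketched as follows. This is an illustrative example rather than a method from the source: it treats item-level changes of at most one point as within rating noise and reports only larger shifts. The helper names, visit labels, and scores are hypothetical.

```python
from typing import Dict, List, Tuple

ItemScores = Dict[str, int]  # item name -> predicted score (0-6)


def total_score(items: ItemScores) -> int:
    """Sum of the item scores available for one visit."""
    return sum(items.values())


def changes_from_baseline(visits: List[Tuple[str, ItemScores]]) -> List[Tuple[str, int]]:
    """Total-score change of each visit relative to the first (chronological) visit."""
    if not visits:
        return []
    baseline = total_score(visits[0][1])
    return [(label, total_score(items) - baseline) for label, items in visits]


def items_changed_beyond_tolerance(
    before: ItemScores, after: ItemScores, tolerance: int = 1
) -> List[str]:
    """Items whose score moved by more than `tolerance` points between two visits."""
    return sorted(
        item
        for item in before
        if item in after and abs(after[item] - before[item]) > tolerance
    )


# Hypothetical predictions for three items at two visits.
week0 = {"reported_sadness": 4, "reduced_sleep": 3, "inner_tension": 2}
week4 = {"reported_sadness": 2, "reduced_sleep": 3, "inner_tension": 1}

print(changes_from_baseline([("week0", week0), ("week4", week4)]))
print(items_changed_beyond_tolerance(week0, week4))
```

In this example the one-point drop in inner tension stays within tolerance, while the two-point drop in reported sadness is flagged.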

Risks

  • Base LLMs without task-specific fine-tuning may lack specificity and fail to differentiate symptom severity
  • Non-verbal cues (e.g., Apparent Sadness) are not captured by language-based models and require clinician assessment

Patient & Prescribing Data

Patients undergoing structured clinical interviews for depression

Automated symptom severity scoring via LLMs can complement clinician ratings and potentially enhance monitoring in resource-limited settings

Clinical Best Practices

  • Fine-tune language models on domain-specific clinical data to improve prediction accuracy
  • Combine real patient data with synthetic data to balance symptom severity distributions during model training
  • Use regression approaches to capture continuous symptom severity rather than categorical classification
  • Interpret LLM outputs within clinical context, acknowledging limitations in non-verbal symptom detection
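
A regression model returns a continuous value that still has to be mapped onto the 0-6 MADRS item scale before it is read alongside clinician ratings. The sketch below shows one plausible post-processing step, assuming clipping to the valid range and rounding to the nearest integer. The source does not specify how outputs are post-processed, and the names here are hypothetical. Items that depend on observation, such as Apparent Sadness, are left empty for the clinician to rate.

```python
import math
from typing import Dict, FrozenSet, Optional


def to_item_score(raw: float, low: int = 0, high: int = 6) -> int:
    """Clip a continuous prediction to [low, high] and round half up."""
    clipped = min(max(raw, low), high)
    # floor(x + 0.5) avoids Python's round-half-to-even behaviour in round().
    return int(math.floor(clipped + 0.5))


def score_items(
    raw_predictions: Dict[str, float],
    clinician_only: FrozenSet[str] = frozenset({"apparent_sadness"}),
) -> Dict[str, Optional[int]]:
    """Convert raw outputs to item scores; observation-based items stay None."""
    scores: Dict[str, Optional[int]] = {}
    for item, raw in raw_predictions.items():
        scores[item] = None if item in clinician_only else to_item_score(raw)
    return scores


# Hypothetical raw regression outputs.
raw = {"reported_sadness": 3.49, "inner_tension": 2.5, "reduced_sleep": 7.2,
       "apparent_sadness": 1.0}
print(score_items(raw))
```

Keeping the continuous output until this final step preserves the ordering information that a categorical classifier would discard.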
