Automated Depression Detection via Language Analysis: Systematic Review & Meta-Analysis
Overview
This systematic review and meta-analysis evaluated 123 studies that used natural language processing and machine learning to detect depression from text. Pooled accuracy across 43 studies was 0.80, with precision 0.78, recall 0.76, and AUC 0.79. These estimates are promising, but performance varied considerably between studies.
Background
Early identification of depression is crucial for timely intervention and improved outcomes. Advances in natural language processing (NLP) and machine learning (ML) have enabled automated detection of depression from spoken or written language. Despite growing research, the overall diagnostic performance and factors influencing accuracy remain unclear. This review synthesizes existing evidence to assess the effectiveness and limitations of these automated approaches.
Data Highlights
Metric               Number of Studies    Pooled Estimate
Accuracy             43                   0.80
Precision            28                   0.78
Recall               33                   0.76
AUC                  14                   0.79
Balanced Accuracy    16                   0.71
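The metrics in the table are all derived from a classifier's confusion matrix. The following is a minimal sketch of those standard definitions, not the review's own code. The class name, function name, and example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts for a binary depressed / not-depressed classifier."""
    tp: int  # depressed, flagged as depressed
    fp: int  # not depressed, flagged as depressed
    tn: int  # not depressed, not flagged
    fn: int  # depressed, not flagged (missed case)


def classification_metrics(c: ConfusionCounts) -> dict:
    """Return accuracy, precision, recall, and balanced accuracy."""
    sensitivity = c.tp / (c.tp + c.fn)  # same quantity as recall
    specificity = c.tn / (c.tn + c.fp)
    return {
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn),
        "precision": c.tp / (c.tp + c.fp),
        "recall": sensitivity,
        # Mean of sensitivity and specificity, so each class counts equally
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical, imbalanced sample: 40 depressed and 160 non-depressed texts
example = ConfusionCounts(tp=30, fp=10, tn=150, fn=10)
metrics = classification_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up sample, accuracy is 0.90 while balanced accuracy is about 0.84, which shows how plain accuracy can sit above balanced accuracy when one class dominates. Whether class imbalance accounts for the gap between the pooled values of 0.80 and 0.71 is not something the summary states.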
Key Findings
Pooled accuracy of automated depression detection from language was 0.80 across 40,983 text samples.
Pooled precision and recall were similar (0.78 and 0.76, respectively), suggesting that false positives and missed cases occurred at roughly comparable rates.
Pooled area under the curve (AUC) was 0.79, indicating acceptable discriminative ability.
Heterogeneity between studies was significant and was associated with language, text source, feature type, and classifier.
Accuracy was highest in studies using structured clinical interviews, non-English languages, and linguistic or embedding-based features.
In meta-regression, text source was the only significant predictor, explaining 13.6% of between-study variance.
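The pooled estimates, heterogeneity, and "variance explained" figures above can be illustrated with a small sketch. It assumes a DerSimonian-Laird random-effects model, which is a common choice but is not confirmed by this summary as the review's method. It pools raw values for simplicity, whereas proportions such as accuracy are often transformed (for example to the logit scale) before pooling. All function names and input numbers are hypothetical.

```python
import math


def dersimonian_laird(effects, variances):
    """Random-effects pooling of per-study estimates (DerSimonian-Laird)."""
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw

    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau2 to each study's own variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))

    # I^2: share of total variation attributed to heterogeneity
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {"pooled": pooled, "se": se, "tau2": tau2, "q": q, "i2": i2}


def variance_explained(tau2_without_moderator, tau2_with_moderator):
    """Proportional drop in tau^2 after adding a moderator (R^2 analog)."""
    if tau2_without_moderator <= 0:
        return 0.0
    reduction = tau2_without_moderator - tau2_with_moderator
    return max(0.0, reduction / tau2_without_moderator)


# Hypothetical accuracies and sampling variances for three studies
result = dersimonian_laird([0.70, 0.80, 0.90], [0.001, 0.001, 0.001])
print(result)

# Hypothetical tau^2 values before and after adding text source as a moderator
r2 = variance_explained(0.0100, 0.00864)
print(f"variance explained: {r2:.3f}")
```

Read this way, a moderator that explains 13.6% of between-study variance reduces the estimated between-study variance by that proportion and leaves most of the heterogeneity unexplained.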
Clinical Implications
Automated depression detection using NLP and ML shows potential as a supplementary screening tool, especially when applied to structured clinical interviews and diverse languages. However, substantial variability in methods and performance underscores the need for standardized protocols and rigorous validation before clinical implementation. Clinicians should interpret automated results cautiously and in conjunction with comprehensive clinical assessment.
Conclusion
Automated language-based depression detection demonstrates promising accuracy but is limited by heterogeneity and methodological inconsistencies. Future research should focus on standardization and external validation to enable reliable clinical application.