Multimodal machine learning for video based single question mental health assessment

By
Bradley Grimm
Pernille Yilmam
Brett Talbot
Loren Larsen
December 16, 2025

npj Digital Medicine

At a Glance

Category	Detail
Condition	Depression, Anxiety, and PTSD
Key Mechanisms	Multimodal machine learning model integrating textual (MPNet) and voice prosody (HuBERT) analysis of a single open-ended video response
Target Population	Adults undergoing mental health screening across diverse demographic groups
Care Setting	Primary care and clinical settings with limited mental health provider availability

Key Highlights

A single open-ended question predicts PHQ-9, GAD-7, and PCL-5 scores simultaneously, with accuracy comparable to that of traditional questionnaires.
The approach reduces assessment time by 64.6% (78.4 seconds vs. 221.7 seconds) compared to sequential administration of three separate surveys.
User acceptance was high: 90.7% of participants were willing to use video-based screening, and only 1.4% were unwilling to participate.
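
The reported time saving can be checked directly from the two timings quoted above. The short Python sketch below only reproduces that arithmetic; the variable names are illustrative and not taken from the source.

```python
# Timings quoted in the highlights above, in seconds.
single_question_s = 78.4   # single open-ended video question
three_surveys_s = 221.7    # three separate surveys administered in sequence

# Relative reduction = 1 - (new time / old time).
reduction = 1 - single_question_s / three_surveys_s
saving_s = three_surveys_s - single_question_s

print(f"Relative time reduction: {reduction:.1%}")
print(f"Absolute saving: {saving_s:.1f} s")
```

The relative reduction rounds to the 64.6% stated above, which corresponds to a little over two minutes saved per assessment.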

Guideline-Based Recommendations

Diagnosis

Use a single validated open-ended question: 'In the last 2 weeks have you felt down, nervous, depressed, anxious, hopeless or on edge? If so, please explain in detail how it has bothered you or impacted your life?'
Employ multimodal analysis combining semantic text and voice prosody features to predict depression, anxiety, and PTSD symptom severity.
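
As a rough illustration of what combining the two modalities can look like, the sketch below concatenates a text embedding and a voice embedding and applies one linear head per instrument. The source names MPNet and HuBERT as the feature extractors but does not describe the fusion step here, so the concatenation, the linear heads, the tiny toy vectors, and every weight are assumptions made for illustration only. The score ranges used for clipping are the standard ones for the three instruments.

```python
from typing import Dict, List, Sequence, Tuple

# Valid total-score ranges: PHQ-9 is 0-27, GAD-7 is 0-21, PCL-5 is 0-80.
SCORE_RANGES: Dict[str, Tuple[float, float]] = {
    "PHQ-9": (0, 27),
    "GAD-7": (0, 21),
    "PCL-5": (0, 80),
}


def fuse(text_emb: Sequence[float], voice_emb: Sequence[float]) -> List[float]:
    """Fuse by concatenation (an assumed strategy, not confirmed by the source)."""
    return list(text_emb) + list(voice_emb)


def predict_scores(
    fused: Sequence[float],
    weights: Dict[str, Sequence[float]],
    biases: Dict[str, float],
) -> Dict[str, float]:
    """Apply one linear head per instrument, then clip to its valid range."""
    scores = {}
    for name, (low, high) in SCORE_RANGES.items():
        raw = sum(w * x for w, x in zip(weights[name], fused)) + biases[name]
        scores[name] = min(max(raw, low), high)
    return scores


# Toy stand-ins for the embeddings; real MPNet and HuBERT vectors are far larger.
text_emb = [0.5, -0.2]
voice_emb = [0.1, 0.9]
fused = fuse(text_emb, voice_emb)

# Hypothetical weights and biases; a real model would learn these from data.
weights = {
    "PHQ-9": [4.0, 0.0, 0.0, 6.0],
    "GAD-7": [2.0, 1.0, 0.0, 5.0],
    "PCL-5": [10.0, 0.0, 5.0, 100.0],
}
biases = {"PHQ-9": 3.0, "GAD-7": 2.0, "PCL-5": 10.0}

scores = predict_scores(fused, weights, biases)
for name, value in scores.items():
    print(f"{name}: {value:.1f}")
```

The point of the sketch is the shape of the computation: one shared input built from both modalities, and three outputs produced in a single pass, which is what allows one response to yield all three severity estimates.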

Management

Integrate efficient multi-condition screening to reduce patient burden and improve engagement in high-volume or longitudinal clinical settings.
Use video-based screening tools to facilitate remote or in-person mental health assessments.

Monitoring & Follow-up

Monitor mental health symptoms longitudinally using the single-question multimodal approach to track changes efficiently over time.

Risks

Be aware that trauma exposure is not directly assessed; the model predicts current PTSD symptom severity but not trauma history.
Validate model performance across demographic groups and confirm that accuracy is consistent between them, to avoid biased screening results.
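
One simple way to check consistency across groups is to compute the prediction error separately for each group and compare. The sketch below uses mean absolute error as the metric; that choice, the group labels, and the records are all made up for illustration and are not results or methods from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, float, float]  # (group label, questionnaire score, predicted score)


def mae_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """Return the mean absolute prediction error for each group."""
    errors = defaultdict(list)
    for group, true_score, predicted_score in records:
        errors[group].append(abs(true_score - predicted_score))
    return {group: sum(errs) / len(errs) for group, errs in errors.items()}


# Made-up example records, NOT study data.
records = [
    ("group_a", 10, 12),
    ("group_a", 4, 3),
    ("group_b", 15, 10),
    ("group_b", 8, 9),
]

per_group = mae_by_group(records)
gap = max(per_group.values()) - min(per_group.values())

for group, mae in sorted(per_group.items()):
    print(f"{group}: MAE = {mae:.2f}")
print(f"Largest gap between groups: {gap:.2f}")
```

A large gap between groups would be a signal to investigate before relying on the tool for that population; what counts as too large depends on the instrument and the clinical context.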

Patient & Prescribing Data

2,420 participants from five independent cohorts, diverse in age, gender, and race/ethnicity.

The single-question multimodal model demonstrates strong predictive performance and high acceptability, supporting its use as a scalable screening tool to guide further clinical evaluation and treatment planning.

Clinical Best Practices

Adopt multimodal machine learning tools combining text and voice analysis for efficient mental health screening.
Use a single validated open-ended question to reduce assessment time and patient fatigue.
Ensure screening tools are validated across demographic groups to maintain reliability and equity.
Incorporate video-based assessments to enhance engagement and accessibility in clinical workflows.
