Evaluating the performance of general purpose large language models in identifying human facial emotions

By Benjamin W. Nelson, Ari Winbush, Steven Siddals, Matthew Flathers, Nicholas B. Allen, John Torous

October 16, 2025

Clinical Scorecard: Assessing the Efficacy of General-Purpose Large Language Models in Recognizing Human Facial Expressions

At a Glance

Category | Detail
Condition | Recognition of human facial expressions using AI
Key Mechanisms | Large language models (LLMs) process multimodal inputs to classify facial emotions
Target Population | Human facial expression datasets (NimStim actors aged 21–30, mostly European American)
Care Setting | Potential applications in behavioral healthcare and human–computer interaction

Key Highlights

  • GPT-4o and Gemini 2.0 Experimental matched or exceeded human performance in facial emotion recognition, especially for calm/neutral and surprise expressions.
  • All models showed strong agreement with ground truth labels, but fear was frequently misclassified as surprise.
  • No significant performance differences were found by actor sex or race, suggesting limited demographic bias in facial emotion recognition within this dataset.
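
To make "agreement with ground truth" and the fear-to-surprise confusion concrete, the sketch below counts (true, predicted) label pairs and computes Cohen's kappa, a common chance-corrected agreement statistic. This summary does not say which agreement metric the study used, so kappa is an assumption here, and the labels are hypothetical toy values, not data from the study.

```python
from collections import Counter


def confusion_counts(truth, predicted):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(truth, predicted))


def cohens_kappa(truth, predicted):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(truth)
    # Observed agreement: fraction of items where both labels match.
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    # Expected agreement under independence, from the marginal label counts.
    truth_freq = Counter(truth)
    pred_freq = Counter(predicted)
    expected = sum(truth_freq[label] * pred_freq[label] for label in truth_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels for illustration only (NOT results from the study).
truth = ["fear", "fear", "surprise", "calm", "happy", "fear"]
predicted = ["surprise", "fear", "surprise", "calm", "happy", "surprise"]

counts = confusion_counts(truth, predicted)
print(counts[("fear", "surprise")])  # times fear was labeled as surprise
print(round(cohens_kappa(truth, predicted), 3))
```

In this toy example, two of the three fear images are labeled as surprise, which lowers kappa even though every other expression is classified correctly.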

Guideline-Based Recommendations

Diagnosis

  • Utilize LLMs like GPT-4o and Gemini 2.0 Experimental for automated facial emotion recognition to support behavioral health assessments.

Management

  • Incorporate AI-powered facial expression recognition systems for early detection and monitoring of mental health conditions indicated by subtle expression changes.

Monitoring & Follow-up

  • Apply LLM-based emotion recognition tools for real-time monitoring of emotional states in clinical or interactive settings.

Risks

  • Be aware that the dataset's demographics (actors aged 21–30, mostly European American) may limit generalizability to other populations.
  • Consider that static images without verbal context may limit accuracy; multimodal approaches including auditory cues are recommended for future use.

Patient & Prescribing Data

Individuals whose facial expressions may indicate mental health status

AI models can aid in early diagnosis and adaptive interventions by recognizing nuanced facial expressions linked to psychological states.

Clinical Best Practices

  • Use validated and diverse datasets with normative human performance data for training and evaluating AI emotion recognition models.
  • Prefer general-purpose LLMs with demonstrated high accuracy and reliability (e.g., GPT-4o, Gemini 2.0 Experimental) over less accurate models.
  • Account for potential misclassification of fear as surprise when interpreting AI emotion recognition outputs.
  • Complement facial expression analysis with multimodal data (e.g., speech) to improve contextual understanding.
  • Continuously evaluate AI model performance across diverse populations to ensure equity and reduce bias.
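
One simple way to act on the last practice is to track accuracy separately for each demographic group and watch the gap between the best and worst group. The sketch below assumes records of the form (group, true label, predicted label). The group names and labels are hypothetical placeholders, and a raw accuracy gap is only a screening signal, not a significance test.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for (group, true_label, predicted_label) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group):
    """Difference between the highest and lowest group accuracy."""
    values = list(per_group.values())
    return max(values) - min(values)


# Hypothetical placeholder records for illustration only (NOT study data).
records = [
    ("group_a", "fear", "surprise"),
    ("group_a", "happy", "happy"),
    ("group_a", "calm", "calm"),
    ("group_a", "sad", "sad"),
    ("group_b", "fear", "fear"),
    ("group_b", "happy", "happy"),
    ("group_b", "calm", "calm"),
    ("group_b", "sad", "sad"),
]

per_group = accuracy_by_group(records)
print(per_group)
print(max_accuracy_gap(per_group))
```

A gap that persists or widens across evaluation rounds would be a cue to run a proper statistical comparison on a larger, more diverse sample.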
