Evaluating the performance of general purpose large language models in identifying human facial emotions - Summary - MDSpire

By Benjamin W. Nelson, Ari Winbush, Steven Siddals, Matthew Flathers, Nicholas B. Allen, and John Torous

October 16, 2025

Objective:

To evaluate how accurately three leading LLMs (GPT-4o, Gemini 2.0 Experimental, and Claude 3.5) recognize human facial expressions in the NimStim dataset, as an indicator of their socioemotional competence.

Key Findings:
  • GPT-4o and Gemini 2.0 Experimental matched or exceeded human performance, particularly for calm/neutral and surprise expressions, with GPT-4o achieving the highest overall accuracy.
  • Overall accuracy was 86% for GPT-4o, 84% for Gemini 2.0 Experimental, and 74% for Claude 3.5, placing Claude 3.5 10 to 12 percentage points behind the other two models.
  • Fear was frequently misclassified as surprise across models, highlighting a common area of error.
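
The two kinds of figures reported above, overall accuracy and a recurring confusion such as fear labelled as surprise, can both be derived from paired lists of true and predicted labels. The following is a minimal sketch, not the study's actual analysis code; the function names and the six toy labels are hypothetical and exist only to illustrate the arithmetic.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Return the fraction of items whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count every (true, predicted) pair, including correct ones."""
    return Counter(zip(true_labels, predicted_labels))


def most_common_error(true_labels, predicted_labels):
    """Return the most frequent misclassification as ((true, predicted), count), or None."""
    errors = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return errors.most_common(1)[0] if errors else None


# Toy labels for illustration only; these are NOT data from the study.
true_labels = ["fear", "fear", "fear", "surprise", "calm", "happy"]
predicted_labels = ["surprise", "surprise", "fear", "surprise", "calm", "happy"]

print(f"accuracy: {accuracy(true_labels, predicted_labels):.2f}")
print("most common error:", most_common_error(true_labels, predicted_labels))
```

Counting (true, predicted) pairs rather than only correct answers is what makes a systematic error visible: a model can have reasonable overall accuracy while still sending most of one category to a single wrong label.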

Interpretation:

The findings indicate that LLMs are developing socioemotional competence, with potential applications in healthcare for recognizing mental health conditions such as depression and anxiety.

Limitations:
  • All stimuli were static images, limiting generalizability to dynamic expressions.
  • Actors were predominantly aged 21-30 and European American, which may affect results and applicability to diverse populations.
  • The study relied on a single dataset, which may limit broader applicability and necessitates further validation across varied datasets.

Conclusion:

While LLMs show promise in facial expression recognition, further research is needed to enhance generalizability and explore multimodal emotion classification, particularly in diverse contexts.
