Evaluating the performance of general purpose large language models in identifying human facial emotions - Report - MDSpire

By Benjamin W. Nelson, Ari Winbush, Steven Siddals, Matthew Flathers, Nicholas B. Allen, and John Torous

October 16, 2025

Efficacy of Large Language Models in Recognizing Human Facial Expressions

Overview

This study assessed three leading large language models (LLMs), ChatGPT 4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet, on their ability to recognize human facial expressions using the NimStim dataset. ChatGPT 4o and Gemini 2.0 Experimental matched or exceeded human performance, with high accuracy and high agreement with ground-truth labels, particularly for calm/neutral and surprise expressions. All three models struggled most with fear recognition.

Background

Facial expressions are critical indicators of human emotions and psychological states, playing a vital role in social-emotional functioning and human–computer interactions. Large language models have expanded beyond text to multimodal inputs, showing promise in social cognition tasks, including emotion recognition. Accurate AI interpretation of facial expressions could enhance behavioral healthcare by enabling earlier diagnosis and real-time monitoring of mental health conditions. However, variability in expression interpretation across cultures and contexts necessitates rigorous evaluation using validated datasets and diverse actor demographics.

Data Highlights

Model                     Cohen's Kappa (κ)            Accuracy (%)
ChatGPT 4o                0.83 (95% CI: 0.80–0.85)     86 (95% CI: 84–89)
Gemini 2.0 Experimental   0.81 (95% CI: 0.77–0.84)     84 (95% CI: 81–87)
Claude 3.5 Sonnet         0.70 (95% CI: 0.67–0.74)     74 (95% CI: 71–78)
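For readers unfamiliar with the two metrics reported here, the following is a minimal sketch of how accuracy and Cohen's kappa can be computed from paired ground-truth and model labels. It is not the study's analysis code: the function names and the eight toy labels are hypothetical, and the source does not describe how its confidence intervals were derived, so none are computed.

```python
from collections import Counter


def accuracy(truth, predicted):
    """Fraction of images whose predicted label equals the ground-truth label."""
    matches = sum(t == p for t, p in zip(truth, predicted))
    return matches / len(truth)


def cohens_kappa(truth, predicted):
    """Cohen's kappa: agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(truth)
    p_o = accuracy(truth, predicted)  # observed agreement
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)
    labels = set(truth_counts) | set(pred_counts)
    # Expected chance agreement from the two marginal label distributions.
    p_e = sum(truth_counts[label] * pred_counts[label] for label in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data (NOT from the study): one fear image is labeled surprise.
truth = ["happy", "happy", "fear", "fear", "sad", "sad", "surprise", "surprise"]
predicted = ["happy", "happy", "surprise", "fear", "sad", "sad", "surprise", "surprise"]

print(f"accuracy = {accuracy(truth, predicted):.3f}")
print(f"kappa    = {cohens_kappa(truth, predicted):.3f}")
```

Kappa is lower than raw accuracy because it discounts the agreement that would be expected by chance alone, which is why the table reports both.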

Key Findings

  • ChatGPT 4o and Gemini 2.0 Experimental demonstrated almost perfect agreement with ground-truth labels (κ = 0.83 and 0.81, respectively), comparable to or exceeding human raters.
  • All models showed strong performance on Happy, Calm/Neutral, and Surprise expressions but frequently misclassified Fear as Surprise.
  • Claude 3.5 Sonnet had lower overall accuracy (74%) and agreement (κ = 0.70) than the other two models.
  • No significant differences in model performance were observed based on the sex or race of the actors.
  • ChatGPT 4o outperformed Claude 3.5 Sonnet on several emotions including Calm/Neutral, Sad, Disgust, and Surprise.
  • Compared to prior convolutional neural network models on the same dataset, these LLMs showed substantially higher accuracy without specialized training.
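A finding such as "Fear is frequently misclassified as Surprise" is typically read off a confusion matrix of true versus predicted labels. The sketch below shows one way to tally such a matrix and find the most common wrong label for a given emotion. The helper names and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def confusion_counts(truth, predicted):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(truth, predicted))


def most_common_error(counts, true_label):
    """Return the wrong label most often assigned to `true_label`, or None."""
    errors = {
        pred: count
        for (true, pred), count in counts.items()
        if true == true_label and pred != true_label
    }
    if not errors:
        return None
    return max(errors, key=errors.get)


# Hypothetical toy data (NOT from the study).
truth = ["fear", "fear", "fear", "fear", "happy", "happy"]
predicted = ["surprise", "surprise", "fear", "sad", "happy", "happy"]

counts = confusion_counts(truth, predicted)
print(most_common_error(counts, "fear"))   # most frequent wrong label for fear
print(most_common_error(counts, "happy"))  # None: no errors for happy here
```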

Clinical Implications

The demonstrated ability of general-purpose LLMs to accurately recognize facial expressions suggests potential applications in behavioral healthcare, such as early detection and monitoring of mental health conditions through subtle expression changes. Their consistent performance across the sex and race of the actors supports equitable use in clinical settings. However, limitations, including static image stimuli and demographic homogeneity, highlight the need for further validation with dynamic and culturally diverse datasets.

Conclusion

General-purpose large language models, particularly ChatGPT 4o and Gemini 2.0 Experimental, exhibit strong socioemotional competence in facial expression recognition, rivaling human performance. These findings support their potential integration into healthcare technologies for enhanced emotion-sensitive applications.

References

  1. Study on LLMs and Facial Expression Recognition, 2024 -- Assessing the Efficacy of General-Purpose Large Language Models in Recognizing Human Facial Expressions
