Evaluating the performance of general purpose large language models in identifying human facial emotions

By
Benjamin W. Nelson
Ari Winbush
Steven Siddals
Matthew Flathers
Nicholas B. Allen
John Torous
October 16, 2025

npj Digital Medicine

Overview

This study assessed three leading large language models (LLMs), ChatGPT 4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet, on their ability to recognize human facial expressions using the NimStim dataset. ChatGPT 4o and Gemini 2.0 Experimental matched or exceeded human performance in both accuracy and agreement with ground truth labels, particularly for calm/neutral and surprise expressions, while all models struggled most with fear recognition.

Background

Facial expressions are critical indicators of human emotions and psychological states, playing a vital role in social-emotional functioning and human–computer interactions. Large language models have expanded beyond text to multimodal inputs, showing promise in social cognition tasks, including emotion recognition. Accurate AI interpretation of facial expressions could enhance behavioral healthcare by enabling earlier diagnosis and real-time monitoring of mental health conditions. However, variability in expression interpretation across cultures and contexts necessitates rigorous evaluation using validated datasets and diverse actor demographics.

Data Highlights

Model	Cohen's Kappa (κ)	Accuracy (%)
ChatGPT 4o	0.83 (95% CI: 0.80–0.85)	86 (95% CI: 84–89)
Gemini 2.0 Experimental	0.81 (95% CI: 0.77–0.84)	84 (95% CI: 81–87)
Claude 3.5 Sonnet	0.70 (95% CI: 0.67–0.74)	74 (95% CI: 71–78)
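
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how accuracy and Cohen's kappa can be computed from paired labels. It is not the study's analysis code. The function names and the toy labels are hypothetical, and the confidence intervals reported above are not reproduced here because this summary does not state how they were derived.

```python
from collections import Counter


def accuracy(truth, predicted):
    """Fraction of items where the predicted label equals the ground truth."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def cohens_kappa(truth, predicted):
    """Cohen's kappa: agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(truth)
    p_o = accuracy(truth, predicted)  # observed agreement
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)
    # Chance agreement: product of the two marginal proportions, summed over labels.
    p_e = sum(truth_counts[label] * pred_counts[label] for label in truth_counts) / (n * n)
    # Note: undefined (division by zero) when p_e == 1, i.e. both sides use a single label.
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels for illustration only, not data from the study.
truth = ["happy", "happy", "fear", "fear", "surprise", "surprise", "sad", "sad"]
predicted = ["happy", "happy", "surprise", "fear", "surprise", "surprise", "sad", "sad"]

print(f"accuracy = {accuracy(truth, predicted):.3f}")
print(f"kappa    = {cohens_kappa(truth, predicted):.3f}")
```

Kappa is lower than raw accuracy because it subtracts the agreement expected by chance given how often each label appears, which is why the table reports both.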

Key Findings

ChatGPT 4o and Gemini 2.0 Experimental demonstrated almost perfect agreement with ground truth labels (κ of 0.81 or higher), comparable to or exceeding human raters.
All models showed strong performance on Happy, Calm/Neutral, and Surprise expressions but frequently misclassified Fear as Surprise.
Claude 3.5 Sonnet had lower overall accuracy (74%) and agreement (κ = 0.70) than the other two models.
No significant differences in model performance were observed based on the sex or race of the actors.
ChatGPT 4o outperformed Claude 3.5 Sonnet on several emotions including Calm/Neutral, Sad, Disgust, and Surprise.
Compared to prior convolutional neural network models on the same dataset, these LLMs showed substantially higher accuracy without specialized training.
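
A pattern such as Fear being labeled as Surprise is typically read off a confusion table of true versus predicted labels. The sketch below shows one way such a table could be tallied. It is an illustration under assumed inputs, not the authors' method, and the helper names and sample labels are hypothetical.

```python
from collections import Counter, defaultdict


def confusion_counts(truth, predicted):
    """Map each true label to a Counter of the labels predicted for it."""
    table = defaultdict(Counter)
    for t, p in zip(truth, predicted):
        table[t][p] += 1
    return table


def most_common_error(table, label):
    """Return the wrong label most often predicted for `label`, or None if no errors."""
    errors = {p: c for p, c in table[label].items() if p != label}
    if not errors:
        return None
    return max(errors, key=errors.get)


# Hypothetical toy labels for illustration only, not data from the study.
truth = ["fear", "fear", "fear", "fear", "happy", "happy"]
predicted = ["surprise", "surprise", "fear", "sad", "happy", "happy"]

table = confusion_counts(truth, predicted)
print(dict(table["fear"]))
print(most_common_error(table, "fear"))
```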

Clinical Implications

The demonstrated ability of general-purpose LLMs to accurately recognize facial expressions suggests potential applications in behavioral healthcare, such as early detection and monitoring of mental health conditions through subtle expression changes. Their consistent performance across actor sex and race supports equitable use in clinical settings. However, limitations including static image stimuli and demographic homogeneity highlight the need for further validation with dynamic and culturally diverse datasets.

Conclusion

General-purpose large language models, particularly ChatGPT 4o and Gemini 2.0 Experimental, exhibit strong socioemotional competence in facial expression recognition, rivaling human performance. These findings support their potential integration into healthcare technologies for enhanced emotion-sensitive applications.

References

Study on LLMs and Facial Expression Recognition, 2024 -- Assessing the Efficacy of General-Purpose Large Language Models in Recognizing Human Facial Expressions
