Evaluating the performance of general purpose large language models in identifying human facial emotions

Category	Detail
Condition	Recognition of human facial expressions using AI
Key Mechanisms	Large language models (LLMs) process multimodal inputs to classify facial emotions
Target Population	Human facial expression datasets (NimStim actors aged 21–30, mostly European American)
Care Setting	Potential applications in behavioral healthcare and human–computer interaction

GPT-4o and Gemini 2.0 Experimental matched or exceeded human performance in facial emotion recognition, especially for calm/neutral and surprise expressions.
All models showed strong agreement with ground truth labels, but fear was frequently misclassified as surprise.
No significant performance differences were found based on actor sex or race, suggesting no detectable sex- or race-related bias within this dataset.
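The agreement and misclassification findings above can be illustrated with a minimal Python sketch. It assumes Cohen's kappa as the agreement statistic, which is one common choice; the summary does not state which statistic was used. The labels are hypothetical toy data, not results from the study, and the function names are illustrative.

```python
from collections import Counter


def confusion_counts(truth, predicted):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(truth, predicted))


def cohens_kappa(truth, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(truth)
    if n == 0 or n != len(predicted):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    truth_freq = Counter(truth)
    pred_freq = Counter(predicted)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(truth_freq[label] * pred_freq[label] for label in truth_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # every label identical on both sides; agreement is trivial
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels (NOT data from the study).
truth = ["fear", "fear", "surprise", "happy", "calm", "calm"]
predicted = ["surprise", "fear", "surprise", "happy", "calm", "calm"]

confusion = confusion_counts(truth, predicted)
kappa = cohens_kappa(truth, predicted)
print("fear labeled as surprise:", confusion[("fear", "surprise")])
print("kappa:", round(kappa, 3))
```

Inspecting the `("fear", "surprise")` cell of the confusion counts is how a systematic fear-to-surprise error of the kind described here would show up in practice.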

Utilize LLMs like GPT-4o and Gemini 2.0 Experimental for automated facial emotion recognition to support behavioral health assessments.

Incorporate AI-powered facial expression recognition systems for early detection and monitoring of mental health conditions indicated by subtle expression changes.

Apply LLM-based emotion recognition tools for real-time monitoring of emotional states in clinical or interactive settings.

Be aware that the dataset's demographics (actors aged 21–30, mostly European American) may limit generalizability.
Consider that static images without verbal context may limit accuracy; multimodal approaches that include auditory cues are recommended for future use.

Individuals whose facial expressions may indicate mental health status

AI models can aid in early diagnosis and adaptive interventions by recognizing nuanced facial expressions linked to psychological states.

Use validated and diverse datasets with normative human performance data for training and evaluating AI emotion recognition models.
Prefer general-purpose LLMs that showed high accuracy and reliability in this evaluation (e.g., GPT-4o, Gemini 2.0 Experimental) over models that performed worse.
Account for potential misclassification of fear as surprise when interpreting AI emotion recognition outputs.
Complement facial expression analysis with multimodal data (e.g., speech) to improve contextual understanding.
Continuously evaluate AI model performance across diverse populations to ensure equity and reduce bias.
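Two of the practices above, accounting for fear being labeled as surprise and checking performance across populations, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the record fields (`sex`, `truth`, `predicted`), the function names, and the rule of flagging every "surprise" output for review are all hypothetical choices, and the records are toy data rather than study results.

```python
from collections import defaultdict


def needs_review(predicted_label):
    """Flag 'surprise' outputs, since fear was often misclassified as surprise.

    Assumption: routing these outputs to human review is one possible policy.
    """
    return predicted_label == "surprise"


def accuracy_by_group(records, group_key):
    """Return accuracy per subgroup, e.g. per actor sex or race."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        correct[group] += int(record["truth"] == record["predicted"])
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical toy records (NOT data from the study).
records = [
    {"sex": "female", "truth": "fear", "predicted": "surprise"},
    {"sex": "female", "truth": "happy", "predicted": "happy"},
    {"sex": "male", "truth": "calm", "predicted": "calm"},
    {"sex": "male", "truth": "surprise", "predicted": "surprise"},
]

by_sex = accuracy_by_group(records, "sex")
flagged = [r for r in records if needs_review(r["predicted"])]
print(by_sex)
print(len(flagged), "outputs flagged for review")
```

With real data, a gap between subgroup accuracies would call for a significance test on an adequately sized sample before drawing any conclusion about bias.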

Clinical Scorecard: Assessing the Efficacy of General-Purpose Large Language Models in Recognizing Human Facial Expressions