CLAIP-Emo: Parameter-Efficient Adaptation of Language-supervised models for In-the-Wild Audiovisual Emotion Recognition

Chen, Yin; Li, Jia; Hu, Jinpeng; Hu, Zhenzhen; Hong, Richang

arXiv:2509.14527 (cs)

[Submitted on 18 Sep 2025]

Authors: Yin Chen, Jia Li, Jinpeng Hu, Zhenzhen Hu, Richang Hong

Abstract: Audiovisual emotion recognition (AVER) in the wild is still hindered by pose variation, occlusion, and background noise. Prevailing methods primarily rely on large-scale domain-specific pre-training, which is costly and often mismatched to real-world affective data. To address this, we present CLAIP-Emo, a modular framework that reframes in-the-wild AVER as a parameter-efficient adaptation of language-supervised foundation models (CLIP/CLAP). Specifically, it (i) preserves language-supervised priors by freezing CLIP/CLAP backbones and performing emotion-oriented adaptation via LoRA (updating ≤4.0% of the total parameters), (ii) allocates temporal modeling asymmetrically, employing a lightweight Transformer for visual dynamics while applying mean pooling for audio prosody, and (iii) applies a simple fusion head for prediction. On DFEW and MAFW, CLAIP-Emo (ViT-L/14) achieves 80.14% and 61.18% weighted average recall with only 8M training parameters, setting a new state of the art. Our findings suggest that parameter-efficient adaptation of language-supervised foundation models provides a scalable alternative to domain-specific pre-training for real-world AVER. The code and models will be available at https://github.com/MSA-LMC/CLAIP-Emo.
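
As a rough illustration of two ingredients named in the abstract, the following minimal sketch shows a LoRA-style low-rank update on a frozen weight matrix and mean pooling over audio frame features. It is not the authors' implementation: the function names, the tiny list-based matrices, the `alpha` scaling, and the 1024-dimensional layer in the usage note are all assumptions chosen for readability, and the visual temporal Transformer and fusion head are omitted.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # row-major: matrix[i][j]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_linear(x: Vector, w_frozen: Matrix, lora_a: Matrix,
                lora_b: Matrix, alpha: float) -> Vector:
    """Compute y = W x + (alpha / r) * B (A x).

    W stays frozen; only A (r x d_in) and B (d_out x r) would be trained.
    """
    rank = len(lora_a)
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of parameters that are trainable for one adapted linear layer."""
    frozen = d_in * d_out
    trainable = rank * (d_in + d_out)
    return trainable / (frozen + trainable)


def mean_pool(frames: List[Vector]) -> Vector:
    """Average per-frame features over time (hypothetical audio branch)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]
```

For example, with a hypothetical 1024 x 1024 layer and rank 8, `lora_trainable_fraction(1024, 1024, 8)` is 1/65, roughly 1.5% of that layer's parameters. This figure is illustrative only and is not taken from the paper's configuration.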

Comments:	The code and models will be available at https://github.com/MSA-LMC/CLAIP-Emo
Subjects:	Multimedia (cs.MM); Sound (cs.SD)
Cite as:	arXiv:2509.14527 [cs.MM]
	(or arXiv:2509.14527v1 [cs.MM] for this version)
	https://doi.org/10.48550/arXiv.2509.14527

Submission history

From: Yin Chen
[v1] Thu, 18 Sep 2025 01:45:44 UTC (693 KB)
