Multimodal emotion recognition often suffers from performance degradation in valence-arousal estimation due to noise and misalignment between audio and visual modalities. To address this challenge, we introduce TAGF, a Time-aware Gated Fusion framework for multimodal emotion recognition. TAGF adaptively modulates the contribution of recursive attention outputs based on temporal dynamics. Specifically, it incorporates a BiLSTM-based temporal gating mechanism to learn the relative importance of each recursive step and to effectively integrate multi-step cross-modal features. By embedding temporal awareness into the recursive fusion process, TAGF effectively captures the sequential evolution of emotional expressions and the complex interplay between modalities. Experimental results on the Aff-Wild2 dataset demonstrate that TAGF achieves competitive performance compared with existing recursive attention-based models. Furthermore, TAGF exhibits strong robustness to cross-modal misalignment and reliably models dynamic emotional transitions in real-world conditions.
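To make the gating mechanism concrete, here is a minimal PyTorch-style sketch of a BiLSTM temporal gate over recursive attention outputs; the class name TemporalGatedFusion, the dimensions, and the softmax weighting are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a BiLSTM-based temporal gate over recursive attention outputs.
# Module names and dimensions are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class TemporalGatedFusion(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.gate = nn.Linear(2 * hidden_dim, 1)

    def forward(self, step_feats: torch.Tensor) -> torch.Tensor:
        # step_feats: (batch, num_recursive_steps, feat_dim), one fused
        # audio-visual feature per recursive attention step.
        h, _ = self.bilstm(step_feats)                  # (B, S, 2 * hidden_dim)
        weights = torch.softmax(self.gate(h), dim=1)    # (B, S, 1), relative step importance
        return (weights * step_feats).sum(dim=1)        # (B, feat_dim), gated fusion

# Toy usage: fuse three recursive steps of 256-dim cross-modal features.
fused = TemporalGatedFusion(feat_dim=256)(torch.randn(4, 3, 256))
```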
Owing to powerful natural language processing and generative capabilities, large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, in the realm of video recommendation, existing studies predominantly resort to prompt-based simulation with frozen LLMs and struggle with the intricate challenge of multimodal content understanding. This frequently results in suboptimal item modeling and user preference learning, ultimately constraining recommendation performance. To address these challenges, we introduce VRAgent-R1, a novel agent-based paradigm that incorporates human-like intelligence into user simulation. Specifically, VRAgent-R1 comprises two distinct agents, the Item Perception (IP) Agent and the User Simulation (US) Agent, designed for interactive user-item modeling. First, the IP Agent emulates human-like progressive thinking based on MLLMs, effectively capturing hidden recommendation semantics in videos. With the more comprehensive multimodal content understanding provided by the IP Agent, the video recommendation system can supply higher-quality candidate items. Subsequently, the US Agent refines the recommended video sets through in-depth chain-of-thought (CoT) reasoning and achieves better alignment with real user preferences through reinforcement learning. Experimental results on a large-scale video recommendation benchmark demonstrate the effectiveness of our proposed VRAgent-R1: the IP Agent achieves a 6.0% improvement in NDCG@10 on the MicroLens-100k dataset, while the US Agent attains approximately 45.0% higher accuracy in user decision simulation than state-of-the-art baselines.
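As a rough illustration of the two-agent pipeline, the sketch below wires an item-perception step and a CoT-style user-simulation step together; the prompts and the call_mllm / call_llm helpers are hypothetical stand-ins for whatever MLLM/LLM backend and reinforcement-learning fine-tuning VRAgent-R1 actually uses.

```python
# Hedged sketch of the IP Agent / US Agent interaction; backends are dummy stubs.
from typing import Dict, List

def call_mllm(prompt: str, video_frames) -> str:
    # Hypothetical multimodal LLM call; replace with a real backend.
    return "a short cooking tutorial aimed at beginners"

def call_llm(prompt: str) -> str:
    # Hypothetical text-only LLM call; replace with a real backend.
    return "2 0 1"

def item_perception_agent(video_frames, metadata: Dict) -> str:
    """IP Agent: progressive, step-by-step description of a candidate video."""
    prompt = (
        "Step 1: describe the visual content.\n"
        "Step 2: infer topic, style, and target audience.\n"
        "Step 3: summarize the recommendation-relevant semantics.\n"
        f"Title: {metadata.get('title', '')}"
    )
    return call_mllm(prompt, video_frames)

def user_simulation_agent(user_history: List[str], candidates: List[str]) -> List[int]:
    """US Agent: chain-of-thought re-ranking of candidates for a simulated user."""
    prompt = (
        f"User watch history: {user_history}\n"
        f"Candidate summaries: {candidates}\n"
        "Reason step by step about which candidates this user would click, "
        "then output the preferred candidate indices."
    )
    return [int(tok) for tok in call_llm(prompt).split() if tok.isdigit()]
```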
Unfairness is a well-known challenge in Recommender Systems (RSs), often resulting in biased outcomes that disadvantage users or items based on attributes such as gender, race, age, or popularity. Although some approaches have started to improve recommendation fairness in offline or static settings, unfairness is often exacerbated over time, leading to significant problems such as the Matthew effect, filter bubbles, and echo chambers. To address these challenges, we propose Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System (HyFairCRS), a novel framework aiming to promote multi-interest diversity fairness in dynamic and interactive Conversational Recommender Systems (CRSs). HyFairCRS first captures a wide range of user interests by establishing diverse hypergraphs through contrastive learning. These interests are then utilized in conversations to generate informative responses and to ensure fair item predictions within the dynamic user-system feedback loop. Experiments on two CRS-based datasets show that HyFairCRS achieves new state-of-the-art performance while effectively alleviating unfairness. Our code is available at https://github.com/zysensmile/HyFairCRS.
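The contrastive multi-interest component can be sketched as an InfoNCE-style loss between user-interest embeddings obtained from two hypergraph views; the temperature, the embedding dimensions, and the use of in-batch negatives are assumptions, not details taken from the HyFairCRS code.

```python
# Minimal sketch of a contrastive objective over two hypergraph-based interest views.
import torch
import torch.nn.functional as F

def interest_contrastive_loss(view_a: torch.Tensor, view_b: torch.Tensor, tau: float = 0.2):
    # view_a, view_b: (num_users, dim) interest embeddings from two hypergraph views.
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / tau                  # cosine similarity between all user pairs
    targets = torch.arange(a.size(0))         # the matching row in the other view is the positive
    return F.cross_entropy(logits, targets)

# Toy usage with random 64-dim embeddings for 32 users.
loss = interest_contrastive_loss(torch.randn(32, 64), torch.randn(32, 64))
```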
Video-to-Audio (V2A) generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic-language scenarios. By simulating cinematic-language variations, the student model learns to align the video features of training pairs that share the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.
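A hedged sketch of the self-distillation objective: the student encoder sees a simulated cinematic-language variation (a simple spatial crop standing in for a partially visible Foley target) and is trained to match the frozen teacher's features for the full view; the encoder interfaces and the cropping strategy are illustrative assumptions.

```python
# Sketch of feature-alignment self-distillation under simulated partial visibility.
import torch
import torch.nn.functional as F

def simulate_partial_visibility(frames: torch.Tensor) -> torch.Tensor:
    # frames: (B, T, C, H, W); crop to one quadrant as a stand-in for a shot
    # where the sounding object is only partially in frame.
    _, _, _, h, w = frames.shape
    return frames[..., : h // 2, : w // 2]

def distillation_loss(teacher_encoder, student_encoder, frames: torch.Tensor) -> torch.Tensor:
    # Both encoders are assumed to map (B, T, C, H, W) clips to feature tensors
    # of the same shape regardless of spatial resolution.
    with torch.no_grad():
        target = teacher_encoder(frames)                          # full-view features
    pred = student_encoder(simulate_partial_visibility(frames))   # partial-view features
    return F.mse_loss(pred, target)
```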
The compression of real-world scanned 3D human dynamic meshes is an emerging research area, driven by applications such as telepresence, virtual reality, and 3D digital streaming. Unlike synthesized dynamic meshes with fixed topology, scanned dynamic meshes not only have varying topology across frames but also exhibit scan defects such as holes and outliers, which increases the complexity of prediction and compression. Additionally, human meshes often combine rigid and non-rigid motions, making accurate prediction and encoding significantly more difficult than for objects exhibiting purely rigid motion. To address these challenges, we propose a compression method designed for real-world scanned human dynamic meshes that leverages embedded key nodes. The temporal motion of each vertex is formulated as a distance-weighted combination of the transformations of its neighboring key nodes, so that only the key nodes' transformations need to be transmitted. To enhance the quality of the KeyNode-driven prediction, we introduce an octree-based residual coding scheme and a Dual-direction prediction mode that uses I-frames from both directions. Extensive experiments demonstrate that our method achieves significant improvements over the state of the art, with an average bitrate saving of 58.43% across the evaluated sequences, particularly excelling at low bitrates.
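To illustrate the KeyNode-driven prediction, the NumPy sketch below blends the rigid transformations of each vertex's nearest key nodes with normalized inverse-distance weights; the neighbor count, the weighting scheme, and the transformation parameterization are assumptions rather than the paper's exact formulation.

```python
# Sketch of distance-weighted vertex motion prediction from neighboring key nodes.
import numpy as np

def predict_vertices(verts, key_nodes, rotations, translations, k=4, eps=1e-8):
    # verts: (V, 3), key_nodes: (K, 3), rotations: (K, 3, 3), translations: (K, 3)
    pred = np.zeros_like(verts)
    for i, v in enumerate(verts):
        d = np.linalg.norm(key_nodes - v, axis=1)   # distance to every key node
        nn = np.argsort(d)[:k]                      # indices of the k nearest key nodes
        w = 1.0 / (d[nn] + eps)
        w /= w.sum()                                # normalized inverse-distance weights
        # Blend the vertex position transformed under each neighboring key node.
        pred[i] = sum(
            w[j] * (rotations[n] @ (v - key_nodes[n]) + key_nodes[n] + translations[n])
            for j, n in enumerate(nn)
        )
    return pred
```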