ARIG: Autoregressive Interactive Head Generation for Real-time Conversations

Guo, Ying; Liu, Xi; Zhen, Cheng; Yan, Pengfei; Wei, Xiaoming

arXiv:2507.00472 (cs)

[Submitted on 1 Jul 2025]

Abstract: Face-to-face communication, as a common human activity, motivates research on interactive head generation. A virtual agent can generate motion responses with both listening and speaking capabilities, based on the audio or motion signals of the other user and of itself. However, previous clip-wise generation paradigms and explicit listener/speaker generator-switching methods have limitations in future signal acquisition, contextual behavioral understanding, and switching smoothness, which makes real-time, realistic generation difficult. In this paper, we propose an autoregressive (AR) frame-wise framework called ARIG to achieve real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent the motion distribution using a diffusion procedure, achieving more accurate predictions in continuous space. To improve interaction realism, we emphasize interactive behavior understanding (IBU) and detailed conversational state understanding (CSU). In IBU, based on dual-track dual-modal signals, we summarize short-range behaviors through bidirectional-integrated learning and perform contextual understanding over long ranges. In CSU, we use voice activity signals and the context features of IBU to understand the various states (interruption, feedback, pause, etc.) that exist in actual conversations. These serve as conditions for the final progressive motion prediction. Extensive experiments have verified the effectiveness of our model.

Comments:	ICCV 2025. Homepage: https://jinyugy21.github.io/ARIG/
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2507.00472 [cs.CV]
	(or arXiv:2507.00472v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2507.00472

Submission history

From: Ying Guo
[v1] Tue, 1 Jul 2025 06:38:14 UTC (12,570 KB)
