HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception

Yao, Wei; Sun, Yunlian; Zhang, Hongwen; Liu, Yebin; Tang, Jinhui

Computer Science > Computer Vision and Pattern Recognition

arXiv:2506.01579 (cs)

[Submitted on 2 Jun 2025]

Title: HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception

Authors: Wei Yao, Yunlian Sun, Hongwen Zhang, Yebin Liu, Jinhui Tang

Abstract: Generating high-fidelity full-body human interactions with dynamic objects and static scenes remains a critical challenge in computer graphics and animation. Existing methods for human-object interaction often neglect scene context, leading to implausible penetrations, while human-scene interaction approaches struggle to coordinate fine-grained manipulations with long-range navigation. To address these limitations, we propose HOSIG, a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our method decouples the task into three key components: 1) a scene-aware grasp pose generator that ensures collision-free whole-body postures with precise hand-object contact by integrating local geometry constraints, 2) a heuristic navigation algorithm that autonomously plans obstacle-avoiding paths in complex indoor environments via compressed 2D floor maps and dual-component spatial reasoning, and 3) a scene-guided motion diffusion model that generates trajectory-controlled, full-body motions with finger-level accuracy by incorporating spatial anchors and dual-space classifier-free guidance. Extensive experiments on the TRUMANS dataset demonstrate superior performance over state-of-the-art methods. Notably, our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation, advancing the frontier of embodied interaction synthesis. Code will be available after publication. Project page: http://yw0208.github.io/hosig
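As an illustration of the second component, the sketch below plans an obstacle-avoiding path on a 2D occupancy grid. The abstract does not say which search procedure HOSIG uses, so standard A* with a Manhattan-distance heuristic is swapped in here as a stand-in. The function name `plan_path`, the grid encoding (1 = obstacle, 0 = free), and 4-connected movement are all assumptions made for this example, not details from the paper.

```python
import heapq


def plan_path(grid, start, goal):
    """A* on a 2D occupancy grid (hypothetical stand-in, not the paper's planner).

    grid: list of rows, where 1 marks an obstacle cell and 0 a free cell.
    start, goal: (row, col) tuples. Returns a list of cells or None.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for 4-connected, unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    cost_so_far = {start: 0}

    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if current == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        if cost > cost_so_far[current]:
            continue  # stale heap entry; a cheaper route was already found
        r, c = current
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc]:
                continue  # outside the map or blocked
            new_cost = cost + 1
            if new_cost < cost_so_far.get(nxt, float("inf")):
                cost_so_far[nxt] = new_cost
                came_from[nxt] = current
                heapq.heappush(
                    open_heap, (new_cost + heuristic(nxt), new_cost, nxt)
                )
    return None  # goal is unreachable
```

Calling `plan_path(grid, (0, 0), (2, 0))` on a small grid returns the sequence of cells from start to goal, or `None` when obstacles cut the goal off. In a pipeline like the one described, such waypoints would presumably serve as the trajectory that the motion model is asked to follow, though the abstract does not spell out that interface.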

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2506.01579 [cs.CV]
	(or arXiv:2506.01579v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.01579

Submission history

From: Wei Yao
[v1] Mon, 2 Jun 2025 12:08:08 UTC (3,856 KB)