Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

Xing, Hao; Boey, Kai Zhe; Wu, Yuankai; Burschka, Darius; Cheng, Gordon

Computer Science > Computer Vision and Pattern Recognition

arXiv:2507.00752 (cs)

[Submitted on 1 Jul 2025]

Title: Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

Authors: Hao Xing, Kai Zhe Boey, Yuankai Wu, Darius Burschka, Gordon Cheng

Abstract: Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy that maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, a temporal graph fusion module that aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.
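The abstract reports F1@10 and F1@25 without defining them. As a rough illustration, the sketch below assumes these are the segmental F1 scores commonly used in action segmentation, where a predicted segment counts as correct if its intersection-over-union with an unmatched ground-truth segment of the same class reaches the threshold k%. The function names `segments` and `f1_at_k` and the toy label sequences are hypothetical and are not taken from the paper. The example shows why this metric is sensitive to over-segmentation: a fragmented prediction can keep high frame-wise accuracy while its segmental F1 drops sharply.

```python
def segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs; end is exclusive."""
    runs = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs


def f1_at_k(pred, truth, k):
    """Segmental F1 at an IoU threshold of k percent (assumed definition)."""
    pred_segs, true_segs = segments(pred), segments(truth)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (tl, ts, te) in enumerate(true_segs):
            if tl != label:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            # Span of both segments; equals the true union whenever they overlap,
            # and the IoU is 0 anyway when they do not.
            union = max(pe, te) - min(ps, ts)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        # Each ground-truth segment can be claimed by at most one prediction.
        if best_j >= 0 and best_iou >= k / 100 and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = matched.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: two ground-truth actions of 10 frames each.
truth = [0] * 10 + [1] * 10
# A noisy prediction that briefly flips to class 1 inside the first action.
fragmented = [0] * 4 + [1] * 2 + [0] * 4 + [1] * 10

frame_acc = sum(p == t for p, t in zip(fragmented, truth)) / len(truth)
print(frame_acc)                                # 0.9
print(round(f1_at_k(fragmented, truth, 25), 3)) # 0.667
print(f1_at_k(truth, truth, 25))                # 1.0
```

In this toy case only 2 of 20 frames are wrong, yet two of the four predicted segments count as false positives, which is the kind of fragmentation the abstract says the method is designed to reduce.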

Comments:	7 pages, 4 figures, accepted at IROS25, Hangzhou, China
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Cite as:	arXiv:2507.00752 [cs.CV]
	(or arXiv:2507.00752v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2507.00752

Submission history

From: Hao Xing
[v1] Tue, 1 Jul 2025 13:55:57 UTC (3,679 KB)