Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-Top Manipulation

Zhang, Chuye; Zhang, Xiaoxiong; Pan, Wei; Zheng, Linfang; Zhang, Wei


arXiv:2509.00361 (cs)

[Submitted on 30 Aug 2025]


Authors: Chuye Zhang, Xiaoxiong Zhang, Wei Pan, Linfang Zheng, Wei Zhang

Abstract: Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single side-view RGB image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.
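The closed loop described in the abstract (observe, predict future frames, estimate end-effector poses, execute, repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_closed_loop`, `ToyRobot`, `toy_predict_frames`, and `GOAL` are hypothetical names, the "frames" are plain 3D positions rather than RGB-D images, and the video model and pose estimator are replaced by trivial stand-ins.

```python
import math
from typing import Callable, Iterable, List, Tuple

# Hypothetical simplifications: a pose is a 3D position (no orientation),
# and a "frame" is just the gripper position instead of an RGB-D image.
Pose = Tuple[float, float, float]
Frame = Pose


def run_closed_loop(
    observe: Callable[[], Frame],
    predict_frames: Callable[[Frame, str], Iterable[Frame]],
    estimate_pose: Callable[[Frame], Pose],
    move_to: Callable[[Pose], None],
    is_done: Callable[[Frame], bool],
    task: str,
    max_rounds: int = 20,
) -> List[Pose]:
    """Alternate between visual foresight and pose-based execution."""
    executed: List[Pose] = []
    for _ in range(max_rounds):
        image = observe()                          # fresh observation each round
        if is_done(image):
            break
        for frame in predict_frames(image, task):  # visual plan (stand-in)
            pose = estimate_pose(frame)            # pose from predicted frame
            move_to(pose)                          # low-level controller (stand-in)
            executed.append(pose)
    return executed


class ToyRobot:
    """Stand-in for the robot and camera: the observation is the position."""

    def __init__(self) -> None:
        self.position: Pose = (0.0, 0.0, 0.0)

    def observe(self) -> Frame:
        return self.position

    def move_to(self, pose: Pose) -> None:
        self.position = pose


GOAL: Pose = (0.4, 0.2, 0.1)  # hypothetical target position


def toy_predict_frames(image: Frame, task: str, horizon: int = 2) -> List[Frame]:
    """Stand-in for the video model: each frame halves the distance to GOAL."""
    frames: List[Frame] = []
    current = image
    for _ in range(horizon):
        current = tuple(c + 0.5 * (g - c) for c, g in zip(current, GOAL))
        frames.append(current)
    return frames


def demo() -> Tuple[ToyRobot, List[Pose]]:
    robot = ToyRobot()
    executed = run_closed_loop(
        observe=robot.observe,
        predict_frames=toy_predict_frames,
        estimate_pose=lambda frame: frame,  # identity: frames already are poses
        move_to=robot.move_to,
        is_done=lambda image: math.dist(image, GOAL) < 1e-3,
        task="move the gripper to the goal",
    )
    return robot, executed
```

Calling `demo()` runs the toy loop until the stand-in termination check passes. Re-observing at the start of every round is what makes the loop closed: each new plan starts from the current state rather than from an earlier prediction.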


Comments:	9th Conference on Robot Learning (CoRL 2025), Seoul, Korea
Subjects:	Robotics (cs.RO)
Cite as:	arXiv:2509.00361 [cs.RO]
	(or arXiv:2509.00361v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2509.00361

Submission history

From: Chuye Zhang
[v1] Sat, 30 Aug 2025 04:53:32 UTC (40,485 KB)
