
Computer Science > Robotics

arXiv:2501.03841 (cs)
[Submitted on 7 Jan 2025]

Title: OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints


Authors: Mingjie Pan, Jiyao Zhang, Tianshu Wu, Yinghao Zhao, Wenlong Gao, Hao Dong
Abstract: The development of general robotic systems capable of manipulation in unstructured environments is a significant challenge. While Vision-Language Models (VLMs) excel at high-level commonsense reasoning, they lack the fine-grained 3D spatial understanding required for precise manipulation tasks. Fine-tuning VLMs on robotic datasets to create Vision-Language-Action Models (VLAs) is a potential solution, but it is hindered by high data-collection costs and generalization issues. To address these challenges, we propose a novel object-centric representation that bridges the gap between a VLM's high-level reasoning and the low-level precision required for manipulation. Our key insight is that an object's canonical space, defined by its functional affordances, provides a structured and semantically meaningful way to describe interaction primitives, such as points and directions. These primitives act as a bridge, translating the VLM's commonsense reasoning into actionable 3D spatial constraints. In this context, we introduce a dual closed-loop, open-vocabulary robotic manipulation system: one loop for high-level planning through primitive resampling, interaction rendering, and VLM checking, and another for low-level execution via 6D pose tracking. This design ensures robust, real-time control without requiring VLM fine-tuning. Extensive experiments demonstrate strong zero-shot generalization across diverse robotic manipulation tasks, highlighting the potential of this approach for automating large-scale simulation data generation.
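The dual closed-loop design described in the abstract can be made concrete with a brief, illustrative sketch. The Python below is not the authors' implementation; every name in it (plan_loop, execute_loop, InteractionPrimitive, and the injected propose / render / vlm_check / track_pose / to_world / step callables) is a hypothetical placeholder standing in for the components the abstract names: primitive resampling, interaction rendering, VLM checking, and 6D pose tracking.

```python
# Illustrative sketch only, under the assumptions stated above; all names are
# hypothetical placeholders, not the paper's actual API.
from dataclasses import dataclass


@dataclass
class InteractionPrimitive:
    """An interaction point and direction in the object's canonical frame."""
    point: tuple       # (x, y, z) in canonical object coordinates
    direction: tuple   # unit direction of the intended interaction


def plan_loop(propose, render, vlm_check, max_resamples=5):
    """High-level closed loop: resample a candidate primitive, render the
    proposed interaction, and accept it only if the VLM check passes."""
    for _ in range(max_resamples):
        candidate = propose()            # primitive resampling
        rendering = render(candidate)    # interaction rendering
        if vlm_check(rendering):         # VLM checking
            return candidate
    raise RuntimeError("no candidate primitive passed VLM verification")


def execute_loop(primitive, track_pose, to_world, step):
    """Low-level closed loop: re-anchor the primitive using the tracked 6D
    object pose and servo toward the resulting spatial constraint."""
    done = False
    while not done:
        pose = track_pose()                       # 6D pose tracking
        constraint = to_world(primitive, pose)    # canonical frame -> world frame
        done = step(constraint)                   # one real-time control step
```

Passing the components in as callables keeps the sketch self-contained while mirroring the separation the abstract draws between the outer planning loop (which needs no VLM fine-tuning) and the inner execution loop driven by pose tracking.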
Subjects: Robotics (cs.RO)
Cite as: arXiv:2501.03841 [cs.RO]
  (or arXiv:2501.03841v1 [cs.RO] for this version)
  https://doi.org/10.48550/arXiv.2501.03841
arXiv-issued DOI via DataCite

Submission history

From: Jiyao Zhang
[v1] Tue, 7 Jan 2025 14:50:33 UTC (13,552 KB)