Computer Science > Robotics

arXiv:2506.01953 (cs)
[Submitted on 2 Jun 2025]

Title: Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning


Authors:Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, Pheng-Ann Heng
Abstract: Policy generalization and execution efficiency are two critical challenges in robotic manipulation. While recent foundation policies benefit from the common-sense reasoning capabilities of internet-scale pretrained vision-language models (VLMs), they often suffer from low execution frequency. To mitigate this dilemma, dual-system approaches inspired by Kahneman's theory have been proposed, leveraging a VLM-based System 2 model for high-level reasoning and a separate System 1 action model for real-time control. However, existing designs maintain the two systems as separate models, preventing System 1 from fully leveraging the rich pretrained knowledge of the VLM-based System 2. In this work, we propose Fast-in-Slow (FiS), a unified dual-system vision-language-action (VLA) model that embeds the System 1 execution module within the VLM-based System 2 by partially sharing parameters. This paradigm not only enables high-frequency execution in System 1 but also facilitates coordination between the reasoning and execution components within a single foundation model. Given their fundamentally distinct roles within FiS-VLA, we design the two systems with heterogeneous modality inputs and asynchronous operating frequencies, enabling both fast and precise manipulation. To coordinate the two systems, we propose a dual-aware co-training strategy that equips System 1 with action-generation capability while preserving System 2's contextual reasoning representation. In evaluation, FiS-VLA outperforms previous state-of-the-art methods by 8% in simulation and 11% in real-world tasks in terms of average success rate, while achieving a 117.7 Hz control frequency with the action chunk size set to eight. Project web page: fast-in-slow.github.io.
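The abstract describes an asynchronous dual-system loop: a slow, VLM-based System 2 refreshes a reasoning latent at a low rate, while a fast System 1 execution head reuses the most recent latent together with fresh proprioception to emit action chunks (size eight in the paper's evaluation) at a high rate. A minimal sketch of that control pattern, assuming illustrative function names and rates that are not the paper's actual implementation:

```python
# Hypothetical sketch of an asynchronous dual-system control loop.
# `system2_reason`, `system1_act`, and the slow/fast rate ratio are
# illustrative stand-ins, not FiS-VLA's real components.

def system2_reason(observation, instruction):
    """Slow path: stand-in for VLM reasoning; produces a latent goal."""
    return {"goal": f"{instruction}@{observation}"}

def system1_act(latent, proprioception, chunk_size=8):
    """Fast path: stand-in for the execution head; emits an action chunk."""
    return [(latent["goal"], proprioception, i) for i in range(chunk_size)]

def control_loop(steps, slow_every=4, chunk_size=8):
    """Run `steps` fast ticks; System 2 refreshes the latent every
    `slow_every` ticks, and System 1 reuses it in between."""
    latent = None
    executed = []
    for t in range(steps):
        if t % slow_every == 0:  # System 2 fires at a fraction of the rate
            latent = system2_reason(observation=t, instruction="pick cube")
        # System 1 always uses the freshest proprioception (here, just t)
        executed.append(system1_act(latent, proprioception=t,
                                    chunk_size=chunk_size))
    return executed

chunks = control_loop(steps=8)
```

The key property the sketch shows is decoupled frequencies: System 1 ticks every step, while the latent it conditions on only changes on System 2's slower schedule.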
Subjects: Robotics (cs.RO)
Cite as: arXiv:2506.01953 [cs.RO]
  (or arXiv:2506.01953v1 [cs.RO] for this version)
  https://doi.org/10.48550/arXiv.2506.01953
arXiv-issued DOI via DataCite

Submission history

From: Hao Chen
[v1] Mon, 2 Jun 2025 17:59:51 UTC (6,143 KB)