Computer Science > Robotics

arXiv:2506.01953 (cs)
[Submitted on 2 Jun 2025]

Title: Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning


Authors:Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, Pheng-Ann Heng
Abstract: Policy generalization and execution efficiency are two critical challenges in robotic manipulation. While recent foundation policies benefit from the common-sense reasoning capabilities of internet-scale pretrained vision-language models (VLMs), they often suffer from low execution frequency. To mitigate this dilemma, dual-system approaches inspired by Kahneman's theory have been proposed, leveraging a VLM-based System 2 model for high-level reasoning and a separate System 1 action model for real-time control. However, existing designs maintain the two systems as separate models, preventing System 1 from fully leveraging the rich pretrained knowledge of the VLM-based System 2. In this work, we propose Fast-in-Slow (FiS), a unified dual-system vision-language-action (VLA) model that embeds the System 1 execution module within the VLM-based System 2 by partially sharing parameters. This paradigm not only enables high-frequency execution in System 1 but also facilitates coordination between the reasoning and execution components within a single foundation model. Given their fundamentally distinct roles within FiS-VLA, we design the two systems with heterogeneous modality inputs and asynchronous operating frequencies, enabling both fast and precise manipulation. To coordinate the two systems, we propose a dual-aware co-training strategy that equips System 1 with action-generation capability while preserving System 2's contextual reasoning representation. In evaluation, FiS-VLA outperforms previous state-of-the-art methods by 8% in simulation and 11% in real-world tasks in terms of average success rate, while achieving a 117.7 Hz control frequency with the action chunk size set to eight. Project web page: fast-in-slow.github.io.
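The abstract describes an asynchronous dual-system loop: a slow, VLM-based System 2 refreshes a reasoning latent at a low rate, while a fast System 1 execution head reuses the most recent latent together with fresh proprioception to emit action chunks (size eight in the paper's evaluation) at a high rate. A minimal sketch of that control pattern, assuming illustrative function names and rates that are not the paper's actual implementation:

```python
# Hypothetical sketch of an asynchronous dual-system control loop.
# `system2_reason`, `system1_act`, and the slow/fast rate ratio are
# illustrative stand-ins, not FiS-VLA's real components.

def system2_reason(observation, instruction):
    """Slow path: stand-in for VLM reasoning; produces a latent goal."""
    return {"goal": f"{instruction}@{observation}"}

def system1_act(latent, proprioception, chunk_size=8):
    """Fast path: stand-in for the execution head; emits an action chunk."""
    return [(latent["goal"], proprioception, i) for i in range(chunk_size)]

def control_loop(steps, slow_every=4, chunk_size=8):
    """Run `steps` fast ticks; System 2 refreshes the latent every
    `slow_every` ticks, and System 1 reuses it in between."""
    latent = None
    executed = []
    for t in range(steps):
        if t % slow_every == 0:  # System 2 fires at a fraction of the rate
            latent = system2_reason(observation=t, instruction="pick cube")
        # System 1 always uses the freshest proprioception (here, just t)
        executed.append(system1_act(latent, proprioception=t,
                                    chunk_size=chunk_size))
    return executed

chunks = control_loop(steps=8)
```

The key property the sketch shows is decoupled frequencies: System 1 ticks every step, while the latent it conditions on only changes on System 2's slower schedule.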
Subjects: Robotics (cs.RO)
Cite as: arXiv:2506.01953 [cs.RO]
  (or arXiv:2506.01953v1 [cs.RO] for this version)
  https://doi.org/10.48550/arXiv.2506.01953
arXiv-issued DOI via DataCite

Submission history

From: Hao Chen
[v1] Mon, 2 Jun 2025 17:59:51 UTC (6,143 KB)