CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation

Li, Hao; Yang, Shuai; Chen, Yilun; Tian, Yang; Yang, Xiaoda; Chen, Xinyi; Wang, Hanqing; Wang, Tai; Zhao, Feng; Lin, Dahua; Pang, Jiangmiao

Computer Science > Robotics

arXiv:2506.19816 (cs)

[Submitted on 24 Jun 2025]

Authors:Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang, Xinyi Chen, Hanqing Wang, Tai Wang, Feng Zhao, Dahua Lin, Jiangmiao Pang

Abstract: Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, because the large vision-language backbone introduces substantial computational cost and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA comprises three key components: (1) single-frame pretraining on large-scale embodied datasets with autoregressive action-token prediction, which establishes an embodied vision-language foundation; (2) multi-frame encoding, which adapts the prediction of the vision-language backbone from discrete action tokens to motion features during post-training and aggregates motion features from historical frames into a feature chunking; (3) cross-frame decoding, which maps the feature chunking to accurate actions via a shared decoder with cross-attention. By reducing redundant token computation and caching past motion features, CronusVLA achieves efficient inference. As an application of motion features, we further propose an action adaptation mechanism based on feature-action retrieval to improve model performance during finetuning. CronusVLA achieves state-of-the-art performance on SimplerEnv with a 70.9% success rate, and a 12.7% improvement over OpenVLA on LIBERO. Real-world Franka experiments also demonstrate its strong performance and robustness.
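The caching and cross-frame decoding ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `MotionFeatureCache`, `toy_encoder`, and `cross_attend` are hypothetical names, the encoder is a trivial stand-in for the vision-language backbone, and the attention step is plain single-head scaled dot-product attention without learned projections, used here only as a stand-in for a decoder with cross-attention. The choice of the newest feature as the query is also an assumption. The sketch shows that each control step encodes only the newest frame, while features of earlier frames are reused from a rolling buffer.

```python
import math
from collections import deque
from typing import Callable, Deque, List, Sequence

Feature = List[float]


class MotionFeatureCache:
    """Rolling buffer of per-frame motion features (hypothetical helper)."""

    def __init__(self, encode_frame: Callable[[Sequence[float]], Feature],
                 history: int) -> None:
        if history < 1:
            raise ValueError("history must be >= 1")
        self.encode_frame = encode_frame  # stand-in for the expensive backbone
        self.buffer: Deque[Feature] = deque(maxlen=history)
        self.backbone_calls = 0

    def step(self, frame: Sequence[float]) -> List[Feature]:
        """Encode only the newest frame and return the current feature chunk."""
        self.backbone_calls += 1
        self.buffer.append(self.encode_frame(frame))
        return list(self.buffer)  # oldest first, newest last


def toy_encoder(frame: Sequence[float]) -> Feature:
    # Placeholder features (mean and range); not the paper's model.
    return [sum(frame) / len(frame), max(frame) - min(frame)]


def cross_attend(query: Feature, chunk: List[Feature]) -> Feature:
    """Scaled dot-product attention of one query over the cached chunk."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, feat)) / scale for feat in chunk]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the cached features (values are the features themselves).
    return [sum(w * feat[i] for w, feat in zip(weights, chunk))
            for i in range(len(query))]


cache = MotionFeatureCache(toy_encoder, history=3)
frames = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0], [4.0, 8.0]]
chunks = [cache.step(frame) for frame in frames]

# Assumption: the newest feature acts as the query over the whole chunk.
fused = cross_attend(chunks[-1][-1], chunks[-1])
```

With `history=3`, the fourth call to `step` drops the oldest feature automatically, so the chunk never grows beyond three entries and the backbone stand-in runs once per frame rather than once per frame in the history window.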

Comments:	36 pages, 21 figures
Subjects:	Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2506.19816 [cs.RO]
 	(or arXiv:2506.19816v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2506.19816

Submission history

From: Hao Li
[v1] Tue, 24 Jun 2025 17:30:27 UTC (15,608 KB)
