A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Zhong, Yifan; Bai, Fengshuo; Cai, Shaofei; Huang, Xuchuan; Chen, Zhang; Zhang, Xiaowei; Wang, Yuanfei; Guo, Shaoyang; Guan, Tianrui; Lui, Ka Nam; Qi, Zhiquan; Liang, Yitao; Chen, Yuanpei; Yang, Yaodong

Computer Science > Robotics

arXiv:2507.01925 (cs)

[Submitted on 2 Jul 2025]

Title: A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Authors: Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang

Abstract: The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of action tokens that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.

Comments:	70 pages, 5 figures
Subjects:	Robotics (cs.RO)
Cite as:	arXiv:2507.01925 [cs.RO]
 	(or arXiv:2507.01925v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2507.01925

Submission history

From: Yuanpei Chen
[v1] Wed, 2 Jul 2025 17:34:52 UTC (16,319 KB)
