
Computer Science > Computer Vision and Pattern Recognition

arXiv:2510.00037v2 (cs)
[Submitted on 26 Sep 2025 (v1), last revised 15 Oct 2025 (this version, v2)]

Title: On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations

Authors: Jianing Guo, Zhenhong Wu, Chang Tu, Yiyao Ma, Xiangqi Kong, Zhiqian Liu, Jiaming Ji, Shuning Zhang, Yuanpei Chen, Kai Chen, Xianglong Liu, Qi Dou, Yaodong Yang, Huijie Zhao, Weifeng Lv, Simin Li
Abstract: In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visual-robust VLAs gain no robustness in other modalities, and (3) pi0, with its diffusion-based action head, demonstrates superior robustness. To build multi-modal robust VLAs, we propose RobustVLA, which defends against perturbations in both VLA inputs and outputs. For output robustness, we perform offline robust optimization against the worst-case action noise that maximizes the mismatch in the flow matching objective; this can be interpreted as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound (UCB) algorithm to automatically identify the most harmful noise. Experiments on LIBERO show that RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieves 50.6x faster inference than existing visual-robust VLAs, and yields a 10.4% gain under mixed perturbations. RobustVLA is particularly effective on a real-world FR5 robot with limited demonstrations, showing absolute gains of 65.6% under perturbations across four modalities.
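The abstract describes two training-time mechanisms: an offline search for the worst-case action noise that maximizes the flow matching loss, and a UCB bandit over perturbation types that focuses training on the most harmful noise. The sketch below is a minimal illustration of those two ideas only, not the authors' released code; the names (`UCBPerturbationSelector`, `worst_case_action_noise`, `flow_loss_fn`), the placeholder perturbation list, and the `eps`/`steps`/`lr` values are all assumptions.

```python
import math
import torch

# Illustrative perturbation families (placeholders, not the paper's 17 perturbations).
PERTURBATIONS = ["action_noise", "instruction_paraphrase", "camera_shift", "background_change"]


class UCBPerturbationSelector:
    """UCB1 bandit over perturbation types.

    Each arm is a perturbation family; the reward is the training loss it
    induces, so selection gradually concentrates on the most harmful noise.
    """

    def __init__(self, arms, c=2.0):
        self.arms = list(arms)
        self.c = c
        self.counts = {a: 0 for a in self.arms}
        self.mean_reward = {a: 0.0 for a in self.arms}
        self.t = 0

    def select(self):
        self.t += 1
        # Play every arm once before applying the UCB rule.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        return max(
            self.arms,
            key=lambda a: self.mean_reward[a]
            + self.c * math.sqrt(math.log(self.t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.counts[arm]


def worst_case_action_noise(policy, obs, actions, flow_loss_fn,
                            eps=0.05, steps=3, lr=0.02):
    """Approximate the worst-case action noise inside an L-infinity ball of
    radius `eps` with a few PGD-style gradient-ascent steps on the flow
    matching loss. The paper's offline optimization may differ in detail.
    """
    delta = torch.zeros_like(actions, requires_grad=True)
    for _ in range(steps):
        loss = flow_loss_fn(policy, obs, actions + delta)  # loss to *maximize*
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += lr * grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)     # stay inside the noise budget
    return delta.detach()
```

In a training loop one would ask the selector for a perturbation, apply it together with the worst-case action noise, take a gradient step on the robust loss, and call `selector.update(arm, loss.item())` so the bandit keeps tracking which noise currently hurts most.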
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2510.00037 [cs.CV]
  (or arXiv:2510.00037v2 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2510.00037
arXiv-issued DOI via DataCite

Submission history

From: Jianing Guo
[v1] Fri, 26 Sep 2025 14:42:23 UTC (8,571 KB)
[v2] Wed, 15 Oct 2025 08:40:46 UTC (8,571 KB)