CLAW: A Vision-Language-Action Framework for Weight-Aware Robotic Grasping

An, Zijian; Yang, Ran; Feng, Yiming; Zhou, Lifeng

Computer Science > Robotics

arXiv:2509.14143 (cs)

[Submitted on 17 Sep 2025]

Title: CLAW: A Vision-Language-Action Framework for Weight-Aware Robotic Grasping

Authors: Zijian An, Ran Yang, Yiming Feng, Lifeng Zhou

Abstract: Vision-language-action (VLA) models have recently emerged as a promising paradigm for robotic control, enabling end-to-end policies that ground natural language instructions into visuomotor actions. However, current VLAs often struggle to satisfy precise task constraints, such as stopping based on numeric thresholds, since their observation-to-action mappings are implicitly shaped by training data and lack explicit mechanisms for condition monitoring. In this work, we propose CLAW (CLIP-Language-Action for Weight), a framework that decouples condition evaluation from action generation. CLAW leverages a fine-tuned CLIP model as a lightweight prompt generator, which continuously monitors the digital readout of a scale and produces discrete directives based on task-specific weight thresholds. These prompts are then consumed by $\pi_0$, a flow-based VLA policy, which integrates the prompts with multi-view camera observations to produce continuous robot actions. This design enables CLAW to combine symbolic weight reasoning with high-frequency visuomotor control. We validate CLAW on three experimental setups: single-object grasping and mixed-object tasks requiring dual-arm manipulation. Across all conditions, CLAW reliably executes weight-aware behaviors and outperforms both raw-$\pi_0$ and fine-tuned $\pi_0$ models. We have uploaded the videos as supplementary materials.
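The decoupling described in the abstract (a monitor that turns the scale reading into a discrete directive, and a separate policy that consumes that directive) can be illustrated with a minimal sketch. This is not the authors' implementation: the real system uses a fine-tuned CLIP model to read the scale display from images and $\pi_0$ to produce continuous actions. Here the reading is assumed to be already available as a number, and the directive strings, the `>=` threshold rule, and the names `directive_for`, `run_episode`, and `dummy_policy` are all hypothetical stand-ins.

```python
from typing import Callable, Iterable, List

# Hypothetical directive strings; the paper's actual prompts are not given here.
CONTINUE = "continue grasping"
STOP = "stop"


def directive_for(weight_g: float, target_g: float) -> str:
    """Condition evaluation: map a scale reading to a discrete directive.

    Assumption: the task is 'add objects until the target weight is reached',
    so the directive flips to STOP once the reading meets the threshold.
    """
    return STOP if weight_g >= target_g else CONTINUE


def run_episode(
    readings: Iterable[float],
    target_g: float,
    policy: Callable[[str], str],
) -> List[str]:
    """Action generation loop: the policy only ever sees the directive.

    `policy` stands in for the VLA policy; a real one would also take
    multi-view camera observations and return continuous actions.
    """
    actions: List[str] = []
    for weight_g in readings:
        prompt = directive_for(weight_g, target_g)
        actions.append(policy(prompt))
        if prompt == STOP:
            break  # the stop condition was decided outside the policy
    return actions


def dummy_policy(prompt: str) -> str:
    """Toy policy that maps each directive to a symbolic action."""
    return "halt" if prompt == STOP else "grasp"


if __name__ == "__main__":
    # Simulated scale readings in grams with a 100 g target.
    print(run_episode([0, 40, 90, 120, 150], 100, dummy_policy))
```

The point of the sketch is only the division of labor: the numeric comparison lives in `directive_for`, so the policy never has to learn a threshold implicitly from its training data.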

Comments:	8 pages, 5 figures, 1 table
Subjects:	Robotics (cs.RO)
MSC classes:	68T40
Cite as:	arXiv:2509.14143 [cs.RO]
	(or arXiv:2509.14143v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2509.14143

Submission history

From: Zijian An
[v1] Wed, 17 Sep 2025 16:22:25 UTC (10,062 KB)
