IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Lu, Xiaoya; Chen, Zeren; Hu, Xuhao; Zhou, Yijin; Zhang, Weichen; Liu, Dongrui; Sheng, Lu; Shao, Jing

arXiv:2506.16402 (cs)

[Submitted on 19 Jun 2025]

Title: IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Authors: Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao

Abstract: Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems.
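
The process-oriented check described in the abstract (verifying that a mitigation action occurs before or after a risk-prone step) can be illustrated with a minimal Python sketch. This is not the benchmark's actual implementation: the `SafetyCondition` type, the function names, the string encoding of actions, and the stove example are all hypothetical, chosen only to show the ordering idea.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass(frozen=True)
class SafetyCondition:
    """Hypothetical ordering constraint (not the IS-Bench schema):
    `mitigation` must occur before or after `risky_step`."""
    mitigation: str
    risky_step: str
    order: Literal["before", "after"]


def condition_satisfied(trajectory: List[str], cond: SafetyCondition) -> bool:
    """Check one ordering constraint against an executed action sequence."""
    if cond.risky_step not in trajectory:
        # The risk-prone step never ran, so the constraint is not triggered.
        return True
    risky_idx = trajectory.index(cond.risky_step)
    if cond.order == "before":
        # Mitigation must appear strictly earlier than the risky step.
        return cond.mitigation in trajectory[:risky_idx]
    # Mitigation must appear strictly later than the risky step.
    return cond.mitigation in trajectory[risky_idx + 1:]


def safe_fraction(trajectory: List[str], conditions: List[SafetyCondition]) -> float:
    """Fraction of triggered constraints that the trajectory satisfies."""
    triggered = [c for c in conditions if c.risky_step in trajectory]
    if not triggered:
        return 1.0
    satisfied = sum(condition_satisfied(trajectory, c) for c in triggered)
    return satisfied / len(triggered)


# Illustrative (made-up) episode: the agent turns the stove off afterwards,
# but never moves the towel away before turning the stove on.
trajectory = ["turn_on(stove)", "cook(egg)", "turn_off(stove)"]
conditions = [
    SafetyCondition("turn_off(stove)", "turn_on(stove)", "after"),
    SafetyCondition("move_away(towel)", "turn_on(stove)", "before"),
]
print(safe_fraction(trajectory, conditions))
```

Unlike a post-hoc check on the final state, a check of this kind inspects intermediate steps, so an episode that ends in a safe state can still be flagged for an unsafe ordering along the way.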

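Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Cite as:	arXiv:2506.16402 [cs.AI]
	(or arXiv:2506.16402v1 [cs.AI] for this version)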
	https://doi.org/10.48550/arXiv.2506.16402

Submission history

From: Xiaoya Lu
[v1] Thu, 19 Jun 2025 15:34:46 UTC (1,112 KB)
