"I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models

Gupta, Isha; Khachaturov, David; Mullins, Robert

arXiv:2502.00718v1 (cs)

[Submitted on 2 Feb 2025 (this version), latest version 10 Jul 2025 (v2)]

Abstract: The rise of multimodal large language models has introduced innovative human-machine interaction paradigms but also significant challenges in machine learning safety. Audio-Language Models (ALMs) are especially relevant due to the intuitive nature of spoken communication, yet little is known about their failure modes. This paper explores audio jailbreaks targeting ALMs, focusing on their ability to bypass alignment mechanisms. We construct adversarial perturbations that generalize across prompts, tasks, and even base audio samples, demonstrating the first universal jailbreaks in the audio modality, and show that these remain effective in simulated real-world conditions. Beyond demonstrating attack feasibility, we analyze how ALMs interpret these audio adversarial examples and reveal that they encode imperceptible first-person toxic speech, suggesting that the most effective perturbations for eliciting toxic outputs specifically embed linguistic features within the audio signal. These results have important implications for understanding the interactions between different modalities in multimodal models, and offer actionable insights for enhancing defenses against adversarial audio attacks.

Subjects:	Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2502.00718 [cs.LG]
	(or arXiv:2502.00718v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2502.00718

Submission history

From: Isha Gupta
[v1] Sun, 2 Feb 2025 08:36:23 UTC (1,762 KB)
[v2] Thu, 10 Jul 2025 14:44:44 UTC (1,172 KB)
