Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models

Wu, Sihao; Jin, Gaojie; Huang, Wei; Wang, Jianhong; Huang, Xiaowei

arXiv:2509.00373 (cs)

[Submitted on 30 Aug 2025]

Abstract: Vision Language Models (VLMs) have demonstrated impressive capabilities in integrating visual and textual information for understanding and reasoning, but remain highly vulnerable to adversarial attacks. While activation steering has emerged as a promising defense, existing approaches often rely on task-specific contrastive prompts to extract harmful directions; these approaches exhibit suboptimal performance and can degrade visual grounding. To address these limitations, we propose Sequence-Level Preference Optimization for VLM (SPO-VLM), a novel two-stage defense framework that combines activation-level intervention with policy-level optimization to enhance model robustness. In Stage I, we compute adaptive layer-specific steering vectors from diverse data sources, enabling generalized suppression of harmful behaviors during inference. In Stage II, we refine these steering vectors through a sequence-level preference optimization process. This stage integrates automated toxicity assessment, as well as visual-consistency rewards based on caption-image alignment, to achieve safe and semantically grounded text generation. The two-stage structure of SPO-VLM balances efficiency and effectiveness by combining a lightweight mitigation foundation in Stage I with deeper policy refinement in Stage II. Extensive experiments show that SPO-VLM enhances safety against attacks via activation steering and preference optimization, while maintaining strong performance on benign tasks without compromising visual understanding capabilities. We will release our code, model weights, and evaluation toolkit to support reproducibility and future research. Warning: This paper may contain examples of offensive or harmful text and images.
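
For readers unfamiliar with activation steering, the general idea can be illustrated with a minimal sketch. This is not the SPO-VLM procedure, whose details the abstract does not give; it shows only the common difference-of-means construction, under the assumption that one layer's activations are available as plain vectors. The function names (`mean_vector`, `steering_vector`, `steer`), the scaling factor `alpha`, and all numbers are hypothetical and chosen for illustration.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(harmful: List[Vector], benign: List[Vector]) -> Vector:
    """Difference of means: points from benign toward harmful activations."""
    mu_h = mean_vector(harmful)
    mu_b = mean_vector(benign)
    return [h - b for h, b in zip(mu_h, mu_b)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state away from the harmful direction by alpha."""
    return [x - alpha * d for x, d in zip(hidden, direction)]


# Toy 3-dimensional activations for a single layer (made-up numbers).
harmful_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
benign_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

v = steering_vector(harmful_acts, benign_acts)
steered = steer([2.0, 1.0, 2.0], v, alpha=1.0)
```

In a real model the same subtraction would typically be applied to a layer's hidden states at inference time, with a separate vector per layer; how those vectors are made adaptive and later refined by preference optimization is specific to the paper and is not reproduced here.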

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2509.00373 [cs.CV]
	(or arXiv:2509.00373v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2509.00373

Submission history

From: Sihao Wu
[v1] Sat, 30 Aug 2025 06:00:53 UTC (3,253 KB)
