BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models

Wang, Zihan; Li, Hongwei; Zhang, Rui; Jiang, Wenbo; Chen, Kangjie; Zhang, Tianwei; Zhao, Qingchuan; Xu, Guowen

arXiv:2505.03501 (cs)

[Submitted on 6 May 2025]

Authors: Zihan Wang, Hongwei Li, Rui Zhang, Wenbo Jiang, Kangjie Chen, Tianwei Zhang, Qingchuan Zhao, Guowen Xu

Abstract: In this paper, we present a new form of backdoor attack against Large Language Models (LLMs): the lingual-backdoor attack. Its key novelty is that the language itself serves as the trigger that hijacks an infected LLM into generating inflammatory speech. Such attacks enable malicious entities to target a specific language-speaking group precisely, exacerbating racial discrimination. We first implement a baseline lingual-backdoor attack, which poisons a set of training data for specific downstream tasks by translating it into the trigger language. However, this baseline attack generalizes poorly across tasks and is impractical in real-world settings. To address this challenge, we design BadLingual, a novel task-agnostic lingual-backdoor that can be triggered in any downstream task within chat LLMs, regardless of the specific questions these tasks pose. We design a new approach that uses adversarial training based on perplexity (PPL)-constrained Greedy Coordinate Gradient-based Search (PGCG) to expand the decision boundary of the lingual-backdoor, thereby enhancing its generalization across tasks. We perform extensive experiments to validate the effectiveness of the proposed attacks. Specifically, the baseline attack achieves an attack success rate (ASR) of over 90% on the specified tasks. However, its ASR reaches only 37.61% across six tasks in the task-agnostic scenario. In contrast, BadLingual brings up to 37.35% improvement over the baseline. Our study sheds light on a new perspective on vulnerabilities in LLMs with multilingual capabilities and is expected to promote future research on potential defenses that enhance LLM robustness.

Subjects:	Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Cite as:	arXiv:2505.03501 [cs.CR]
	(or arXiv:2505.03501v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2505.03501

Submission history

From: Zihan Wang
[v1] Tue, 6 May 2025 13:07:57 UTC (1,429 KB)
