Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models

Jiang, Peihai; Lyu, Xixiang; Li, Yige; Ma, Jing

Computer Science > Cryptography and Security

arXiv:2501.03272v1 (cs)

[Submitted on 5 Jan 2025]

Authors: Peihai Jiang, Xixiang Lyu, Yige Li, Jing Ma

Abstract: Supervised fine-tuning has become the predominant method for adapting large pretrained models to downstream tasks. However, recent studies have revealed that these models are vulnerable to backdoor attacks, where even a small number of malicious samples can successfully embed backdoor triggers into the model. While most existing defense methods focus on post-training backdoor defense, efficiently defending against backdoor attacks during the training phase remains largely unexplored. To address this gap, we propose a novel defense method called Backdoor Token Unlearning (BTU), which proactively detects and neutralizes trigger tokens during the training stage. Our work is based on two key findings: 1) backdoor learning causes distinctive differences between backdoor token parameters and clean token parameters in word embedding layers, and 2) the success of backdoor attacks heavily depends on backdoor token parameters. The BTU defense leverages these properties to identify aberrant embedding parameters and subsequently removes backdoor behaviors using a fine-grained unlearning technique. Extensive evaluations across three datasets and four types of backdoor attacks demonstrate that BTU effectively defends against these threats while preserving the model's performance on primary tasks. Our code is available at https://github.com/XDJPH/BTU.
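The abstract does not say how aberrant embedding parameters are scored or how the unlearning step is carried out, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes, hypothetically, that a token is "aberrant" when its embedding row drifts unusually far from its pretrained value during fine-tuning (scored with a robust z-score), and that "unlearning" means resetting only those rows. The function names, the threshold, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def flag_aberrant_tokens(emb_before, emb_after, z_threshold=6.0):
    """Return ids of tokens whose embedding moved unusually far.

    Assumption: drift from the pretrained embedding is used as the
    anomaly score; the paper may use a different criterion.
    """
    drift = np.linalg.norm(emb_after - emb_before, axis=1)
    median = np.median(drift)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = np.median(np.abs(drift - median)) + 1e-12
    robust_z = 0.6745 * (drift - median) / mad
    return np.flatnonzero(robust_z > z_threshold)


def unlearn_tokens(emb_before, emb_after, token_ids):
    """Reset only the flagged rows to their pretrained values.

    Assumption: a row reset stands in for the paper's fine-grained
    unlearning technique, which the abstract does not describe.
    """
    repaired = emb_after.copy()
    repaired[token_ids] = emb_before[token_ids]
    return repaired


# Synthetic demonstration: small updates everywhere, one large update
# on a hypothetical trigger token.
rng = np.random.default_rng(0)
vocab_size, dim = 1000, 16
emb_before = rng.normal(size=(vocab_size, dim))
emb_after = emb_before + rng.normal(scale=0.01, size=(vocab_size, dim))
trigger_id = 42
emb_after[trigger_id] += 1.0

flagged = flag_aberrant_tokens(emb_before, emb_after)
repaired = unlearn_tokens(emb_before, emb_after, flagged)
```

Touching only the flagged rows leaves every other embedding as fine-tuning produced it, which mirrors the abstract's claim that the defense preserves performance on the primary task.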

Comments:	AAAI 2025
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2501.03272 [cs.CR]
	(or arXiv:2501.03272v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2501.03272

Submission history

From: Peihai Jiang
[v1] Sun, 5 Jan 2025 03:22:13 UTC (3,054 KB)
