BURN: Backdoor Unlearning via Adversarial Boundary Analysis

Su, Yanghao; Zhang, Jie; Li, Yiming; Zhang, Tianwei; Guo, Qing; Zhang, Weiming; Yu, Nenghai; Lukas, Nils; Zhou, Wenbo

Computer Science > Cryptography and Security

arXiv:2507.10491 (cs)

[Submitted on 14 Jul 2025]


Title: BURN: Backdoor Unlearning via Adversarial Boundary Analysis

Authors: Yanghao Su, Jie Zhang, Yiming Li, Tianwei Zhang, Qing Guo, Weiming Zhang, Nenghai Yu, Nils Lukas, Wenbo Zhou


Abstract: Backdoor unlearning aims to remove backdoor-related information while preserving the model's original functionality. However, existing unlearning methods mainly focus on recovering trigger patterns but fail to restore the correct semantic labels of poisoned samples. This limitation prevents them from fully eliminating the false correlation between the trigger pattern and the target label. To address this, we leverage boundary adversarial attack techniques, revealing two key observations. First, poisoned samples lie significantly farther from decision boundaries than clean samples, indicating that they require larger adversarial perturbations to change their predictions. Second, while the adversarial predicted labels of clean samples are uniformly distributed, those of poisoned samples tend to revert to their original correct labels. Moreover, after adversarial perturbations are added, the features of poisoned samples are restored to closely resemble those of the corresponding clean samples. Building upon these insights, we propose Backdoor Unlearning via adversaRial bouNdary analysis (BURN), a novel defense framework that integrates false correlation decoupling, progressive data refinement, and model purification. In the first phase, BURN employs adversarial boundary analysis to detect poisoned samples based on their abnormal adversarial boundary distances, then restores their correct semantic labels for fine-tuning. In the second phase, it employs a feedback mechanism that tracks prediction discrepancies between the original backdoored model and progressively sanitized models, guiding both dataset refinement and model purification. Extensive evaluations across multiple datasets, architectures, and seven diverse backdoor attack types confirm that BURN effectively removes backdoor threats while maintaining the model's original performance.
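The first phase described above ranks training samples by their distance to the decision boundary and reads off the label an adversarial perturbation pushes them to. The block below is only a minimal sketch of that idea under strong simplifying assumptions, not the authors' implementation. It assumes a toy affine classifier f(x) = Wx + b, for which the minimal L2 perturbation to the nearest pairwise boundary has the closed form DeepFool uses for affine models. The paper instead runs boundary adversarial attacks on deep networks. The outlier rule (median plus k times the median absolute deviation) and the names `boundary_distance`, `flag_and_relabel`, and `k` are hypothetical choices made for illustration.

```python
import math
from statistics import median
from typing import List, Tuple

Vector = List[float]


def boundary_distance(W: List[Vector], b: Vector, x: Vector) -> Tuple[float, int, int]:
    """Return (distance, predicted_label, adversarial_label) for an affine classifier.

    For f(x) = W x + b, the smallest L2 step that makes class j tie with the
    predicted class p is (f_p - f_j) / ||w_p - w_j||; the minimum over j is the
    distance to the nearest decision boundary, and that j is the adversarial label.
    """
    logits = [sum(wi * xi for wi, xi in zip(w, x)) + bi for w, bi in zip(W, b)]
    pred = max(range(len(logits)), key=logits.__getitem__)
    best_dist, best_label = math.inf, pred
    for j in range(len(logits)):
        if j == pred:
            continue
        diff = [wp - wj for wp, wj in zip(W[pred], W[j])]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            continue  # identical weight rows: no boundary between pred and j
        dist = (logits[pred] - logits[j]) / norm
        if dist < best_dist:
            best_dist, best_label = dist, j
    return best_dist, pred, best_label


def flag_and_relabel(W: List[Vector], b: Vector, samples: List[Vector],
                     k: float = 3.0) -> List[Tuple[int, int, int]]:
    """Hypothetical detection rule: flag samples whose boundary distance is an
    upper outlier (robust z-score above k) and propose the adversarial label as
    the candidate restored label. Returns (index, predicted, proposed_label)."""
    results = [boundary_distance(W, b, x) for x in samples]
    dists = [r[0] for r in results]
    med = median(dists)
    mad = median([abs(d - med) for d in dists]) or 1e-12  # avoid division by zero
    flagged = []
    for i, (d, pred, adv) in enumerate(results):
        if (d - med) / mad > k:
            flagged.append((i, pred, adv))
    return flagged


if __name__ == "__main__":
    # Toy 3-class, 2-D model; the last sample sits far from every boundary.
    W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    b = [0.0, 0.0, 0.0]
    samples = [[1.0, 0.8], [0.9, 1.0], [1.0, 1.2], [1.1, 1.0], [10.0, 8.0]]
    print(flag_and_relabel(W, b, samples))  # expected: [(4, 0, 1)]
```

The median/MAD threshold is used here only because it is robust when a minority of samples are outliers. A real defense would replace the closed-form distance with an iterative attack on the network and tune the threshold empirically.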

Subjects:	Cryptography and Security (cs.CR)
Cite as:	arXiv:2507.10491 [cs.CR]
	(or arXiv:2507.10491v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2507.10491

Submission history

From: Yanghao Su
[v1] Mon, 14 Jul 2025 17:13:06 UTC (1,445 KB)