PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training

Du, Pengfei

Computer Science > Cryptography and Security

arXiv:2507.14202 (cs)

[Submitted on 14 Jul 2025]

Title: PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training

Authors: Pengfei Du

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, yet they pose significant security risks that threaten their safe deployment in critical domains. Current security alignment methodologies predominantly rely on Process Reward Models (PRMs) to evaluate intermediate reasoning steps, introducing substantial computational overhead and scalability constraints. This paper presents a novel PRM-free security alignment framework that leverages automated red teaming and adversarial training to achieve robust security guarantees while maintaining computational efficiency. Our approach systematically identifies vulnerabilities through sophisticated attack strategies including genetic algorithm optimization, multi-agent simulation, and advanced prompt mutation techniques. The framework enhances model robustness via targeted adversarial training with curriculum learning and adaptive regularization mechanisms. Comprehensive experimental evaluation across five state-of-the-art LLMs demonstrates that our method achieves superior security alignment performance compared to PRM-based approaches while reducing computational costs by 61%. The framework incorporates transparent reporting and continuous audit mechanisms that enable iterative security improvement and regulatory compliance. Our contributions advance the field of efficient LLM security alignment by democratizing access to robust security measures for resource-constrained organizations and providing a scalable foundation for addressing evolving adversarial threats.

Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2507.14202 [cs.CR]
	(or arXiv:2507.14202v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2507.14202

Submission history

From: Eric Du
[v1] Mon, 14 Jul 2025 17:41:12 UTC (65 KB)
