Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs

Fonseca, Joao; Bell, Andrew; Stoyanovich, Julia

Computer Science > Computation and Language

arXiv:2501.02018 (cs)

[Submitted on 2 Jan 2025]

Title: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs

Authors: Joao Fonseca, Andrew Bell, Julia Stoyanovich

Abstract: Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, that is, adversarial attacks used to elicit high-risk behavior from a model. Jailbreaks have been exploited by cybercriminals and black-hat actors to cause significant harm, highlighting the critical need to safeguard widely deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs "self-reflect", may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict "normal" model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we introduce a novel safeguard, called SafeNudge, that combines Controlled Text Generation with "nudging", or using text interventions to change the behavior of a model. SafeNudge triggers during text generation while a jailbreak attack is being executed, and can reduce successful jailbreak attempts by 30% by guiding the LLM towards safe responses. It adds minimal latency to inference and has a negligible impact on the semantic fluency of outputs. Further, we allow for tunable SPTs. SafeNudge is open-source and available through https://pypi.org/, and is compatible with models loaded with the Hugging Face "transformers" library.
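
The idea of triggering a text intervention during generation, with a tunable trade-off, can be illustrated with a toy sketch. This is not SafeNudge's implementation or API, which the abstract does not describe. The function names (`guarded_generate`, `toy_model`, `toy_scorer`), the nudge wording, and the `threshold` parameter are all hypothetical and chosen only for illustration. The loop scores the partial output after each candidate token and, the first time the score reaches the threshold, drops that token and appends a fixed nudge text instead.

```python
from typing import Callable, List

# Hypothetical nudge text; the actual wording used by SafeNudge is not given here.
NUDGE = ["Sorry,", "I", "cannot", "help", "with", "that."]


def guarded_generate(
    next_token: Callable[[List[str]], str],
    unsafe_score: Callable[[List[str]], float],
    threshold: float = 0.5,
    max_tokens: int = 20,
    eos: str = "<eos>",
) -> List[str]:
    """Generate token by token, inserting the nudge at most once
    when the partial output scores at or above `threshold`."""
    output: List[str] = []
    nudged = False
    while len(output) < max_tokens:
        token = next_token(output)
        if token == eos:
            break
        candidate = output + [token]
        if not nudged and unsafe_score(candidate) >= threshold:
            # Drop the risky token and steer the continuation with the nudge.
            output.extend(NUDGE)
            nudged = True
            continue
        output.append(token)
    return output


def toy_model(script: List[str]) -> Callable[[List[str]], str]:
    """Stand-in for a language model: replays a script, and stops after a nudge."""
    def next_token(prefix: List[str]) -> str:
        if prefix and prefix[-1] == NUDGE[-1]:
            return "<eos>"
        i = len(prefix)
        return script[i] if i < len(script) else "<eos>"
    return next_token


def toy_scorer(tokens: List[str]) -> float:
    """Stand-in for a safety classifier: flags a single keyword."""
    return 1.0 if "explosive" in tokens else 0.0


script = ["Here", "is", "how", "to", "build", "an", "explosive", "device"]
print(" ".join(guarded_generate(toy_model(script), toy_scorer, threshold=0.5)))
```

In this sketch, raising `threshold` makes the guard intervene less often, which stands in for the kind of tunable safety-performance trade-off the abstract mentions. A threshold above the scorer's maximum disables the guard entirely.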

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as:	arXiv:2501.02018 [cs.CL]
	(or arXiv:2501.02018v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.02018

Submission history

From: Andrew Bell
[v1] Thu, 2 Jan 2025 15:15:38 UTC (763 KB)
