LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts

Aswal, Darpan; Hudelot, Céline

Computer Science > Computation and Language

arXiv:2508.16325v1 (cs)

[Submitted on 22 Aug 2025]

Authors: Darpan Aswal, Céline Hudelot

Abstract: Large Language Models (LLMs) have found success in a variety of applications; however, their safety remains a concern because of the many types of jailbreaking methods that exist. Despite significant efforts, alignment and safety fine-tuning provide only a limited degree of robustness against jailbreak attacks that covertly mislead LLMs into generating harmful content. This leaves them prone to a number of vulnerabilities, ranging from targeted misuse to accidental profiling of users. This work introduces LLMSymGuard, a novel framework that leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes. By extracting semantically meaningful internal representations, LLMSymGuard enables building symbolic, logical safety guardrails, offering transparent and robust defenses without sacrificing model capabilities or requiring further fine-tuning. Leveraging advances in mechanistic interpretability of LLMs, our approach demonstrates that LLMs learn human-interpretable concepts from jailbreaks, and provides a foundation for designing more interpretable and logical safeguard measures against attackers. Code will be released upon publication.
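The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: encode an internal activation with an SAE, read off which labeled features fire, and apply a symbolic rule over those concepts. Everything here is an assumption made for illustration, including the ReLU encoder form, the toy weights, the threshold, the concept names (`weapons`, `roleplay_framing`, `obfuscation`), and the rule itself; none of it is taken from the paper or its unreleased code.

```python
from typing import Dict, List, Set


def sae_encode(activation: List[float],
               weights: List[List[float]],
               biases: List[float]) -> List[float]:
    """ReLU(W x + b): one common form of a sparse-autoencoder encoder (assumed)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, activation)) + b)
        for row, b in zip(weights, biases)
    ]


def active_concepts(features: List[float],
                    labels: Dict[int, str],
                    threshold: float) -> Set[str]:
    """Names of the labeled features whose activation exceeds the threshold."""
    return {name for idx, name in labels.items() if features[idx] > threshold}


def guardrail_blocks(concepts: Set[str]) -> bool:
    """Hypothetical symbolic rule: block when a harmful-topic concept
    co-occurs with an evasion-style concept."""
    harmful = "weapons" in concepts
    evasive = "roleplay_framing" in concepts or "obfuscation" in concepts
    return harmful and evasive


# Toy encoder parameters and feature labels (all hypothetical).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
B = [-0.5, -0.5, -0.5]
LABELS = {0: "weapons", 1: "roleplay_framing", 2: "obfuscation"}

# Two toy activations: one with only a framing feature, one with both.
benign = active_concepts(sae_encode([0.1, 0.9, 0.0], W, B), LABELS, 0.2)
attack = active_concepts(sae_encode([0.9, 0.9, 0.0], W, B), LABELS, 0.2)

print(sorted(benign), guardrail_blocks(benign))
print(sorted(attack), guardrail_blocks(attack))
```

Because the rule is an explicit boolean expression over named concepts, a blocked request can be traced back to the specific concepts that triggered it, which is the kind of transparency the abstract attributes to symbolic guardrails.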

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Cite as:	arXiv:2508.16325 [cs.CL]
	(or arXiv:2508.16325v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2508.16325

Submission history

From: Darpan Aswal
[v1] Fri, 22 Aug 2025 12:13:38 UTC (341 KB)