Depth Gives a False Sense of Privacy: LLM Internal States Inversion

Dong, Tian; Meng, Yan; Li, Shaofeng; Chen, Guoxing; Liu, Zhen; Zhu, Haojin

arXiv:2507.16372 (cs)

[Submitted on 22 Jul 2025]

Title: Depth Gives a False Sense of Privacy: LLM Internal States Inversion

Authors: Tian Dong, Yan Meng, Shaofeng Li, Guoxing Chen, Zhen Liu, Haojin Zhu

Abstract: Large Language Models (LLMs) are increasingly integrated into daily routines, yet they raise significant privacy and safety concerns. Recent research proposes collaborative inference, which outsources the early-layer inference to ensure data locality, and introduces model safety auditing based on inner neuron patterns. Both techniques expose the LLM's Internal States (ISs), which are traditionally considered irreversible to inputs due to optimization challenges and the highly abstract representations in deep layers. In this work, we challenge this assumption by proposing four inversion attacks that significantly improve the semantic similarity and token matching rate of inverted inputs. Specifically, we first develop two white-box optimization-based attacks tailored for low-depth and high-depth ISs. These attacks use a two-phase inversion process to avoid convergence to local minima, a limitation observed in prior work. Then, we extend our optimization attack to the more practical setting of black-box weight access by leveraging the transferability between the source and the derived LLMs. Additionally, we introduce a generation-based attack that treats inversion as a translation task, employing an inversion model to reconstruct inputs. Extensive evaluation on short and long prompts from medical consulting and coding assistance datasets, across 6 LLMs, validates the effectiveness of our inversion attacks. Notably, a 4,112-token medical consulting prompt can be nearly perfectly inverted, with 86.88 F1 token matching, from the middle layer of a Llama-3 model. Finally, we evaluate four practical defenses, find that none of them perfectly prevents ISs inversion, and draw conclusions for future mitigation design.
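The abstract reports inversion quality as a token matching F1 score but does not define the metric. As a rough illustration only, the sketch below computes a bag-of-tokens F1 between an original and a recovered token sequence on a 0-100 scale. This order-insensitive formulation, the function name `token_f1`, and the example tokens are assumptions made for illustration and may differ from the paper's actual definition.

```python
from collections import Counter


def token_f1(reference, recovered):
    """Bag-of-tokens F1 (0-100) between two token sequences.

    Assumption: tokens are compared as a multiset, ignoring order.
    This is a common formulation, not necessarily the paper's metric.
    """
    # Two empty sequences match perfectly; one empty sequence matches nothing.
    if not reference or not recovered:
        return 100.0 if list(reference) == list(recovered) else 0.0

    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(reference) & Counter(recovered)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(recovered)  # share of recovered tokens that are correct
    recall = overlap / len(reference)     # share of original tokens that were recovered
    return 100.0 * 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical token sequences, not data from the paper.
    original = ["patient", "reports", "chest", "pain"]
    inverted = ["patient", "reports", "back", "ache"]
    print(token_f1(original, inverted))
```

Under this definition a score near 100 means almost every original token was recovered, although the score alone says nothing about whether token order was preserved.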

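Comments:	Accepted by USENIX Security 2025. Please cite this paper as "Tian Dong, Yan Meng, Shaofeng Li, Guoxing Chen, Zhen Liu, Haojin Zhu. Depth Gives a False Sense of Privacy: LLM Internal States Inversion. In the 34th USENIX Security Symposium (USENIX Security '25)."
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2507.16372 [cs.CR]
	(or arXiv:2507.16372v1 [cs.CR] for this version)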
	https://doi.org/10.48550/arXiv.2507.16372

Submission history

From: Tian Dong
[v1] Tue, 22 Jul 2025 09:15:11 UTC (2,423 KB)
