LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring

Li, Chloe; Phuong, Mary; Siegel, Noah Y.

Computer Science > Cryptography and Security

arXiv:2508.00943 (cs)

[Submitted on 31 Jul 2025]

Title: LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring

Authors: Chloe Li, Mary Phuong, Noah Y. Siegel

Abstract: Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is sandbagging - the strategic underperformance on evaluations by AI models or their developers. One promising defense is to monitor a model's chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitor-oblivious or monitor-aware. We show that both frontier models and small open-source models can covertly sandbag against CoT monitoring zero-shot without hints. However, they cannot yet do so reliably: they bypass the monitor 16-36% of the time when monitor-aware, conditioned on sandbagging successfully. We qualitatively analyzed the uncaught CoTs to understand why the monitor failed. We reveal a rich attack surface for CoT monitoring and contribute five covert sandbagging policies generated by models. These results inform potential failure modes of CoT monitoring and may help build more diverse sandbagging model organisms.
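The headline figure in the abstract is a conditional rate: among attempts where the model sandbagged successfully, the fraction that the CoT monitor did not flag. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The `Trial` record, its field names, and the sample data are hypothetical, and the abstract does not specify how "sandbagging successfully" or "flagged" is decided, so both are treated here as given booleans.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Trial:
    """One evaluation attempt (hypothetical record, not from the paper)."""
    sandbagged: bool  # assumed: the model successfully underperformed
    flagged: bool     # assumed: the CoT monitor flagged the reasoning


def conditional_bypass_rate(trials: Sequence[Trial]) -> Optional[float]:
    """Fraction of successfully sandbagged trials the monitor did not flag.

    Returns None when no trial sandbagged, since the rate is undefined.
    """
    sandbagged = [t for t in trials if t.sandbagged]
    if not sandbagged:
        return None
    bypassed = sum(1 for t in sandbagged if not t.flagged)
    return bypassed / len(sandbagged)


# Made-up data for illustration only: four successful sandbagging
# attempts, one of which went unflagged, plus one honest attempt
# that is excluded from the denominator.
trials = [
    Trial(sandbagged=True, flagged=True),
    Trial(sandbagged=True, flagged=False),
    Trial(sandbagged=True, flagged=True),
    Trial(sandbagged=True, flagged=True),
    Trial(sandbagged=False, flagged=False),
]
rate = conditional_bypass_rate(trials)
print(rate)
```

Conditioning on successful sandbagging keeps trials where the model never underperformed out of the denominator, so the rate reflects only how often covert underperformance evades the monitor.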

Comments:	25 pages, 9 figures
Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2508.00943 [cs.CR]
	(or arXiv:2508.00943v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2508.00943

Submission history

From: Chloe Li
[v1] Thu, 31 Jul 2025 15:19:30 UTC (2,051 KB)
