ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

Wan, Zifu; Zhang, Ce; Yong, Silong; Ma, Martin Q.; Stepputtis, Simon; Morency, Louis-Philippe; Ramanan, Deva; Sycara, Katia; Xie, Yaqi


arXiv:2507.00898 (cs)

[Submitted on 1 Jul 2025]


Abstract: Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.
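
The abstract names two ideas without defining them: contrastive decoding over an original and a perturbed output, and a per-token text-to-visual entropy ratio. The Python sketch below is illustrative only and is not taken from the authors' repository. It assumes the commonly used contrastive form (1 + alpha) * original - alpha * perturbed on logits, and it assumes the ratio is the Shannon entropy of a token's attention over text positions divided by its entropy over visual positions. All function names, the alpha parameter, and the index lists are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original, perturbed, alpha=1.0):
    """Generic contrastive decoding on logits (assumed form, not ONLY itself).

    Each decoding step needs two sets of logits, which is why two-query
    methods are slower than a single forward pass.
    """
    return [(1 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


def entropy(weights):
    """Shannon entropy in nats of non-negative weights after renormalising."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def text_to_visual_entropy_ratio(attn, visual_idx, text_idx, eps=1e-8):
    """ASSUMPTION: entropy of attention over text positions divided by
    entropy of attention over visual positions, for one generated token."""
    h_text = entropy([attn[i] for i in text_idx])
    h_visual = entropy([attn[i] for i in visual_idx])
    return h_text / (h_visual + eps)


# Toy logits from an "original" and a "perturbed" forward pass.
original = [2.0, 1.0, 0.5]
perturbed = [1.0, 1.0, 0.5]
adjusted = contrastive_logits(original, perturbed, alpha=1.0)
probs = softmax(adjusted)

# Toy attention row: positions 0-1 are visual tokens, 2-3 are text tokens.
attn = [0.25, 0.25, 0.4, 0.1]
ratio = text_to_visual_entropy_ratio(attn, visual_idx=[0, 1], text_idx=[2, 3])
```

In this toy setup the contrastive step raises the logit of the first candidate, the only one on which the two passes disagree. How ONLY actually uses the ratio inside its single-layer intervention is not specified in the abstract; the linked repository is the reference for that.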

Comments:	Accepted by ICCV 2025. Project page: https://zifuwan.github.io/ONLY/
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2507.00898 [cs.CV]
	(or arXiv:2507.00898v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2507.00898

Submission history

From: Zifu Wan
[v1] Tue, 1 Jul 2025 16:01:08 UTC (6,265 KB)
