INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

Dong, Xin; Dong, Shichao; Wang, Jin; Huang, Jing; Zhou, Li; Sun, Zenghui; Jing, Lihua; Lan, Jingsong; Zhu, Xiaoyong; Zheng, Bo

arXiv:2507.05056 (cs)

[Submitted on 7 Jul 2025]

Title: INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

Authors: Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou, Zenghui Sun, Lihua Jing, Jingsong Lan, Xiaoyong Zhu, Bo Zheng

Abstract: Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet are inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities to understand the content, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtain insights that, surprisingly, reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose INTER (Interaction Guidance Sampling), a novel training-free algorithm that mitigates hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2507.05056 [cs.CV]
	(or arXiv:2507.05056v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2507.05056

Submission history

From: Xin Dong
[v1] Mon, 7 Jul 2025 14:38:53 UTC (4,879 KB)
