How Do Vision-Language Models Process Conflicting Information Across Modalities?

Hua, Tianze; Yun, Tian; Pavlick, Ellie

arXiv:2507.01790v1 (cs)

[Submitted on 2 Jul 2025]

Authors: Tianze Hua, Tian Yun, Ellie Pavlick

Abstract: AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption "A photo of a cat") and ask the model to report the information present in one of the specific modalities (e.g., "What does the caption say / What is in the image?"). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model, and that specific attention heads can restructure the representations to favor one modality over the other. Moreover, we find modality-agnostic "router heads" which appear to promote answers about the modality requested in the instruction, and which can be manipulated or transferred in order to improve performance across datasets and modalities. Together, the work provides essential steps towards identifying and controlling if and how models detect and resolve conflicting signals within complex multimodal environments.
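The conflicting-input setup described in the abstract can be illustrated with a small sketch. This is not the authors' code (their implementation is in the repository linked under Comments). It only assumes what the abstract states: an image of one class is paired with a caption naming a different class, and the model is asked about one modality at a time. The names `ConflictExample`, `build_conflict_examples`, and `modality_accuracy` are hypothetical, and images are represented by their class label only.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class ConflictExample:
    """One conflicting image/caption pair plus a modality-specific question.

    All field names are hypothetical; the image is stood in for by its label.
    """
    image_label: str
    caption: str
    question: str
    target_modality: str  # "image" or "caption"
    expected_answer: str


# Question wording taken from the example in the abstract.
QUESTIONS = {
    "caption": "What does the caption say?",
    "image": "What is in the image?",
}


def build_conflict_examples(labels):
    """Pair every image class with every *different* caption class."""
    examples = []
    for image_label, caption_label in permutations(labels, 2):
        caption = f"A photo of a {caption_label}"
        for modality, question in QUESTIONS.items():
            expected = image_label if modality == "image" else caption_label
            examples.append(
                ConflictExample(image_label, caption, question, modality, expected)
            )
    return examples


def modality_accuracy(examples, predictions, modality):
    """Fraction of questions about `modality` that were answered correctly."""
    pairs = [
        (ex, pred)
        for ex, pred in zip(examples, predictions)
        if ex.target_modality == modality
    ]
    if not pairs:
        return 0.0
    return sum(pred == ex.expected_answer for ex, pred in pairs) / len(pairs)


if __name__ == "__main__":
    examples = build_conflict_examples(["dog", "cat"])
    # A stand-in "model" that always reports the image, whatever is asked.
    image_biased = [ex.image_label for ex in examples]
    print("image accuracy:", modality_accuracy(examples, image_biased, "image"))
    print("caption accuracy:", modality_accuracy(examples, image_biased, "caption"))
```

Comparing the two accuracies for a given model is one simple way to expose the kind of modality preference the abstract reports: a model that always describes the image scores perfectly on image questions and fails every caption question, because the two classes never coincide.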

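Comments:	All code and resources are available at: https://github.com/ethahtz/vlm_conflicting_info_processing
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2507.01790 [cs.CL]
	(or arXiv:2507.01790v1 [cs.CL] for this version)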
	https://doi.org/10.48550/arXiv.2507.01790

Submission history

From: Tianze Hua
[v1] Wed, 2 Jul 2025 15:15:14 UTC (6,642 KB)
