OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination

Chen, Junzhe; Zhang, Tianshu; Huang, Shiyu; Niu, Yuwei; Sun, Chao; Zhang, Rongzhou; Zhou, Guanyu; Wen, Lijie; Hu, Xuming

Computer Science > Artificial Intelligence

arXiv:2509.00723 (cs)

[Submitted on 31 Aug 2025]

Title: OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination

Authors: Junzhe Chen, Tianshu Zhang, Shiyu Huang, Yuwei Niu, Chao Sun, Rongzhou Zhang, Guanyu Zhou, Lijie Wen, Xuming Hu

Abstract: Omni-modal large language models (OLLMs) have recently sparked a new wave of research, achieving impressive results in tasks such as audio-video understanding and real-time environment perception. However, hallucination persists. As in the bimodal setting, priors from the text modality tend to dominate, leading OLLMs to rely more heavily on textual cues while neglecting visual and audio information. Fully multimodal scenarios also introduce new challenges. Most existing models align the visual or auditory modality with text independently during training and ignore the intrinsic correlations between a video and its corresponding audio. This oversight results in hallucinations when reasoning requires interpreting hidden audio cues embedded in video content. To address these challenges, we propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in OLLMs. Specifically, OmniDPO incorporates two strategies: (1) constructing text-preference sample pairs to enhance the model's understanding of audio-video interactions; and (2) constructing multimodal-preference sample pairs to strengthen the model's attention to visual and auditory information. By tackling both challenges, OmniDPO improves multimodal grounding and reduces hallucination. Experiments on two OLLMs demonstrate that OmniDPO not only mitigates multimodal hallucinations but also significantly enhances the models' reasoning capabilities across modalities. All code and datasets will be released upon paper acceptance.
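The abstract does not state the training objective or how the preference pairs are built. As a rough illustration only, the sketch below shows the standard Direct Preference Optimization (DPO) loss for a single preference pair, which is the generic objective that a DPO-style framework would typically build on. It is not OmniDPO's actual loss. The function names, the `beta` default, and the example log-probabilities are all hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    Assumption: beta=0.1 is a placeholder, not a value from the paper.
    """
    # Implicit rewards: change in log-probability relative to the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# Hypothetical numbers: the policy has raised the chosen response and
# lowered the rejected one relative to the reference model.
untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)
trained = dpo_loss(-5.0, -15.0, -10.0, -10.0)
print(untrained, trained)
```

In this sketch, a text-preference pair and a multimodal-preference pair would both reduce to a "chosen" and a "rejected" log-probability; what differs between the two strategies is how those pairs are constructed, which the abstract leaves unspecified.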

Subjects:	Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Cite as:	arXiv:2509.00723 [cs.AI]
	(or arXiv:2509.00723v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2509.00723

Submission history

From: Tianshu Zhang
[v1] Sun, 31 Aug 2025 07:19:32 UTC (9,509 KB)
