Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying

Xue, Youze; Li, Dian; Liu, Gang

Computer Science > Computer Vision and Pattern Recognition

arXiv:2506.02020v1 (cs)

[Submitted on 28 May 2025]

Title: Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying

Authors: Youze Xue, Dian Li, Gang Liu

Abstract: With the rapid advancement of multi-modal large language models (MLLMs) in recent years, the foundational Contrastive Language-Image Pretraining (CLIP) framework has been successfully extended to MLLMs, enabling more powerful and universal multi-modal embeddings for a wide range of retrieval tasks. Despite these developments, the core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs. Within this framework, the effective mining of hard negative samples continues to be a critical factor for enhancing performance. Prior works have introduced both offline and online strategies for hard negative mining to improve the efficiency of contrastive learning. While these approaches have led to improved multi-modal embeddings, the specific contribution of each hard negative sample to the learning process has not been thoroughly investigated. In this work, we conduct a detailed analysis of the gradients of the info-NCE loss with respect to the query, positive, and negative samples, elucidating the role of hard negatives in updating model parameters. Building upon this analysis, we propose to explicitly amplify the gradients associated with hard negative samples, thereby encouraging the model to learn more discriminative embeddings. Our multi-modal embedding model, trained with the proposed Explicit Gradient Amplifier and based on the LLaVA-OneVision-7B architecture, achieves state-of-the-art performance on the MMEB benchmark compared to previous methods utilizing the same MLLM backbone. Furthermore, when integrated with our self-developed MLLM, QQMM, our approach attains the top rank on the MMEB leaderboard. Code and models are released at https://github.com/QQ-MM/QQMM-embed.
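
The abstract refers to the gradients of the info-NCE loss with respect to the query, positive, and negative samples but does not state them. As a rough illustration only, the sketch below computes those gradients for the textbook form of the loss, assuming plain dot-product similarity and a temperature `tau`. The function name `info_nce_grads`, the toy vectors, and the choice `tau=0.1` are hypothetical and do not come from the paper or its repository, and the paper's Explicit Gradient Amplifier itself is not reproduced here.

```python
import math


def info_nce_grads(q, pos, negs, tau=0.1):
    """Loss and analytic gradients of a textbook info-NCE loss.

    Assumes similarity s(q, k) = q . k (dot product) and
    L = -log( exp(s(q, pos)/tau) / sum_j exp(s(q, k_j)/tau) ),
    where the sum runs over the positive and all negatives.
    """
    keys = [pos] + list(negs)
    logits = [sum(a * b for a, b in zip(q, k)) / tau for k in keys]
    # Numerically stable softmax over the positive and the negatives.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    loss = -math.log(probs[0])
    # dL/dlogit_j = p_j - 1 for the positive, p_j for each negative.
    coeffs = [probs[0] - 1.0] + probs[1:]
    dim = len(q)
    # dL/dq = (1/tau) * sum_j coeff_j * k_j
    grad_q = [sum(c * k[d] for c, k in zip(coeffs, keys)) / tau for d in range(dim)]
    # dL/dpos = (1/tau) * (p_pos - 1) * q
    grad_pos = [coeffs[0] * x / tau for x in q]
    # dL/dneg_j = (1/tau) * p_j * q, so a negative's gradient scales with p_j.
    grad_negs = [[c * x / tau for x in q] for c in coeffs[1:]]
    return loss, grad_q, grad_pos, grad_negs


# Toy example: one hard negative (close to the query) and one easy negative.
query = [1.0, 0.0]
positive = [0.9, 0.1]
hard_negative = [0.8, 0.2]
easy_negative = [-1.0, 0.0]
loss, grad_q, grad_pos, grad_negs = info_nce_grads(
    query, positive, [hard_negative, easy_negative]
)
for name, g in zip(["hard", "easy"], grad_negs):
    print(name, "negative gradient norm:", math.sqrt(sum(x * x for x in g)))
```

Under these assumptions, the gradient reaching each negative is the query scaled by that negative's softmax weight, so the hard negative receives almost all of the negative-side gradient while the easy negative receives practically none. This is consistent with the abstract's motivation for amplifying the gradients of hard negatives.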

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2506.02020 [cs.CV]
	(or arXiv:2506.02020v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.02020

Submission history

From: Youze Xue
[v1] Wed, 28 May 2025 11:18:19 UTC (323 KB)
