RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation

Ma, Siju; Gong, Changsiyu; Fan, Xiaofeng; Ma, Yong; Jiang, Chengjie

Computer Science > Computer Vision and Pattern Recognition

arXiv:2509.12710 (cs)

[Submitted on 16 Sep 2025]

Authors: Siju Ma, Changsiyu Gong, Xiaofeng Fan, Yong Ma, Chengjie Jiang

Abstract: Text-driven infrared and visible image fusion has gained attention because it lets natural language guide the fusion process. However, existing methods lack a goal-aligned task to supervise and evaluate how effectively the input text contributes to the fusion outcome. We observe that referring image segmentation (RIS) and text-driven fusion share a common objective: highlighting the object referred to by the text. Motivated by this, we propose RIS-FUSION, a cascaded framework that unifies fusion and RIS through joint optimization. At its core is the LangGatedFusion module, which injects textual features into the fusion backbone to enhance semantic alignment. To support the multimodal referring image segmentation task, we introduce MM-RIS, a large-scale benchmark with 12.5k training and 3.5k testing triplets, each consisting of an infrared-visible image pair, a segmentation mask, and a referring expression. Extensive experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU. Code and dataset will be released at https://github.com/SijuMa2003/RIS-FUSION.
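
To make the evaluation setting described in the abstract concrete, the sketch below shows one plausible way to represent an MM-RIS-style triplet and to compute mIoU over binary masks. It is an illustration only: the `Triplet` class, its field names, and the choice to average per-sample IoU are assumptions, not the authors' data format or evaluation protocol, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List, Sequence

# A binary mask as nested lists: 1 marks the referred object, 0 the background.
Mask = List[List[int]]


@dataclass
class Triplet:
    """Hypothetical container for one sample: an infrared-visible image
    pair, a segmentation mask, and a referring expression."""
    infrared_path: str   # assumed field name
    visible_path: str    # assumed field name
    mask: Mask
    expression: str


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average of per-sample IoU (one common reading of 'mIoU')."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        raise ValueError("at least one sample is required")
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)
```

For example, `mean_iou([[[1, 1], [0, 0]], [[1, 0], [0, 0]]], [[[1, 0], [0, 0]], [[1, 0], [0, 0]]])` averages an IoU of 0.5 and an IoU of 1.0, giving 0.75.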


Comments:	5 pages, 2 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2509.12710 [cs.CV]
	(or arXiv:2509.12710v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2509.12710

Submission history

From: Chengjie Jiang
[v1] Tue, 16 Sep 2025 06:03:15 UTC (696 KB)
