Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification

Li, Shuang; Leng, Jiaxu; Kuang, Changjiang; Tan, Mingpi; Gao, Xinbo

arXiv:2506.02439 (cs)

[Submitted on 3 June 2025]

Abstract: Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both the infrared and visible modalities. In principle, the Contrastive Language-Image Pre-training (CLIP) model can be leveraged to generate video-level language prompts that guide the learning of modality-invariant sequence-level features. However, generating and utilizing modality-shared video-level language prompts to bridge the modality gap remains a critical challenge. To address this problem, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP jointly fine-tunes the visual encoder and the prompt learner to generate modality-shared text prompts and align them with visual features from different modalities in CLIP's multimodal space, thereby mitigating modality differences. STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), and further enhances IMLP by incorporating spatiotemporal information into the text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces a dedicated identity-level loss and specialized multi-head attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. The code will be released at https://github.com/Visuang/VLD.
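
The abstract gives no implementation details, so the following is only a toy sketch of two ideas it names, not the authors' method. It assumes that (a) "aggregating and diffusing" information into per-frame [CLS] tokens can be approximated by averaging the tokens into a hub vector and blending it back, and (b) aligning modality-shared text prompts with visual features can be approximated by a generic CLIP-style cross-entropy over cosine similarities. All function names, the `mix` weight, the temperature, and the 3-dimensional embeddings are hypothetical; STA's multi-head attention and identity-level loss are not modeled.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def frame_mean(cls_tokens):
    """Average the per-frame [CLS] vectors along the time axis."""
    n, dim = len(cls_tokens), len(cls_tokens[0])
    return [sum(tok[d] for tok in cls_tokens) / n for d in range(dim)]

def hub_aggregate_diffuse(cls_tokens, mix=0.5):
    """Toy stand-in for a spatial-temporal hub (assumption, not the paper's STH).

    Aggregate: average the per-frame [CLS] vectors into one hub vector.
    Diffuse: blend that hub vector back into every frame's [CLS] vector.
    `mix` is a hypothetical blending weight.
    """
    hub = frame_mean(cls_tokens)
    return [[(1 - mix) * t + mix * h for t, h in zip(tok, hub)]
            for tok in cls_tokens]

def video_feature(cls_tokens):
    """Temporal average pooling followed by L2 normalization."""
    return l2_normalize(frame_mean(cls_tokens))

def prompt_alignment_loss(video_feats, labels, text_feats, temperature=0.07):
    """Generic video-to-text cross-entropy over cosine similarities.

    Each video feature, visible or infrared, is pulled toward the text
    feature of its identity and pushed away from the other identities.
    """
    total = 0.0
    for feat, label in zip(video_feats, labels):
        logits = [dot(feat, t) / temperature for t in text_feats]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[label]
    return total / len(video_feats)

# Hypothetical data: two identities, one shared text feature per identity.
text_feats = [l2_normalize([1.0, 0.0, 0.0]), l2_normalize([0.0, 1.0, 0.0])]
visible_clip = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]   # identity 0, two frames
infrared_clip = [[0.8, 0.2, 0.1], [0.7, 0.1, 0.0]]  # identity 0, two frames

feats = [video_feature(hub_aggregate_diffuse(clip))
         for clip in (visible_clip, infrared_clip)]
loss = prompt_alignment_loss(feats, [0, 0], text_feats)
print(f"alignment loss: {loss:.4f}")
```

Because both clips are scored against the same per-identity text feature, minimizing this loss pushes visible and infrared features of one person toward a common point, which is the intuition behind using language as a modality-shared anchor.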

Comments:	Accepted by IEEE TIFS
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2506.02439 [cs.CV]
	(or arXiv:2506.02439v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.02439

Submission history

From: Shuang Li
[v1] Tue, 3 Jun 2025 04:49:08 UTC (2,894 KB)
