ViTA-PAR: Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition

Park, Minjeong; Park, Hongbeen; Kim, Jinkyu

Computer Science > Computer Vision and Pattern Recognition

arXiv:2506.01411 (cs)

[Submitted on 2 Jun 2025]

Title: ViTA-PAR: Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition

Authors: Minjeong Park, Hongbeen Park, Jinkyu Kim

Abstract: The Pedestrian Attribute Recognition (PAR) task aims to identify various detailed attributes of an individual, such as clothing, accessories, and gender. To enhance PAR performance, a model must capture features ranging from coarse-grained global attributes (e.g., for identifying gender) to fine-grained local details (e.g., for recognizing accessories) that may appear in diverse regions. Recent research suggests that body part representation can enhance the model's robustness and accuracy, but these methods are often restricted to attribute classes within fixed horizontal regions, leading to degraded performance when attributes appear in varying or unexpected body locations. In this paper, we propose Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition, dubbed ViTA-PAR, to enhance attribute recognition through specialized multimodal prompting and vision-language alignment. We introduce visual attribute prompts that capture global-to-local semantics, enabling diverse attribute representations. To enrich textual embeddings, we design a learnable prompt template, termed person and attribute context prompting, to learn person and attribute context. Finally, we align visual and textual attribute features for effective fusion. ViTA-PAR is validated on four PAR benchmarks, achieving competitive performance with efficient inference. We release our code and model at https://github.com/mlnjeongpark/ViTA-PAR.
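
The abstract does not say how the visual and textual attribute features are aligned. As a rough illustration only, the sketch below scores each attribute by the cosine similarity between one visual feature and one text embedding, then applies a sigmoid to get independent multi-label probabilities. The function names, the `temperature` parameter, and the toy feature values are all hypothetical and are not taken from the ViTA-PAR code release.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def attribute_probabilities(visual_feats, text_feats, temperature=0.1):
    """Generic similarity-based scoring, NOT the authors' implementation.

    visual_feats[i] and text_feats[i] are the (hypothetical) visual feature
    and text embedding for attribute i. Each attribute is scored on its own,
    so several attributes can be predicted as present at once.
    """
    if len(visual_feats) != len(text_feats):
        raise ValueError("need one visual and one text feature per attribute")
    probs = []
    for v, t in zip(visual_feats, text_feats):
        v_n, t_n = l2_normalize(v), l2_normalize(t)
        cos = sum(a * b for a, b in zip(v_n, t_n))  # cosine similarity in [-1, 1]
        probs.append(1.0 / (1.0 + math.exp(-cos / temperature)))  # sigmoid
    return probs


# Toy example with made-up 3-dimensional features for two attributes.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text = [[2.0, 0.0, 0.0], [0.0, -3.0, 0.0]]
probs = attribute_probabilities(visual, text)
```

In this toy input the first pair of features points in the same direction and the second pair in opposite directions, so the first attribute receives a probability near 1 and the second a probability near 0. A sigmoid is used instead of a softmax because PAR is a multi-label problem in which attributes are not mutually exclusive.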

Comments:	Accepted to IEEE ICIP 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2506.01411 [cs.CV]
	(or arXiv:2506.01411v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.01411

Submission history

From: Minjeong Park
[v1] Mon, 2 Jun 2025 08:07:06 UTC (605 KB)
