ZeShot-VQA: Zero-Shot Visual Question Answering Framework with Answer Mapping for Natural Disaster Damage Assessment

Karimi, Ehsan; Rahnemoonfar, Maryam

arXiv:2506.00238v1 (cs)

[Submitted on 30 May 2025]

Authors: Ehsan Karimi, Maryam Rahnemoonfar

Abstract: Natural disasters usually affect vast areas and devastate infrastructure. A timely and efficient response is crucial to minimize the impact on affected communities, and data-driven approaches are the best choice. Visual question answering (VQA) models help management teams gain an in-depth understanding of the damage. However, recently published models cannot answer open-ended questions; they only select the best answer from a predefined list of answers. If we want to ask questions whose possible answers are not in the predefined list, the model needs to be fine-tuned or retrained on a newly collected and annotated dataset, which is a time-consuming procedure. In recent years, large-scale Vision-Language Models (VLMs) have attracted significant attention. These models are trained on extensive datasets and demonstrate strong performance on both unimodal and multimodal vision/language downstream tasks, often without the need for fine-tuning. In this paper, we propose a VLM-based zero-shot VQA (ZeShot-VQA) method and investigate its performance on the post-disaster FloodNet dataset. Since the proposed method takes advantage of zero-shot learning, it can be applied to new datasets without fine-tuning. In addition, ZeShot-VQA is able to process and generate answers that have not been seen during the training procedure, which demonstrates its flexibility.
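The title mentions answer mapping, but the abstract does not say how it is done. As a rough illustration of the general idea only, and not the authors' method, the sketch below maps a free-form answer string (such as a VLM might generate) onto the closest entry in a caller-supplied list of labels. The function names, the example labels, and the use of string similarity are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(cleaned.split())


def map_answer(free_form: str, candidates: list[str]) -> str:
    """Return the candidate label closest to a free-form answer.

    Hypothetical helper: plain string matching stands in for whatever
    mapping the paper actually uses.
    """
    answer = normalize(free_form)

    # Labels mentioned as whole words in the answer take priority.
    mentioned = [
        label
        for label in candidates
        if re.search(rf"\b{re.escape(normalize(label))}\b", answer)
    ]
    if mentioned:
        # Prefer the longest match so "non-flooded" beats "flooded".
        return max(mentioned, key=lambda label: len(normalize(label)))

    # Otherwise fall back to the most similar label by character overlap.
    return max(
        candidates,
        key=lambda label: SequenceMatcher(None, answer, normalize(label)).ratio(),
    )


# Illustrative labels only; they are not taken from the paper.
labels = ["flooded", "non-flooded"]
print(map_answer("The road is non-flooded.", labels))
print(map_answer("flooding", labels))
```

Because the label list is passed in at call time, new answer options can be added without retraining anything, which is the property the abstract emphasizes. A practical system would likely compare answers in an embedding space rather than by characters.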

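Comments:	Accepted by the 2025 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025)
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
ACM classes:	I.2.7; I.2.10; I.5.1
Cite as:	arXiv:2506.00238 [cs.CV]
	(or arXiv:2506.00238v1 [cs.CV] for this version)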
	https://doi.org/10.48550/arXiv.2506.00238

Submission history

From: Ehsan Karimi
[v1] Fri, 30 May 2025 21:15:11 UTC (1,783 KB)
