RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model

Wen, Congcong; Lin, Yiting; Qu, Xiaokang; Li, Nan; Liao, Yong; Lin, Hui; Li, Xiang

arXiv:2504.04988 (cs)

[Submitted on 7 Apr 2025]

Authors: Congcong Wen, Yiting Lin, Xiaokang Qu, Nan Li, Yong Liao, Hui Lin, Xiang Li

Abstract: Recent progress in vision-language models (VLMs) has demonstrated impressive capabilities across a variety of tasks in the natural image domain. Motivated by these advancements, the remote sensing community has begun to adopt VLMs for remote sensing vision-language tasks, including scene understanding, image captioning, and visual question answering. However, existing remote sensing VLMs typically rely on closed-set scene understanding and focus on generic scene descriptions, and they lack the ability to incorporate external knowledge. This limitation hinders their capacity for semantic reasoning over complex or context-dependent queries that involve domain-specific or world knowledge. To address these challenges, we first introduced a multimodal Remote Sensing World Knowledge (RSWK) dataset, which comprises high-resolution satellite imagery and detailed textual descriptions for 14,141 well-known landmarks from 175 countries, integrating both remote sensing domain knowledge and broader world knowledge. Building upon this dataset, we proposed a novel Remote Sensing Retrieval-Augmented Generation (RS-RAG) framework, which consists of two key components. The Multi-Modal Knowledge Vector Database Construction module encodes remote sensing imagery and associated textual knowledge into a unified vector space. The Knowledge Retrieval and Response Generation module retrieves and re-ranks relevant knowledge based on image and/or text queries, and incorporates the retrieved content into a knowledge-augmented prompt to guide the VLM in producing contextually grounded responses. We validated the effectiveness of our approach on three representative vision-language tasks (image captioning, image classification, and visual question answering), where RS-RAG significantly outperformed state-of-the-art baselines.
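
The two components named in the abstract (building a knowledge vector database, then retrieving, re-ranking, and assembling a knowledge-augmented prompt) can be illustrated with a minimal text-only sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for the learned image and text encoders, the token-overlap `rerank` is a placeholder for whatever re-ranking the paper uses, image queries are not modeled, and the `KNOWLEDGE` entries and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Made-up placeholder records; these are NOT entries from the RSWK dataset.
KNOWLEDGE = [
    ("landmark_a", "stone arch bridge crossing a wide river near an old town"),
    ("landmark_b", "large stadium with an oval running track and parking lots"),
    ("landmark_c", "harbor with container cranes and cargo ships along a river mouth"),
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text):
    """Toy encoder (assumption): unit-length bag-of-words vector as a dict."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}

def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())

def build_index(entries):
    """Step 1: encode every knowledge entry into the shared vector space."""
    return [(name, text, embed(text)) for name, text in entries]

def retrieve(index, query, k=2):
    """Step 2a: return the k entries most similar to the query."""
    q = embed(query)
    return sorted(index, key=lambda e: cosine(q, e[2]), reverse=True)[:k]

def rerank(candidates, query):
    """Step 2b: reorder candidates by raw token overlap with the query."""
    q_tokens = set(tokenize(query))
    return sorted(
        candidates,
        key=lambda e: len(q_tokens & set(tokenize(e[1]))),
        reverse=True,
    )

def build_prompt(question, passages):
    """Step 2c: place the retrieved knowledge ahead of the question."""
    context = "\n".join(f"- {name}: {text}" for name, text, _ in passages)
    return f"Knowledge:\n{context}\n\nQuestion: {question}\nAnswer:"

index = build_index(KNOWLEDGE)
question = "Which bridge crosses the river?"
top = rerank(retrieve(index, question, k=2), question)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system the prompt would be passed to a VLM together with the query image; here it is only printed, so that the flow from index to retrieval to prompt is visible end to end.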

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2504.04988 [cs.CV]
	(or arXiv:2504.04988v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2504.04988

Submission history

From: Congcong Wen
[v1] Mon, 7 Apr 2025 12:13:43 UTC (6,242 KB)
