RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding

Xiao, Xi; Zhang, Yunbei; Wang, Janet; Zhao, Lin; Wei, Yuxiang; Li, Hengjia; Li, Yanshu; Wang, Xiao; Roy, Swalpa Kumar; Xu, Hao; Wang, Tianyang

Computer Science > Computational Engineering, Finance, and Science

arXiv:2507.17353 (cs)

[Submitted on 23 Jul 2025]

Title: RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding

Authors: Xi Xiao, Yunbei Zhang, Janet Wang, Lin Zhao, Yuxiang Wei, Hengjia Li, Yanshu Li, Xiao Wang, Swalpa Kumar Roy, Hao Xu, Tianyang Wang

Abstract: Accurate road damage detection is crucial for timely infrastructure maintenance and public safety, but existing vision-only datasets and models lack the rich contextual understanding that textual information can provide. To address this limitation, we introduce RoadBench, the first multimodal benchmark for comprehensive road damage understanding. The dataset pairs high-resolution images of road damage with detailed textual descriptions, providing richer context for model training. We also present RoadCLIP, a novel vision-language model that builds on CLIP by integrating domain-specific enhancements: a disease-aware positional encoding that captures spatial patterns of road defects, and a mechanism for injecting road-condition priors to refine the model's understanding of road damage. We further employ a GPT-driven data generation pipeline to expand the image-text pairs in RoadBench, greatly increasing data diversity without exhaustive manual annotation. Experiments demonstrate that RoadCLIP achieves state-of-the-art performance on road damage recognition tasks, significantly outperforming existing vision-only models by 19.2%. These results highlight the advantages of integrating visual and textual information for road condition analysis, setting a new benchmark for the field and paving the way for more effective infrastructure monitoring through multimodal learning.

Subjects:	Computational Engineering, Finance, and Science (cs.CE)
Cite as:	arXiv:2507.17353 [cs.CE]
	(or arXiv:2507.17353v1 [cs.CE] for this version)
	https://doi.org/10.48550/arXiv.2507.17353

Submission history

From: Xi Xiao
[v1] Wed, 23 Jul 2025 09:34:35 UTC (1,611 KB)
