Research on Driving Scenario Technology Based on Multimodal Large Language Model Optimization

Mengjie Wang, Huiping Zhu, Jian Li, Wenxiu Shi, Song Zhang

Computer Science > Computer Vision and Pattern Recognition

arXiv:2506.02014 (cs)

[Submitted on 28 May 2025]

Abstract: As autonomous and assisted driving technologies advance, they place higher demands on the ability to understand complex driving scenarios. General-purpose multimodal large models have emerged as one way to meet this challenge. However, applying these models in vertical domains raises difficulties in data collection, model training, and deployment optimization. This paper proposes a comprehensive method for optimizing multimodal models in driving scenarios, covering four tasks: cone detection, traffic light recognition, speed limit recommendation, and intersection alerts. The method addresses dynamic prompt optimization, dataset construction, model training, and deployment. Specifically, dynamic prompt optimization adjusts the prompt according to the content of the input image so that the model focuses on objects affecting the ego vehicle, which strengthens its task-specific focus and judgment. The dataset combines real and synthetic data to form a high-quality, diverse multimodal training set, improving the model's generalization in complex driving environments. In model training, knowledge distillation, dynamic fine-tuning, and quantization are integrated to reduce storage and computational costs while improving performance. Experimental results show that this systematic optimization method significantly improves the model's accuracy on key tasks while using resources efficiently, supporting the practical application of driving scenario perception technologies.
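
The abstract describes dynamic prompt optimization only at a high level: the prompt is adjusted to the image content so that the model attends to objects affecting the ego vehicle. The following is a minimal sketch of that idea, not the authors' implementation. It assumes an upstream step has already produced a list of detected objects. The `Detection` type, the `in_ego_lane` flag, the label names, and the hint texts in `TASK_HINTS` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical output of an upstream perception step."""
    label: str         # e.g. "traffic_cone"
    in_ego_lane: bool  # assumed flag: does this object affect the ego vehicle?


# Hypothetical task hints, one per task named in the abstract.
TASK_HINTS = {
    "traffic_cone": "Report whether cones block the ego lane.",
    "traffic_light": "State the color of the light governing the ego lane.",
    "speed_limit_sign": "Recommend a speed limit based on visible signs.",
    "intersection": "Warn if the ego vehicle is approaching an intersection.",
}

BASE_PROMPT = "You are a driving assistant. Describe only what affects the ego vehicle."


def build_prompt(detections):
    """Append task hints only for object types that affect the ego vehicle."""
    relevant = []
    for det in detections:
        # Skip objects outside the ego lane, unknown labels, and repeats.
        if det.in_ego_lane and det.label in TASK_HINTS and det.label not in relevant:
            relevant.append(det.label)
    hints = [TASK_HINTS[label] for label in relevant]
    return " ".join([BASE_PROMPT] + hints)
```

With this sketch, a frame containing cones in the ego lane and a traffic light that governs another lane yields a prompt that mentions only the cones; an empty detection list falls back to the base prompt.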

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2506.02014 [cs.CV]
	(or arXiv:2506.02014v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.02014

Submission history

From: Mengjie Wang
[v1] Wed, 28 May 2025 02:22:11 UTC (2,582 KB)
