World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model

Zheng, Yupeng; Yang, Pengxuan; Xing, Zebin; Zhang, Qichao; Zheng, Yuhang; Gao, Yinfeng; Li, Pengfei; Zhang, Teng; Xia, Zhongpu; Jia, Peng; Zhao, Dongbin


arXiv:2507.00603 (cs)

[Submitted on 1 Jul 2025]

Abstract: End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions, and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.1% relative reduction in L2 error, a 46.7% lower collision rate, and 3.75× faster training convergence. Code will be available at https://github.com/ucaszyp/World4Drive.
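The abstract describes predicting several intention-driven future states in latent space and aligning predictions with the actual future observation. The following is a minimal sketch of that comparison step only, not the authors' implementation. It assumes that each candidate trajectory comes with one predicted latent vector, that the actual future frame has been encoded into a vector of the same size, and that alignment is scored by Euclidean distance. The abstract confirms none of these details, and the function names `l2_distance` and `select_by_latent_alignment` are hypothetical.

```python
from math import sqrt


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_by_latent_alignment(predicted_latents, actual_latent):
    """Pick the candidate whose predicted latent is closest to the actual one.

    predicted_latents: list of K latent vectors, one per candidate trajectory
        (assumed layout, for illustration only).
    actual_latent: latent vector encoded from the observed future frame.
    Returns (best_index, distances).
    """
    if not predicted_latents:
        raise ValueError("need at least one candidate")
    distances = [l2_distance(p, actual_latent) for p in predicted_latents]
    best_index = min(range(len(distances)), key=distances.__getitem__)
    return best_index, distances


# Toy, made-up values; these are not results from the paper.
candidates = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
observed = [1.0, 1.5]
best, dists = select_by_latent_alignment(candidates, observed)
print(best, dists)
```

Because this comparison needs the actual future observation, it is only meaningful where that observation is available, such as during training. How the selector scores trajectories at inference time is not specified in the abstract.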

Comments:	ICCV 2025, first version
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2507.00603 [cs.CV]
	(or arXiv:2507.00603v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2507.00603

Submission history

From: Yupeng Zheng
[v1] Tue, 1 Jul 2025 09:36:38 UTC (1,486 KB)
