StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

Wei, Meng; Wan, Chenyang; Yu, Xiqian; Wang, Tai; Yang, Yuqiang; Mao, Xiaohan; Zhu, Chenming; Cai, Wenzhe; Wang, Hanqing; Chen, Yilun; Liu, Xihui; Pang, Jiangmiao

arXiv:2507.05240 (cs)

[Submitted on 7 July 2025]

Title: StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

Authors: Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang

Abstract: Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate low-latency actions grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/

Subjects:	Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2507.05240 [cs.RO]
	(or arXiv:2507.05240v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2507.05240

Submission history

From: Meng Wei
[v1] Mon, 7 Jul 2025 17:49:41 UTC (9,614 KB)
