FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference

Liu, Xing; Luo, Lizhuo; Tang, Ming; Huang, Chao

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2507.02620 (cs)

[Submitted on 3 Jul 2025]

Title: FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference

Authors: Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang

Abstract: Distributed inference is a promising approach to enabling large language model (LLM) inference at the network edge. It distributes the inference process across multiple devices so that the LLM fits into device memory. Recent pipeline-based approaches can parallelize communication and computation, which helps reduce inference latency. However, the benefit diminishes when inference requests at the network edge are sparse, in which case the pipeline typically runs at low utilization. To enable efficient distributed LLM inference at the edge, we propose FlowSpec, a pipeline-parallel, tree-based speculative decoding framework. FlowSpec incorporates three key mechanisms to improve decoding efficiency: 1) score-based step-wise verification, which prioritizes more important draft tokens so that tokens are accepted earlier; 2) efficient draft management, which prunes invalid tokens while maintaining correct causal relationships during verification; and 3) dynamic draft expansion strategies, which supply high-quality speculative inputs. These techniques work in concert to enhance both pipeline utilization and speculative efficiency. We evaluate FlowSpec against other baselines on a real-world testbed. Experimental results demonstrate that the proposed framework significantly improves inference speed across diverse models and configurations, achieving speedup ratios of 1.36× to 1.77× compared to baselines. Our code is publicly available at https://github.com/Leosang-lx/FlowSpec#
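To make the first two mechanisms concrete, here is a minimal Python sketch of how a draft tree might be ordered by score, cut into segments for a pipeline, and pruned after a rejection. It is not taken from the FlowSpec code base: the `DraftNode` type, the meaning of `score`, the greedy ordering rule, and the fixed `segment_size` are all assumptions made for illustration, and the abstract does not specify how the authors implement any of these steps.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class DraftNode:
    """Hypothetical draft-tree node; not a FlowSpec data structure."""
    node_id: int
    parent: Optional[int]  # None: the node follows the last accepted token
    score: float           # assumed draft confidence, higher is better


def order_by_score(nodes: List[DraftNode]) -> List[int]:
    """Greedy order: highest score first, but never a child before its parent."""
    by_id: Dict[int, DraftNode] = {n.node_id: n for n in nodes}
    emitted: Set[int] = set()
    remaining: Set[int] = set(by_id)
    order: List[int] = []
    while remaining:
        # A node is ready once its parent has already been scheduled.
        ready = [i for i in remaining
                 if by_id[i].parent is None or by_id[i].parent in emitted]
        if not ready:
            raise ValueError("some node refers to a parent that is not in the tree")
        best = max(ready, key=lambda i: (by_id[i].score, -i))
        order.append(best)
        emitted.add(best)
        remaining.remove(best)
    return order


def split_segments(order: List[int], segment_size: int) -> List[List[int]]:
    """Cut the ordered ids into fixed-size chunks, one chunk per pipeline step."""
    return [order[i:i + segment_size] for i in range(0, len(order), segment_size)]


def prune(nodes: List[DraftNode], rejected_id: int) -> List[DraftNode]:
    """Drop a rejected node together with every descendant that depends on it."""
    children: Dict[Optional[int], List[int]] = {}
    for n in nodes:
        children.setdefault(n.parent, []).append(n.node_id)
    dead: Set[int] = set()
    stack = [rejected_id]
    while stack:
        cur = stack.pop()
        if cur in dead:
            continue
        dead.add(cur)
        stack.extend(children.get(cur, []))
    return [n for n in nodes if n.node_id not in dead]
```

Because a child is scheduled only after its parent, each segment can attend to its ancestors in earlier segments, which is one simple way to keep the causal relationship intact. Pruning removes whole subtrees, so the surviving nodes still form a valid tree that can be reordered with the same function.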

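Comments:	16 pages; the last 3 pages are appendix
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2507.02620 [cs.DC]
	(or arXiv:2507.02620v1 [cs.DC] for this version)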
	https://doi.org/10.48550/arXiv.2507.02620

Submission history

From: Xing Liu
[v1] Thu, 3 Jul 2025 13:47:42 UTC (1,106 KB)
