SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference

He, Yongchao; Zhao, Bohan; Cao, Zheng

arXiv:2506.22033v1 (cs)

[Submitted on 27 June 2025]

Authors: Yongchao He, Bohan Zhao, Zheng Cao

Abstract: As inference workloads for large language models (LLMs) scale to meet growing user demand, pipeline parallelism (PP) has become a widely adopted strategy for multi-GPU deployment, particularly in cross-node setups, to improve key-value (KV) cache capacity and inference throughput. However, PP suffers from inherent inefficiencies caused by three types of execution bubbles (load-imbalance, intra-stage, and inter-stage) that limit pipeline saturation. We present SiPipe, a heterogeneous pipeline design that improves throughput by leveraging underutilized CPU resources to offload auxiliary computation and communication. SiPipe incorporates three key techniques (CPU sampling, a token-safe execution model, and structure-aware transmission) to mitigate pipeline bubbles and improve execution efficiency. Across diverse LLMs, SiPipe achieves up to 2.1 times higher throughput, 43% lower per-token latency, and up to 23% higher average GPU utilization compared to the state-of-the-art vLLM under the same PP configuration, demonstrating its generality across LLMs and deployment scenarios.
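For readers unfamiliar with the sampling step the abstract refers to, the following is a minimal, generic sketch of token sampling performed on the CPU. It is not SiPipe's implementation, which the abstract does not describe. The function names `sample_token` and `sample_batch_on_cpu`, the temperature and top-k parameters, and the use of a thread pool are all assumptions chosen for illustration only.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def sample_token(logits, temperature=1.0, top_k=None, rng=None):
    """Pick one token id from a list of logits, entirely on the CPU.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = rng or random.Random()
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding (argmax).
        return max(range(len(logits)), key=lambda i: logits[i])
    # Rank token ids by logit, highest first, and optionally keep the top k.
    ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ids = ids[:top_k]
    # Numerically stable softmax weights over the surviving candidates.
    scaled = [logits[i] / temperature for i in ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(ids, weights=weights, k=1)[0]


def sample_batch_on_cpu(batch_logits, executor, **kwargs):
    """Sample one token per request using CPU worker threads.

    Assumed structure: each row of batch_logits belongs to one request.
    """
    futures = [executor.submit(sample_token, row, **kwargs) for row in batch_logits]
    return [f.result() for f in futures]


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(sample_batch_on_cpu([[0.1, 2.0], [3.0, 1.0]], pool, temperature=0))
```

The sketch only shows that sampling is ordinary CPU-friendly arithmetic over a logits vector. How SiPipe schedules this work relative to GPU stages is covered in the paper itself, not here.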

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2506.22033 [cs.DC]
	(or arXiv:2506.22033v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2506.22033

Submission history

From: Bohan Zhao
[v1] Fri, 27 Jun 2025 09:27:04 UTC (1,683 KB)
