Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Qin, Ruoyu; Li, Zheming; He, Weiran; Zhang, Mingxing; Wu, Yongwei; Zheng, Weimin; Xu, Xinran

arXiv:2407.00079v3 (cs)

[Submitted on 24 Jun 2024 (v1), revised 9 Jul 2024 (this version, v3), latest version 3 Sep 2025 (v4)]

Abstract: Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges due to highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.

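Comments:	23 pages, 13 figures
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Cite as:	arXiv:2407.00079 [cs.DC]
	(or arXiv:2407.00079v3 [cs.DC] for this version)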
	https://doi.org/10.48550/arXiv.2407.00079

Submission history

From: Ruoyu Qin
[v1] Mon, 24 Jun 2024 02:05:32 UTC (264 KB)
[v2] Tue, 2 Jul 2024 02:49:35 UTC (264 KB)
[v3] Tue, 9 Jul 2024 04:03:10 UTC (280 KB)
[v4] Wed, 3 Sep 2025 14:56:29 UTC (245 KB)
