
Computer Science > Databases

arXiv:2506.21901 (cs)
[Submitted on 27 Jun 2025]

Title: A Survey of LLM Inference Systems


Authors:James Pan, Guoliang Li
Abstract: The past few years has witnessed specialized large language model (LLM) inference systems, such as vLLM, SGLang, Mooncake, and DeepFlow, alongside rapid LLM adoption via services like ChatGPT. Driving these system design efforts is the unique autoregressive nature of LLM request processing, motivating new techniques for achieving high performance while preserving high inference quality over high-volume and high-velocity workloads. While many of these techniques are discussed across the literature, they have not been analyzed under the framework of a complete inference system, nor have the systems themselves been analyzed and compared. In this survey, we review these techniques, starting from operators and algorithms for request processing, then moving on to techniques for model optimization and execution, including kernel design, batching, and scheduling, before ending with techniques for memory management, including paged memory, eviction and offloading techniques, quantization, and cache persistence. Through these discussions, we show that these techniques fundamentally rely on load prediction, adaptive mechanisms, and cost reduction in order to overcome the challenges introduced by autoregressive generation and achieve the goals of the system. We then discuss how these techniques can be combined to form single-replica and multi-replica inference systems, including disaggregated inference systems that offer more control over resource allocation and serverless systems that can be deployed over shared hardware infrastructure. We end with a discussion of remaining challenges.
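The autoregressive request processing that motivates these systems can be illustrated with a toy prefill/decode loop. This is a minimal sketch, not any real model: `next_token` is a hypothetical stand-in for a forward pass, and the "KV cache" is just a list that grows by one entry per processed token, which is the property that makes batching and memory management hard.

```python
# Toy sketch of autoregressive decoding with a per-request KV cache.
# `next_token` is a hypothetical stand-in for one model forward pass.

def next_token(token_ids, kv_cache):
    """Append one fake K/V entry (one per processed token) and return a
    deterministic 'token' derived from the running state."""
    kv_cache.append(len(kv_cache))
    return (sum(token_ids) + len(kv_cache)) % 100

def generate(prompt_ids, max_new_tokens):
    kv_cache = []
    # Prefill: process the whole prompt once, filling the cache.
    for _ in prompt_ids:
        kv_cache.append(len(kv_cache))
    out = list(prompt_ids)
    # Decode: one token at a time; each step depends on all prior tokens,
    # so the cache grows linearly with the generated sequence.
    for _ in range(max_new_tokens):
        out.append(next_token(out, kv_cache))
    return out, len(kv_cache)
```

Because the cache grows with every decoded token and the output length is not known in advance, systems cannot statically reserve memory per request, which is what drives the load prediction and adaptive mechanisms the survey discusses.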
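The paged-memory technique mentioned in the abstract can likewise be sketched as a toy block-table allocator: each request maps its tokens onto fixed-size physical blocks allocated on demand, instead of reserving memory for a maximum sequence length up front. The class and method names below are hypothetical, and real systems use much larger block sizes.

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of paged KV-cache memory."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}    # request id -> list of physical block ids
        self.lengths = {}   # request id -> number of tokens stored

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # current block full: allocate a new one
            if not self.free:
                raise MemoryError("no free blocks: evict or offload a request")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        # Finished requests return their blocks to the free pool.
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

Allocating per block rather than per maximum sequence length reduces fragmentation, and the `MemoryError` branch is where the eviction and offloading policies covered in the survey would take over.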
Comments: 25 pages
Subjects: Databases (cs.DB)
Cite as: arXiv:2506.21901 [cs.DB]
  (or arXiv:2506.21901v1 [cs.DB] for this version)
  https://doi.org/10.48550/arXiv.2506.21901
arXiv-issued DOI via DataCite

Submission history

From: James Pan [view email]
[v1] Fri, 27 Jun 2025 04:38:20 UTC (2,622 KB)