
Computer Science > Databases

arXiv:2506.21901 (cs)
[Submitted on 27 Jun 2025]

Title: A Survey of LLM Inference Systems


Authors:James Pan, Guoliang Li
Abstract: The past few years has witnessed specialized large language model (LLM) inference systems, such as vLLM, SGLang, Mooncake, and DeepFlow, alongside rapid LLM adoption via services like ChatGPT. Driving these system design efforts is the unique autoregressive nature of LLM request processing, motivating new techniques for achieving high performance while preserving high inference quality over high-volume and high-velocity workloads. While many of these techniques are discussed across the literature, they have not been analyzed under the framework of a complete inference system, nor have the systems themselves been analyzed and compared. In this survey, we review these techniques, starting from operators and algorithms for request processing, then moving on to techniques for model optimization and execution, including kernel design, batching, and scheduling, before ending with techniques for memory management, including paged memory, eviction and offloading techniques, quantization, and cache persistence. Through these discussions, we show that these techniques fundamentally rely on load prediction, adaptive mechanisms, and cost reduction in order to overcome the challenges introduced by autoregressive generation and achieve the goals of the system. We then discuss how these techniques can be combined to form single-replica and multi-replica inference systems, including disaggregated inference systems that offer more control over resource allocation and serverless systems that can be deployed over shared hardware infrastructure. We end with a discussion of remaining challenges.
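The autoregressive request processing that motivates these systems can be illustrated with a toy prefill/decode loop. This is a minimal sketch, not any real model: `next_token` is a hypothetical stand-in for a forward pass, and the "KV cache" is just a list that grows by one entry per processed token, which is the property that makes batching and memory management hard.

```python
# Toy sketch of autoregressive decoding with a per-request KV cache.
# `next_token` is a hypothetical stand-in for one model forward pass.

def next_token(token_ids, kv_cache):
    """Append one fake K/V entry (one per processed token) and return a
    deterministic 'token' derived from the running state."""
    kv_cache.append(len(kv_cache))
    return (sum(token_ids) + len(kv_cache)) % 100

def generate(prompt_ids, max_new_tokens):
    kv_cache = []
    # Prefill: process the whole prompt once, filling the cache.
    for _ in prompt_ids:
        kv_cache.append(len(kv_cache))
    out = list(prompt_ids)
    # Decode: one token at a time; each step depends on all prior tokens,
    # so the cache grows linearly with the generated sequence.
    for _ in range(max_new_tokens):
        out.append(next_token(out, kv_cache))
    return out, len(kv_cache)
```

Because the cache grows with every decoded token and the output length is not known in advance, systems cannot statically reserve memory per request, which is what drives the load prediction and adaptive mechanisms the survey discusses.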
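The paged-memory technique mentioned in the abstract can likewise be sketched as a toy block-table allocator: each request maps its tokens onto fixed-size physical blocks allocated on demand, instead of reserving memory for a maximum sequence length up front. The class and method names below are hypothetical, and real systems use much larger block sizes.

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of paged KV-cache memory."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}    # request id -> list of physical block ids
        self.lengths = {}   # request id -> number of tokens stored

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # current block full: allocate a new one
            if not self.free:
                raise MemoryError("no free blocks: evict or offload a request")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        # Finished requests return their blocks to the free pool.
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

Allocating per block rather than per maximum sequence length reduces fragmentation, and the `MemoryError` branch is where the eviction and offloading policies covered in the survey would take over.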
Comments: 25 pages
Subjects: Databases (cs.DB)
Cite as: arXiv:2506.21901 [cs.DB]
  (or arXiv:2506.21901v1 [cs.DB] for this version)
  https://doi.org/10.48550/arXiv.2506.21901
arXiv-issued DOI via DataCite

Submission history

From: James Pan [view email]
[v1] Fri, 27 Jun 2025 04:38:20 UTC (2,622 KB)