Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving

Agullo, Ferran; Oliveras, Joan; Wang, Chen; Gutierrez-Torre, Alberto; Tardieu, Olivier; Youssef, Alaa; Torres, Jordi; Berral, Josep Ll.

Computer Science > Performance

arXiv:2508.08343v1 (cs)

[Submitted on 11 Aug 2025]

Title: Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving

Authors: Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral

Abstract: Serving LLM adapters has gained significant attention as an effective approach to adapt general-purpose language models to diverse, task-specific use cases. However, serving a wide range of adapters introduces several substantial overheads, leading to performance degradation and making optimal placement challenging. To address these challenges, we present an analytical, AI-driven pipeline that accurately determines the optimal allocation of adapters in single-node setups. This allocation maximizes performance and uses GPU resources effectively while preventing request starvation. Crucially, the proposed allocation is derived from current workload patterns. These single-node insights can be leveraged in multi-replica deployments for overall placement, load balancing, and server configuration, ultimately improving performance and resource efficiency. Our approach builds on an in-depth analysis of LLM adapter serving, accounting for overheads and performance variability, and includes the development of the first Digital Twin capable of replicating online LLM-adapter serving systems with matching key performance metrics. The experimental results demonstrate that the Digital Twin achieves a SMAPE difference of no more than 5.5% in throughput compared to real results, and that the proposed pipeline accurately predicts the optimal placement with minimal latency.

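Comments:	Under review at a computer science conference
Subjects:	Performance (cs.PF); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2508.08343 [cs.PF]
	(or arXiv:2508.08343v1 [cs.PF] for this version)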
	https://doi.org/10.48550/arXiv.2508.08343

Submission history

From: Ferran Agullo
[v1] Mon, 11 Aug 2025 10:47:35 UTC (251 KB)
