EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices

Shen, Zheyu; He, Yexiao; Wang, Ziyao; Zhang, Yuning; Sun, Guoheng; Ye, Wanghao; Li, Ang

doi:10.1145/3711875.3729141

arXiv:2507.01438 (cs)

[Submitted on 2 Jul 2025]

Title: EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices

Authors: Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, Ang Li

Abstract: Large Language Models (LLMs) have gained significant attention due to their versatility across a wide array of applications. Fine-tuning LLMs with parameter-efficient adapters, such as Low-Rank Adaptation (LoRA), enables these models to efficiently adapt to downstream tasks without extensive retraining. Deploying fine-tuned LLMs on multi-tenant edge devices offers substantial benefits, such as reduced latency, enhanced privacy, and personalized responses. However, serving LLMs efficiently on resource-constrained edge devices presents critical challenges, including the complexity of adapter selection for different tasks and memory overhead from frequent adapter swapping. Moreover, given the multiple requests in multi-tenant settings, processing requests sequentially results in underutilization of computational resources and increased latency. This paper introduces EdgeLoRA, an efficient system for serving LLMs on edge devices in multi-tenant environments. EdgeLoRA incorporates three key innovations: (1) an adaptive adapter selection mechanism to streamline the adapter configuration process; (2) heterogeneous memory management, leveraging intelligent adapter caching and pooling to mitigate memory operation overhead; and (3) batch LoRA inference, enabling efficient batch processing to significantly reduce computational latency. Comprehensive evaluations using the Llama3.1-8B model demonstrate that EdgeLoRA significantly outperforms the status quo (i.e., llama.cpp) in terms of both latency and throughput. The results demonstrate that EdgeLoRA can achieve up to a 4 times boost in throughput. Even more impressively, it can serve several orders of magnitude more adapters simultaneously. These results highlight EdgeLoRA's potential to transform edge deployment of LLMs in multi-tenant scenarios, offering a scalable and efficient solution for resource-constrained environments.
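
The abstract names adapter caching and pooling but does not say how adapters are evicted or how the pool is laid out. As a rough illustration only, the sketch below assumes least-recently-used (LRU) eviction over a fixed number of pre-allocated slots. The class `AdapterCache`, the `load_adapter` callback, and the adapter names are hypothetical and are not taken from EdgeLoRA.

```python
from collections import OrderedDict
from typing import Callable, List


class AdapterCache:
    """LRU cache over a fixed pool of adapter slots (hypothetical sketch)."""

    def __init__(self, num_slots: int, load_adapter: Callable[[str], object]):
        if num_slots < 1:
            raise ValueError("num_slots must be at least 1")
        # Hypothetical slow path, e.g. reading adapter weights from storage.
        self._load_adapter = load_adapter
        # Pool of reusable slot indices, so slots are recycled rather than reallocated.
        self._free_slots: List[int] = list(range(num_slots))
        # Resident adapter weights, indexed by slot.
        self._slots: List[object] = [None] * num_slots
        # Maps adapter id -> slot index, ordered from least to most recently used.
        self._resident: "OrderedDict[str, int]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, adapter_id: str) -> object:
        """Return the weights for adapter_id, loading and evicting if needed."""
        if adapter_id in self._resident:
            self.hits += 1
            self._resident.move_to_end(adapter_id)  # mark as most recently used
            return self._slots[self._resident[adapter_id]]

        self.misses += 1
        if self._free_slots:
            slot = self._free_slots.pop()
        else:
            # Pool exhausted: evict the least recently used adapter, reuse its slot.
            _, slot = self._resident.popitem(last=False)
        self._slots[slot] = self._load_adapter(adapter_id)
        self._resident[adapter_id] = slot
        return self._slots[slot]

    def resident_ids(self) -> List[str]:
        """Adapter ids currently in memory, least recently used first."""
        return list(self._resident)


if __name__ == "__main__":
    # Two slots, four requests: "code" is evicted when "chat" arrives.
    cache = AdapterCache(num_slots=2, load_adapter=lambda name: "weights:" + name)
    for requested in ["math", "code", "math", "chat"]:
        cache.get(requested)
    print(cache.resident_ids(), cache.hits, cache.misses)
```

With a fixed slot pool, a request for a resident adapter skips the load entirely, and a miss overwrites an existing slot instead of allocating new memory. The abstract attributes this kind of saving to caching and pooling. A real system would hold weight tensors rather than strings in each slot.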

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2507.01438 [cs.DC]
	(or arXiv:2507.01438v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2507.01438
Related DOI:	https://doi.org/10.1145/3711875.3729141

Submission history

From: Zheyu Shen
[v1] Wed, 2 Jul 2025 07:47:28 UTC (1,010 KB)
