Efficient Deployment of Large Language Models on Resource-constrained Devices

Yao, Zhiwei; Xu, Yang; Xu, Hongli; Liao, Yunming; Xie, Zuan

arXiv:2501.02438 (cs)

[Submitted on 5 January 2025]

Title: Efficient Deployment of Large Language Models on Resource-constrained Devices

Authors: Zhiwei Yao, Yang Xu, Hongli Xu, Yunming Liao, Zuan Xie

Abstract: Deploying Large Language Models (LLMs) on resource-constrained (or weak) devices presents significant challenges due to limited resources and heterogeneous data distribution. To address the data concern, it is necessary to fine-tune LLMs using on-device private data for various downstream tasks. While Federated Learning (FL) offers a promising privacy-preserving solution, existing fine-tuning methods retain the original LLM size, leaving the issues of high inference latency and excessive memory demands unresolved. Hence, we design FedSpine, an FL framework that combines Parameter-Efficient Fine-Tuning (PEFT) with structured pruning for efficient deployment of LLMs on resource-constrained devices. Specifically, FedSpine introduces an iterative process to prune and tune the parameters of LLMs. To mitigate the impact of device heterogeneity, an online Multi-Armed Bandit (MAB) algorithm is employed to adaptively determine different pruning ratios and LoRA ranks for heterogeneous devices without any prior knowledge of their computing and communication capabilities. As a result, FedSpine maintains higher inference accuracy while improving fine-tuning efficiency. Experiments conducted on a physical platform with 80 devices demonstrate that FedSpine can speed up fine-tuning by 1.4$\times$-6.9$\times$ and improve final accuracy by 0.4%-4.5% under the same sparsity level compared to other baselines.
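
The abstract does not specify which bandit algorithm FedSpine uses, so the following is only a minimal sketch of the general idea of choosing a (pruning ratio, LoRA rank) pair online without prior knowledge of a device. It substitutes plain UCB1 for the paper's MAB algorithm. The arm values, the `simulated_reward` function, and the "best" configuration are all hypothetical and exist purely to make the example runnable.

```python
import math
import random
from itertools import product

# Hypothetical arm set: each (pruning ratio, LoRA rank) pair is one arm.
PRUNING_RATIOS = [0.2, 0.4, 0.6]
LORA_RANKS = [4, 8, 16]
ARMS = list(product(PRUNING_RATIOS, LORA_RANKS))


class UCB1:
    """Plain UCB1 bandit; a stand-in, not the algorithm from the paper."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # number of times each arm was played
        self.means = [0.0] * n_arms  # running mean reward of each arm
        self.total = 0               # total number of plays so far

    def select(self):
        # Play every arm once before relying on confidence bounds.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        scores = [
            mean + math.sqrt(2.0 * math.log(self.total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total += 1
        # Incremental update of the running mean.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulated_reward(arm, best_arm, rng):
    """Toy reward in [0, 1]; assumed, not measured on any real device."""
    base = 0.8 if arm == best_arm else 0.3
    return min(1.0, max(0.0, base + rng.gauss(0.0, 0.1)))


def run(rounds=2000, seed=0):
    rng = random.Random(seed)
    best_arm = ARMS.index((0.4, 8))  # hidden from the bandit
    bandit = UCB1(len(ARMS))
    for _ in range(rounds):
        arm = bandit.select()
        bandit.update(arm, simulated_reward(arm, best_arm, rng))
    most_played = max(range(len(ARMS)), key=bandit.counts.__getitem__)
    return ARMS[most_played], bandit.counts


if __name__ == "__main__":
    config, counts = run()
    print("most-played (pruning ratio, LoRA rank):", config)
```

In a real federated setting the reward would presumably come from observed per-round signals such as training time or accuracy change on each device; here it is a fixed toy signal, so the sketch only illustrates how exploration narrows onto one configuration.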

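Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2501.02438 [cs.LG]
	(or arXiv:2501.02438v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2501.02438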

Submission history

From: Zhiwei Yao
[v1] Sun, 5 Jan 2025 04:38:11 UTC (2,840 KB)
