SLIM: A Heterogeneous Accelerator for Edge Inference of Sparse Large Language Model via Adaptive Thresholding

Xu, Weihong; Choi, Haein; Hsu, Po-kai; Yu, Shimeng; Rosing, Tajana


arXiv:2507.09201 (cs)

[Submitted on 12 Jul 2025]


Authors: Weihong Xu, Haein Choi, Po-kai Hsu, Shimeng Yu, Tajana Rosing

Abstract: Large language models (LLMs) have demonstrated exceptional proficiency in understanding and generating human language, but efficient inference on resource-constrained embedded devices remains challenging due to large model sizes and memory-intensive operations in feedforward network (FFN) and multi-head attention (MHA) layers. While existing accelerators offload LLM inference to expensive heterogeneous computing systems, they fail to exploit the significant sparsity inherent in LLM operations, leaving hardware resources underutilized. We propose SLIM, an algorithm-hardware co-design optimized for sparse LLM serving on edge devices. SLIM exploits LLM sparsity through an adaptive thresholding algorithm that enables runtime-configurable sparsity with negligible accuracy loss, fetching only activated neurons to dramatically reduce data movement. Our heterogeneous hardware architecture strategically combines near-storage processing (NSP) and processing-in-memory (PIM): FFN weights are stored in high-density 3D NAND and computed using NSP units, while memory-intensive MHA operations are processed in PIM modules. This design significantly reduces memory footprint, data movement, and energy consumption. Our comprehensive evaluation demonstrates SLIM's effectiveness, achieving 13-18x throughput improvements over SSD-GPU systems and 9-10x better energy efficiency over DRAM-GPU systems while maintaining low latency, making cost-effective LLM deployment viable for edge computing environments.
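The abstract does not spell out the adaptive thresholding algorithm, so the following is only a minimal sketch of the general idea it names: choose a threshold at runtime to hit a requested sparsity level, then touch only the FFN weight rows that belong to activated neurons. Everything here is an assumption made for illustration, including the ReLU activation, the quantile-style threshold rule, and the hypothetical helper names `up_project`, `pick_threshold`, and `sparse_ffn`; none of it should be read as SLIM's actual method.

```python
import random

def up_project(x, w_up):
    """Hidden activations h[j] = relu(sum_i x[i] * w_up[i][j]) (ReLU is an assumption)."""
    n_hidden = len(w_up[0])
    return [max(0.0, sum(x[i] * w_up[i][j] for i in range(len(x))))
            for j in range(n_hidden)]

def pick_threshold(h, target_sparsity):
    """Magnitude below which roughly target_sparsity of the neurons fall (illustrative rule)."""
    ranked = sorted(abs(v) for v in h)
    k = min(int(target_sparsity * len(h)), len(h) - 1)
    return ranked[k]

def sparse_ffn(x, w_up, w_down, target_sparsity):
    """Down-project using only the rows of w_down whose neuron passes the threshold."""
    h = up_project(x, w_up)
    t = pick_threshold(h, target_sparsity)
    active = [j for j, v in enumerate(h) if v != 0.0 and abs(v) >= t]
    d_out = len(w_down[0])
    y = [0.0] * d_out
    for j in active:  # only these weight rows would need to be fetched
        row = w_down[j]
        for c in range(d_out):
            y[c] += h[j] * row[c]
    return y, active

if __name__ == "__main__":
    rng = random.Random(0)
    d_model, n_hidden = 8, 16
    x = [rng.uniform(-1, 1) for _ in range(d_model)]
    w_up = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(d_model)]
    w_down = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_hidden)]
    _, active = sparse_ffn(x, w_up, w_down, target_sparsity=0.75)
    print("rows of w_down touched:", len(active), "of", n_hidden)
```

In this toy version the sparsity target is a plain function argument, which mirrors the "runtime-configurable" property only loosely. A real design would also need to decide which neurons are active without first computing the full up-projection, a step this sketch deliberately leaves out.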

Subjects:	Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2507.09201 [cs.AR]
	(or arXiv:2507.09201v1 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2507.09201

Submission history

From: Weihong Xu
[v1] Sat, 12 Jul 2025 08:44:38 UTC (703 KB)
