Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving

Kim, Wonung; Lee, Yubin; Kim, Yoonsung; Hwang, Jinwoo; Oh, Seongryong; Jung, Jiyong; Huseynov, Aziz; Park, Woong Gyu; Park, Chang Hyun; Mahajan, Divya; Park, Jongse

arXiv:2507.10178 (cs)

[Submitted on 14 Jul 2025]

Authors: Wonung Kim, Yubin Lee, Yoonsung Kim, Jinwoo Hwang, Seongryong Oh, Jiyong Jung, Aziz Huseynov, Woong Gyu Park, Chang Hyun Park, Divya Mahajan, Jongse Park

Abstract: Transformers are the driving force behind today's Large Language Models (LLMs), serving as the foundation for their performance and versatility. Yet, their compute and memory costs grow with sequence length, posing scalability challenges for long-context inferencing. In response, the algorithm community is exploring alternative architectures, such as state space models (SSMs), linear attention, and recurrent neural networks (RNNs), which we refer to as post-transformers. This shift presents a key challenge: building a serving system that efficiently supports both transformer and post-transformer LLMs within a unified framework. To address this challenge, we analyze the performance characteristics of transformer and post-transformer LLMs. Despite their algorithmic differences, both are fundamentally limited by memory bandwidth under batched inference due to attention in transformers and state updates in post-transformers. Further analyses suggest two additional insights: (1) state update operations, unlike attention, incur high hardware cost, making per-bank PIM acceleration inefficient, and (2) different low-precision arithmetic methods offer varying accuracy-area tradeoffs, while we identify Microsoft's MX as the Pareto-optimal choice. Building on these insights, we design Pimba as an array of State-update Processing Units (SPUs), each shared between two banks to enable interleaved access to PIM. Each SPU includes a State-update Processing Engine (SPE) that comprises element-wise multipliers and adders using MX-based quantized arithmetic, enabling efficient execution of state update and attention operations. Our evaluation shows that, compared to LLM-optimized GPU and GPU+PIM systems, Pimba achieves up to 3.2x and 2.1x higher token generation throughput, respectively.
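
To make the "state update" in the abstract concrete, here is a minimal sketch in plain Python. It assumes a generic gated linear recurrence of the form S[i][j] = decay[i] * S[i][j] + k[i] * v[j], which is a common shape for linear-attention and SSM-style layers; the abstract does not give the exact formula, so the function names (`state_update`, `read_out`, `flops_per_byte`), the per-row decay, and the 2-byte element size are all illustrative assumptions rather than the paper's design. The sketch only shows why such an update reduces to element-wise multiplies and adds, and why it performs few operations per byte of state moved.

```python
# Illustrative sketch only: a generic gated linear-recurrence state update.
# The exact operation and dataflow used by Pimba are not given in the abstract.

def state_update(state, decay, k, v):
    """Return decay[i] * state[i][j] + k[i] * v[j] for every state entry.

    state: d_k x d_v list of lists (the per-request recurrent state)
    decay: length-d_k list of per-row decay factors (assumed gating form)
    k, v:  length-d_k and length-d_v vectors for the current token
    """
    return [
        [decay[i] * state[i][j] + k[i] * v[j] for j in range(len(v))]
        for i in range(len(k))
    ]


def read_out(state, q):
    """Compute y[j] = sum_i q[i] * state[i][j], the per-token output."""
    d_v = len(state[0])
    return [sum(q[i] * state[i][j] for i in range(len(q))) for j in range(d_v)]


def flops_per_byte(d_k, d_v, bytes_per_element=2):
    """Rough arithmetic intensity of one state update step.

    Each state entry is read once and written once, and sees about three
    arithmetic operations (two multiplies and one add).
    """
    ops = 3 * d_k * d_v
    traffic = 2 * d_k * d_v * bytes_per_element
    return ops / traffic


# One decoding step for a single request with a 2 x 2 state.
state = [[0.0, 0.0], [0.0, 0.0]]
state = state_update(state, decay=[0.5, 0.5], k=[1.0, 2.0], v=[3.0, 4.0])
y = read_out(state, q=[1.0, 1.0])
```

Under these assumptions the update performs well under one operation per byte of state traffic, and each request in a batch carries its own state, so batching does not add reuse. That is consistent with the abstract's observation that state updates are limited by memory bandwidth.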

Subjects:	Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Cite as:	arXiv:2507.10178 [cs.AR]
	(or arXiv:2507.10178v1 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2507.10178

Submission history

From: Wonung Kim
[v1] Mon, 14 Jul 2025 11:40:17 UTC (521 KB)
