The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts

Yun, Sungmin; Park, Seonyong; Nam, Hwayong; Lee, Younjoo; Lee, Gunjun; Kyung, Kwanhee; Kim, Sangpyo; Kim, Nam Sung; Kim, Jongmin; Kim, Hyungyo; Cho, Juhwan; Baek, Seungmin; Ahn, Jung Ho

Computer Science > Hardware Architecture

arXiv:2507.15465 (cs)

[Submitted on 21 Jul 2025]

Title: The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts

Authors: Sungmin Yun, Seonyong Park, Hwayong Nam, Younjoo Lee, Gunjun Lee, Kwanhee Kyung, Sangpyo Kim, Nam Sung Kim, Jongmin Kim, Hyungyo Kim, Juhwan Cho, Seungmin Baek, Jung Ho Ahn

Abstract: Computational workloads composing traditional Transformer models are starkly bifurcated. Multi-Head Attention (MHA) is memory-bound, with low arithmetic intensity, while feedforward layers are compute-bound. This dichotomy has long motivated research into specialized hardware to mitigate the MHA bottleneck. This paper argues that recent architectural shifts, namely Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE), challenge the premise of specialized attention hardware. We make two key observations. First, the arithmetic intensity of MLA is over two orders of magnitude greater than that of MHA, shifting it close to a compute-bound regime well-suited for modern accelerators like GPUs. Second, by distributing MoE experts across a pool of accelerators, their arithmetic intensity can be tuned through batching to match that of the dense layers, creating a more balanced computational profile. These findings reveal a diminishing need for specialized attention hardware. The central challenge for next-generation Transformers is no longer accelerating a single memory-bound layer. Instead, the focus must shift to designing balanced systems with sufficient compute, memory capacity, memory bandwidth, and high-bandwidth interconnects to manage the diverse demands of large-scale models.
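The second observation in the abstract, that batching can tune the arithmetic intensity of MoE experts, can be illustrated with a rough back-of-the-envelope sketch. The code below is not taken from the paper. It assumes a single expert weight matrix applied as a plain matrix multiply, 2-byte elements, perfectly balanced routing, and no cache reuse beyond one pass over weights and activations. The function names and the dimensions, expert count, and top-k value are hypothetical, chosen only for illustration.

```python
def gemm_arithmetic_intensity(tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) matmul.

    Assumption: weights, inputs, and outputs each cross the memory
    interface exactly once.
    """
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per MAC
    bytes_moved = bytes_per_elem * (d_in * d_out + tokens * (d_in + d_out))
    return flops / bytes_moved


def tokens_per_expert(batch_tokens, num_experts, top_k):
    """Average tokens routed to each expert, assuming balanced routing."""
    return batch_tokens * top_k / num_experts


if __name__ == "__main__":
    # Hypothetical sizes, not values reported in the paper.
    d_in, d_out = 4096, 4096
    num_experts, top_k = 64, 4
    for batch in (64, 1024, 16384):
        t = tokens_per_expert(batch, num_experts, top_k)
        ai = gemm_arithmetic_intensity(t, d_in, d_out)
        print(f"batch={batch:6d}  tokens/expert={t:8.1f}  FLOPs/byte={ai:8.1f}")
```

Under these assumptions the intensity of an expert grows roughly linearly with the number of tokens routed to it while that number is small relative to the matrix dimensions, which is why a larger aggregate batch spread over a pool of accelerators moves each expert toward the compute-bound regime.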

Comments:	15 pages, 11 figures
Subjects:	Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2507.15465 [cs.AR]
	(or arXiv:2507.15465v1 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2507.15465

Submission history

From: Jung Ho Ahn
[v1] Mon, 21 Jul 2025 10:18:33 UTC (1,282 KB)
