Characterizing State Space Model (SSM) and SSM-Transformer Hybrid Language Model Performance with Long Context Length

Mitra, Saptarshi; Karami, Rachid; Xu, Haocheng; Huang, Sitao; Kwon, Hyoukjun

Computer Science > Hardware Architecture

arXiv:2507.12442 (cs)

[Submitted on 16 Jul 2025]

Title: Characterizing State Space Model (SSM) and SSM-Transformer Hybrid Language Model Performance with Long Context Length

Authors: Saptarshi Mitra, Rachid Karami, Haocheng Xu, Sitao Huang, Hyoukjun Kwon

Abstract: The demand for machine intelligence capable of processing continuous, long-context inputs on local devices is growing rapidly. However, the quadratic complexity and memory requirements of traditional Transformer architectures make them inefficient and often unusable for these tasks. This has spurred a paradigm shift towards new architectures like State Space Models (SSMs) and hybrids, which promise near-linear scaling. While most current research focuses on the accuracy and theoretical throughput of these models, a systematic performance characterization on practical consumer hardware is critically needed to guide system-level optimization and unlock new applications. To address this gap, we present a comprehensive, comparative benchmarking of carefully selected Transformer, SSM, and hybrid models specifically for long-context inference on consumer and embedded GPUs. Our analysis reveals that SSMs are not only viable but superior for this domain, capable of processing sequences up to 220K tokens on a 24GB consumer GPU, approximately 4x longer than comparable Transformers. While Transformers may be up to 1.8x faster at short sequences, SSMs demonstrate a dramatic performance inversion, becoming up to 4x faster at very long contexts (~57K tokens). Our operator-level analysis reveals that custom, hardware-aware SSM kernels dominate the inference runtime, accounting for over 55% of latency on edge platforms, identifying them as a primary target for future hardware acceleration. We also provide detailed, device-specific characterization results to guide system co-design for the edge. To foster further research, we will open-source our characterization framework.
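
The memory side of the scaling argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It compares how a Transformer's key/value cache grows with sequence length against a fixed-size recurrent SSM state. The function names and every model dimension below (layer count, head sizes, state size, 2-byte elements) are hypothetical placeholders chosen for illustration, not the configurations or measurements used in the paper, and the estimate ignores weights, activations, and other runtime buffers.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: one key and one value tensor per layer,
    each holding seq_len x n_kv_heads x head_dim elements."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    """Approximate recurrent-state size: one d_inner x d_state state per layer.
    Note that seq_len does not appear, so the state does not grow with context."""
    return n_layers * d_inner * d_state * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model dimensions (assumptions, not taken from the paper).
    layers, kv_heads, head_dim = 32, 8, 128
    d_inner, d_state = 4096, 16

    ssm_mib = ssm_state_bytes(layers, d_inner, d_state) / 2**20
    for seq_len in (1_024, 16_384, 65_536, 220_000):
        kv_mib = kv_cache_bytes(seq_len, layers, kv_heads, head_dim) / 2**20
        print(f"{seq_len:>8} tokens: KV cache ~{kv_mib:,.0f} MiB, "
              f"SSM state ~{ssm_mib:,.0f} MiB")
```

Under these assumptions the cache grows linearly with the number of tokens while the SSM state stays constant, which is one plausible reason a fixed memory budget admits much longer contexts for SSMs. The actual limits reported above depend on the specific models and devices the authors measured.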

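Comments:	12 pages, 7 figures
Subjects:	Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Cite as:	arXiv:2507.12442 [cs.AR]
	(or arXiv:2507.12442v1 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2507.12442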

Submission history

From: Saptarshi Mitra
[v1] Wed, 16 Jul 2025 17:28:40 UTC (9,235 KB)
