A Flexible Instruction Set Architecture for Efficient GEMMs

Santana, Alexandre de Limas; Armejach, Adrià; Martinez, Francesc; Focht, Erich; Casas, Marc

arXiv:2507.03522 (cs)

[Submitted on 4 Jul 2025]

Title: A Flexible Instruction Set Architecture for Efficient GEMMs

Authors:Alexandre de Limas Santana, Adrià Armejach, Francesc Martinez, Erich Focht, Marc Casas

Abstract: GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple Data (SIMD) or vector Instruction Set Architectures (ISAs). Since these ISAs face significant issues when running GEMM workloads, particularly when dealing with small, tall, or skinny matrices, matrix ISAs have been proposed and implemented by major hardware vendors in recent years. Although these matrix ISAs deliver higher throughput when running GEMMs than their SIMD/vector counterparts, they are rigid solutions unable to dynamically adapt themselves to application-specific aspects like the data format. This paper demonstrates that the state-of-the-art matrix ISAs deliver suboptimal performance when running the most commonly used convolution and transformer models. This paper proposes the Matrix Tile Extension (MTE), the first matrix ISA that completely decouples the instruction set architecture from the microarchitecture and seamlessly interacts with existing vector ISAs. MTE incurs minimal implementation overhead since it only requires a few additional instructions and a 64-bit Control Status Register (CSR) to keep its state. Specifically, MTE can i) vectorize GEMMs across the three dimensions M, N, and K; ii) leverage the capacity of the existing vector register file; and iii) decouple the tile shape from the underlying microarchitecture. MTE achieves speed-ups of 1.35x over the best state-of-the-art matrix ISA.
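
To make the idea of tiling a GEMM across M, N, and K more concrete, the following is a minimal software sketch in plain Python. It is not MTE and does not model its instructions or CSR, which the abstract does not specify. The function names `tiled_gemm` and `naive_gemm` and the parameters `tile_m`, `tile_n`, and `tile_k` are hypothetical, chosen only for illustration. The sketch shows that when the tile shape is a runtime parameter rather than a fixed property of the hardware, the same loop nest handles tall or skinny matrices by clipping edge tiles.

```python
def tiled_gemm(A, B, tile_m, tile_n, tile_k):
    """Compute C = A @ B by walking tiles along M, N, and K.

    A is M x K and B is K x N, both given as lists of rows.
    The tile shape is a runtime argument (an assumption of this sketch).
    """
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile_m):          # tiles along M
        for j0 in range(0, N, tile_n):      # tiles along N
            for k0 in range(0, K, tile_k):  # tiles along K (reduction)
                # Edge tiles are clipped to the matrix bounds.
                for i in range(i0, min(i0 + tile_m, M)):
                    for j in range(j0, min(j0 + tile_n, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile_k, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


def naive_gemm(A, B):
    """Reference result: one dot product per output element."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


# A tall, skinny problem: M = 5, K = 3, N = 2.
A = [[float(i + k) for k in range(3)] for i in range(5)]
B = [[float(2 * k + j) for j in range(2)] for k in range(3)]
```

Calling `tiled_gemm(A, B, 2, 2, 2)` and `tiled_gemm(A, B, 4, 1, 3)` should both match `naive_gemm(A, B)`. Only the traversal order changes with the tile shape, not the result.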

Subjects:	Hardware Architecture (cs.AR); Machine Learning (cs.LG)
ACM classes:	C.1.0
Cite as:	arXiv:2507.03522 [cs.AR]
	(or arXiv:2507.03522v1 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2507.03522

Submission history

From: Alexandre Limas Santana
[v1] Fri, 4 Jul 2025 12:17:00 UTC (295 KB)
