Segmented Operations using Matrix Multiplications

Sobczyk, Aleksandros; Sorrentino, Giuseppe; Zouzias, Anastasios

计算机科学 > 数据结构与算法

arXiv:2506.23906 (cs)

[提交于 2025年6月30日 ]

标题：使用矩阵乘法的分段操作

标题： Segmented Operations using Matrix Multiplications

Authors:Aleksandros Sobczyk, Giuseppe Sorrentino, Anastasios Zouzias

摘要：专门执行小矩阵乘法作为基本操作的计算单元通常存在于现代加速器中。然而，这些单元在许多基本操作中除了密集矩阵乘法之外常常被低估使用。由于缺乏一个能捕捉其特性的严格理论计算模型，目前对这类架构的算法分析处于停滞状态。在本工作中，我们提出了 MMV-RAM，一种针对矩阵乘法加速器的计算模型。 MMV-RAM谨慎地扩展了 Vector-RAM 模型，增加了一个额外的处理单元，该单元可以在单个并行步骤中对大小为 $n\times s$ 和 $s\times s$ 的两个矩阵进行相乘，其中 $s$ 是一个模型参数。我们提供了该模型的详细理论分析，并在矩阵和向量单元之间仔细平衡计算能力，这是由奇偶性不在 AC[0] 中的电路复杂度下限所指导的。在 MMV-RAM 中，我们研究了分段扫描和求和的算法，这两个是基本的并行原语。我们提出了一种分段扫描算法，该算法使用矩阵乘法来执行推测性块扫描计算，运行时间为 $O(\log_s(n))$ 步。相比之下，我们证明任何仅使用 MMV-RAM 向量单元的算法都需要 $\Omega\left(\frac{\log_2(n)}{\log_2\log_2(n)}\right)$ 步。我们进一步应用这些技术，以获得元素级向量乘法和矩阵乘法的类似理论加速效果。除了最坏情况下的复杂度分析外，我们提出了可用于高效且实际实现的分段操作算法。例如，我们观察到分段求和是三个基本并行原语的组合：扫描、压缩和向量微分。作为案例研究，我们实现了...

摘要： Specialized computational units that perform small matrix multiplications as primitive operations are typically present in modern accelerators. However, these units are often underutilized for many fundamental operations besides dense matrix multiplications. The analysis of algorithms for such architectures is currently stagnated due to the lack of a rigorous theoretical model of computation that captures their characteristics. In this work, we propose MMV-RAM, a computational model tailored to matrix multiplication accelerators. MMV-RAM judiciously extends the Vector-RAM model with an additional processing unit that multiplies two matrices of sizes $n\times s$ and $s\times s$ in a single parallel step, where $s$ is a model parameter. We provide a detailed theoretical analysis of the model, and carefully balance the computational power between the matrix and vector units, guided by the circuit complexity lower bound that parity is not in AC[0]. In MMV-RAM, we study algorithms for segmented scan and sum, two fundamental parallel primitives. We propose a segmented scan algorithm that uses matrix multiplications to perform speculative block-scan computations, which runs in $O(\log_s(n))$ steps. In contrast, we show that any algorithm that uses only the vector unit of MMV-RAM requires $\Omega\left(\frac{\log_2(n)}{\log_2\log_2(n)}\right)$ steps. We further apply these techniques to obtain similar theoretical speedups for element-wise vector multiplication and matrix multiplication. Beyond the worst-case complexity analysis, we propose algorithms for segmented operations that could lead to highly efficient and pragmatic implementations. For example, we observe that segmented sum is a combination of three elementary parallel primitives: scan, compress, and vector differentiation. As a case study, we implement...

主题：	数据结构与算法 (cs.DS) ; 计算复杂性 (cs.CC); 分布式、并行与集群计算 (cs.DC)
引用方式：	arXiv:2506.23906 [cs.DS]
	(或者 arXiv:2506.23906v1 [cs.DS] 对于此版本)
	https://doi.org/10.48550/arXiv.2506.23906

提交历史

来自： Aleksandros Sobczyk [查看电子邮件]
[v1] 星期一， 2025 年 6 月 30 日 14:36:44 UTC (504 KB)

计算机科学 > 数据结构与算法

标题：使用矩阵乘法的分段操作

标题： Segmented Operations using Matrix Multiplications

提交历史

获取论文：

参考文献与引用

收藏

文献和引用工具

与本文相关的代码，数据和媒体

演示

推荐器和搜索工具

arXivLabs：与社区合作伙伴的实验项目

计算机科学 > 数据结构与算法

标题： 使用矩阵乘法的分段操作 显示英文标题

标题： Segmented Operations using Matrix Multiplications

提交历史

获取论文：

参考文献与引用

BibTeX 格式的引用

收藏

文献和引用工具

与本文相关的代码，数据和媒体

演示

推荐器和搜索工具

arXivLabs：与社区合作伙伴的实验项目

标题：使用矩阵乘法的分段操作