Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication

Shi, Jinliang; Li, Shigang; Xu, Youxuan; Wang, Xueying; Fu, Rongtian; Ma, Zhi; Wu, Tong

arXiv:2506.22714 (cs)

[Submitted on 28 Jun 2025]

Authors: Jinliang Shi, Shigang Li, Youxuan Xu, Xueying Wang, Rongtian Fu, Zhi Ma, Tong Wu

Abstract: Sparse matrix multiplication operators (i.e., SpMM and SDDMM) are widely used in deep learning and scientific computing. Modern accelerators are commonly equipped with Tensor cores and CUDA cores to accelerate sparse operators. The former offer superior computing power but only for structured matrix multiplication, while the latter deliver relatively lower performance but greater programming flexibility. In this work, we find that using either resource alone leads to inferior performance for sparse matrix multiplication, owing to their respective limitations. To this end, we propose Libra, a systematic approach that enables synergistic computation between CUDA and Tensor cores to achieve the best performance for sparse matrix multiplication. Specifically, we propose a 2D-aware workload distribution strategy that finds the sweet spot of task mapping for different sparse operators, leveraging both the high performance of Tensor cores and the low computational redundancy of CUDA cores. In addition, Libra incorporates systematic optimizations for heterogeneous computing, including hybrid load balancing, finely optimized kernel implementations, and GPU-accelerated preprocessing. Extensive experimental results on H100 and RTX 4090 GPUs show that Libra outperforms the state of the art, achieving on average a 3.1x speedup (up to 9.23x) over DTC-SpMM and a 2.9x speedup (up to 3.9x) on end-to-end GNN applications. Libra opens up a new perspective for sparse operator acceleration by fully exploiting the heterogeneous computing resources on GPUs.
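For readers unfamiliar with the two operators named in the abstract, the following is a minimal reference sketch of what SpMM and SDDMM compute. It is not Libra's implementation and says nothing about how Libra maps work to CUDA or Tensor cores. The CSR layout, the function names `spmm` and `sddmm`, and the choice to scale SDDMM results by the sparse matrix's stored values are assumptions made for illustration; some formulations of SDDMM use the sparse matrix only as a sampling mask.

```python
# Reference (unoptimized) definitions of SpMM and SDDMM in pure Python.
# Assumed CSR layout: indptr[i]:indptr[i+1] spans row i's entries
# in `indices` (column ids) and `values` (stored nonzeros).

def spmm(indptr, indices, values, B):
    """C = A @ B, where A is sparse (CSR) and B is dense (list of rows)."""
    n_rows = len(indptr) - 1
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            k, a = indices[p], values[p]
            for j in range(n_cols):
                C[i][j] += a * B[k][j]
    return C

def sddmm(indptr, indices, values, X, Y):
    """out = A * (X @ Y^T) elementwise, evaluated only at A's nonzeros.

    Returns the new nonzero values in CSR order; the sparsity pattern
    (indptr, indices) is unchanged. Scaling by A's values is an
    assumption of this sketch.
    """
    out = []
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            dot = sum(x * y for x, y in zip(X[i], Y[j]))
            out.append(values[p] * dot)
    return out

# Example sparse matrix A = [[1, 0, 2],
#                            [0, 3, 0]]
indptr, indices, values = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]

B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 x 2 dense matrix
C = spmm(indptr, indices, values, B)        # 2 x 2 dense result

X = [[1.0, 0.0], [0.0, 1.0]]                # 2 x 2 dense matrix
Y = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]    # 3 x 2 dense matrix
S = sddmm(indptr, indices, values, X, Y)    # 3 values, same pattern as A
```

Both loops touch only the stored nonzeros, which is the irregular access pattern that flexible scalar cores handle without redundant work, whereas block-oriented matrix units must pad sparse tiles to dense ones.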

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
ACM classes:	C.1.4; I.2.11
Cite as:	arXiv:2506.22714 [cs.DC]
	(or arXiv:2506.22714v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2506.22714

Submission history

From: Jinliang Shi
[v1] Sat, 28 Jun 2025 01:50:13 UTC (4,082 KB)
