Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices

Venkat, Sreeram; Swirydowicz, Kasia; Wolfe, Noah; Ghattas, Omar

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2508.10202 (cs)

[Submitted on 13 Aug 2025]

Title: Mixed-Precision Performance Portability of FFT-Based GPU-Accelerated Algorithms for Block-Triangular Toeplitz Matrices

Authors: Sreeram Venkat, Kasia Swirydowicz, Noah Wolfe, Omar Ghattas

Abstract: The hardware diversity displayed in leadership-class computing facilities, alongside the immense performance boosts exhibited by today's GPUs when computing in lower precision, provides a strong incentive for scientific HPC workflows to adopt mixed-precision algorithms and performance portability models. We present an on-the-fly framework using Hipify for performance portability and apply it to FFTMatvec, an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices. Our approach enables FFTMatvec, initially a CUDA-only application, to run seamlessly on AMD GPUs with excellent observed performance. Performance optimizations for AMD GPUs are integrated directly into the open-source rocBLAS library, keeping the application code unchanged. We then present a dynamic mixed-precision framework for FFTMatvec; a Pareto front analysis determines the optimal mixed-precision configuration for a desired error tolerance. Results are shown for AMD Instinct MI250X, MI300X, and the newly launched MI355X GPUs. The performance-portable, mixed-precision FFTMatvec is scaled to 2,048 GPUs on the OLCF Frontier supercomputer.
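For readers unfamiliar with the underlying idea, the following is a minimal sketch of an FFT-based matrix-vector product for a scalar lower-triangular Toeplitz matrix, using zero-padding to a circulant embedding. It is an illustration only and does not reflect FFTMatvec's actual implementation: the paper treats block-triangular matrices on GPUs, whereas this sketch handles the scalar case on the CPU in double precision, and the function names (`fft`, `toeplitz_lower_matvec`, `dense_lower_matvec`) are hypothetical.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def toeplitz_lower_matvec(t, x):
    """Compute y = T x, where T is lower-triangular Toeplitz with first column t.

    T x equals the first n entries of the linear convolution of t and x,
    which is evaluated here with zero-padded FFTs of length m >= 2n.
    """
    n = len(t)
    m = 1
    while m < 2 * n:
        m *= 2
    t_hat = fft(list(t) + [0.0] * (m - n))
    x_hat = fft(list(x) + [0.0] * (m - n))
    y = fft([a * b for a, b in zip(t_hat, x_hat)], inverse=True)
    # The inverse transform above is unnormalized, so divide by m.
    return [v.real / m for v in y[:n]]


def dense_lower_matvec(t, x):
    """Reference O(n^2) product with the same matrix, for checking."""
    n = len(t)
    return [sum(t[i - j] * x[j] for j in range(i + 1)) for i in range(n)]
```

The FFT route costs O(n log n) operations instead of O(n^2), which is the general reason Toeplitz structure is exploited this way; comparing `toeplitz_lower_matvec(t, x)` against `dense_lower_matvec(t, x)` on a small input is a quick sanity check.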

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Numerical Analysis (math.NA)
MSC classes:	65Y20, 65Y05, 65Y10, 68Q25, 68W40, 65M32, 5B05
ACM classes:	F.2; G.4; C.4
Cite as:	arXiv:2508.10202 [cs.DC]
	(or arXiv:2508.10202v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2508.10202

Submission history

From: Sreeram Venkat
[v1] Wed, 13 Aug 2025 21:29:26 UTC (73 KB)
