DGEMM without FP64 Arithmetic -- Using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

Mukunoki, Daichi

Computer Science > Performance

arXiv:2508.00441v2 (cs)

[Submitted on 1 Aug 2025 (v1), last revised 9 Aug 2025 (this version, v2)]

Title: DGEMM without FP64 Arithmetic -- Using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

Authors: Daichi Mukunoki

Abstract: As the demand for AI computations rapidly increases, more hardware is being developed to efficiently perform the low-precision matrix multiplications required by such workloads. However, these operations are generally not directly applicable to scientific computations due to accuracy requirements. The Ozaki scheme -- an accurate matrix multiplication method proposed by Ozaki et al. in 2012 -- enables FP64 matrix multiplication (DGEMM) using low-precision matrix multiplication units, such as FP16 Tensor Cores. This approach has since been extended to utilize integer arithmetic, offering lower computational cost compared to floating-point-based implementations. In fact, it has achieved higher performance than hardware FP64 operations on GPUs equipped with fast INT8 Tensor Cores designed for AI workloads. However, recent hardware trends have shifted toward improving the performance of low-precision floating-point operations, such as FP8, rather than integer operations. Motivated by this shift, this study revisits the use of low-precision floating-point operations in the Ozaki scheme. Specifically, we explore the use of FP8 Tensor Cores to perform DGEMM. In addition, for processors that support very slow or no FP64 operations, we also consider FP64 emulation based on integer arithmetic. Furthermore, we explore the use of blocking in the inner-product direction to accelerate FP16-based implementations. We demonstrate the effectiveness of these methods by evaluating the performance on an NVIDIA Blackwell architecture GPU.
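
For orientation, the slicing idea behind the integer variant of the Ozaki scheme mentioned in the abstract can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation evaluated in the paper. The names `split_rows` and `ozaki_matmul`, the 7-bit slice width, and the slice count of 8 are hypothetical choices made for this example. Python's arbitrary-precision integers stand in for INT8 Tensor Core products with integer accumulation, and neither the FP8 path nor the integer-based FP64 emulation studied in the paper is modeled.

```python
import math

def split_rows(rows, num_slices, bits):
    """Split each row of doubles into integer slices sharing one exponent per row.

    Returns (exps, slices) where, up to truncation of the lowest bits,
    rows[i][k] == 2**exps[i] * sum_s slices[s][i][k] * 2**(-bits * (s + 1)).
    Assumes no underflow to subnormals when scaling.
    """
    exps = []
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        amax = max(abs(x) for x in row)
        e = math.frexp(amax)[1] if amax > 0.0 else 0  # every |x| < 2**e
        exps.append(e)
        rem = [math.ldexp(x, -e) for x in row]        # exact scaling, |rem| < 1
        for s in range(num_slices):
            digits = []
            for k, r in enumerate(rem):
                scaled = math.ldexp(r, bits)          # exact
                d = math.trunc(scaled)                # |d| < 2**bits, fits INT8 if bits <= 7
                digits.append(d)
                rem[k] = scaled - d                   # exact fractional remainder
            slices[s].append(digits)
    return exps, slices

def ozaki_matmul(A, B, num_slices=8, bits=7):
    """Multiply list-of-lists matrices A (m x k) and B (k x n) via integer slices."""
    m, inner, n = len(A), len(B), len(B[0])
    Bt = [[B[k][j] for k in range(inner)] for j in range(n)]  # split B by columns
    ea, sa = split_rows(A, num_slices, bits)
    eb, sb = split_rows(Bt, num_slices, bits)
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # exact integer accumulator (assumption: unbounded width)
            for s in range(num_slices):
                for t in range(num_slices):
                    # Slice-by-slice dot product: exact in integer arithmetic.
                    p = sum(x * y for x, y in zip(sa[s][i], sb[t][j]))
                    shift = bits * (2 * num_slices - (s + 1) - (t + 1))
                    acc += p << shift
            # One rounding to double, then an exact power-of-two rescale
            # (barring overflow or underflow).
            C[i][j] = math.ldexp(float(acc), ea[i] + eb[j] - 2 * bits * num_slices)
    return C

if __name__ == "__main__":
    A = [[1.5, -2.25], [0.125, 3.0]]
    B = [[4.0, 0.5], [-1.0, 2.0]]
    print(ozaki_matmul(A, B))
```

With 8 slices of 7 bits, each row or column is represented to 56 fixed-point bits relative to its largest element, so elements much smaller than the row maximum lose their lowest bits. Practical implementations typically also skip slice pairs whose contribution falls below the target accuracy; this sketch computes all pairs for simplicity.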

Subjects:	Performance (cs.PF); Hardware Architecture (cs.AR); Mathematical Software (cs.MS)
Cite as:	arXiv:2508.00441 [cs.PF]
 	(or arXiv:2508.00441v2 [cs.PF] for this version)
 	https://doi.org/10.48550/arXiv.2508.00441

Submission history

From: Daichi Mukunoki
[v1] Fri, 1 Aug 2025 08:58:00 UTC (82 KB)
[v2] Sat, 9 Aug 2025 12:24:09 UTC (86 KB)
