Performance Portable Gradient Computations Using Source Transformation

Liegeois, Kim; Kelley, Brian; Phipps, Eric; Rajamanickam, Sivasankaran; Vassilev, Vassil

Computer Science > Mathematical Software

arXiv:2507.13204 (cs)

[Submitted on 17 Jul 2025]

Title: Performance Portable Gradient Computations Using Source Transformation

Authors: Kim Liegeois, Brian Kelley, Eric Phipps, Sivasankaran Rajamanickam, Vassil Vassilev

Abstract: Derivative computation is a key component of optimization, sensitivity analysis, uncertainty quantification, and nonlinear solvers. Automatic differentiation (AD) is a powerful technique for evaluating such derivatives, and in recent years it has been integrated into programming environments such as JAX, PyTorch, and TensorFlow to support the derivative computations needed for training machine learning models, resulting in widespread use of these technologies. The C++ language has become the de facto standard for scientific computing due to numerous factors, yet language complexity has made the adoption of AD technologies for C++ difficult, hampering the incorporation of powerful differentiable programming approaches into C++ scientific simulations. This is exacerbated by the increasing emergence of architectures such as GPUs, which have limited memory capabilities and require massive thread-level concurrency. Portable scientific codes rely on domain-specific programming models such as Kokkos, which makes AD for such codes even more complex. In this paper, we investigate source transformation-based automatic differentiation using Clad to automatically generate portable and efficient gradient computations of Kokkos-based code. We discuss the modifications of Clad required to differentiate Kokkos abstractions. We illustrate the feasibility of our proposed strategy by comparing the wall-clock time of the generated gradient code with the wall-clock time of the input function on different cutting-edge GPU architectures: NVIDIA H100, AMD MI250x, and Intel Ponte Vecchio. For these three architectures and for the considered example, evaluating up to 10 000 entries of the gradient took at most 2.17x the wall-clock time of evaluating the input function.
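To make the abstract's cost claim concrete, the following is a minimal hand-written sketch of the kind of reverse-mode (adjoint) code that a source-transformation AD tool produces. It is not Clad's actual output and uses neither Clad nor Kokkos: the input function `f`, the gradient routine `f_grad`, and the checker `fd_entry` are hypothetical names chosen for illustration, and the loop is serial plain C++. The point it shows is that one reverse sweep fills every gradient entry, which is why the gradient can cost a small constant multiple of the function evaluation regardless of how many entries it has.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical input function (an assumption for illustration):
// f(x) = sum_i x[i] * sin(x[i]).
double f(const std::vector<double>& x) {
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum += x[i] * std::sin(x[i]);
  }
  return sum;
}

// Hand-written reverse-mode gradient of f. A source-transformation tool
// would emit code of this general shape; this is not generated by Clad.
std::vector<double> f_grad(const std::vector<double>& x) {
  std::vector<double> dx(x.size(), 0.0);  // adjoints of the inputs
  const double d_sum = 1.0;               // seed: d f / d sum = 1

  // Reverse sweep: visit the loop iterations in reverse order and
  // propagate the adjoint of `sum` back to each x[i].
  for (std::size_t i = x.size(); i-- > 0;) {
    // Adjoint of the statement: sum += x[i] * sin(x[i]);
    dx[i] += d_sum * (std::sin(x[i]) + x[i] * std::cos(x[i]));
  }
  return dx;
}

// Central finite difference for a single gradient entry, used only to
// sanity-check f_grad. Needs two evaluations of f per entry.
double fd_entry(std::vector<double> x, std::size_t i, double h = 1e-6) {
  x[i] += h;
  const double f_plus = f(x);
  x[i] -= 2.0 * h;
  const double f_minus = f(x);
  return (f_plus - f_minus) / (2.0 * h);
}
```

Calling `f_grad(x)` returns all entries of the gradient in a single pass, whereas `fd_entry` needs two evaluations of `f` per entry, so a finite-difference gradient with 10 000 entries would need 20 000 function evaluations.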

Subjects:	Mathematical Software (cs.MS)
Cite as:	arXiv:2507.13204 [cs.MS]
	(or arXiv:2507.13204v1 [cs.MS] for this version)
	https://doi.org/10.48550/arXiv.2507.13204

Submission history

From: Kim Liegeois
[v1] Thu, 17 Jul 2025 15:15:25 UTC (32 KB)
