APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration

Ma, Shaobo; Fang, Chao; Shao, Haikuo; Wang, Zhongfeng

doi:10.1109/TCAD.2025.3604321

Computer Science > Machine Learning

arXiv:2508.19087 (cs)

[Submitted on 26 Aug 2025]

Title: APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration

Authors: Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

Abstract: Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs; however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to limited GPU Tensor Core support, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary-precision LLMs, namely APT-LLM. First, we introduce a novel data format, bipolar-INT, which allows efficient and lossless conversion to and from signed INT while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method that supports arbitrary precision by dismantling and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable kernel hyperparameters for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99$\times$ speedup compared to FP16 baselines and up to a 2.16$\times$ speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to a 2.44$\times$ speedup over FP16 and up to a 1.65$\times$ speedup over CUTLASS integer baselines.
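The bit-level MatMul idea mentioned in the abstract can be illustrated with a small sketch. This is not the APT-LLM kernel: it assumes plain unsigned integer operands rather than bipolar-INT (whose encoding the abstract does not specify), runs on the CPU in pure Python, and uses hypothetical helper names (`bit_planes`, `bitwise_matmul`). It only shows the underlying identity: if $A = \sum_i 2^i A_i$ and $W = \sum_j 2^j W_j$ with binary matrices $A_i$, $W_j$, then $AW = \sum_{i,j} 2^{i+j} A_i W_j$, so a MatMul at any bit width reduces to shifted sums of 1-bit MatMuls.

```python
# Illustrative sketch only: bit-plane decomposition of an integer MatMul.
# Assumes unsigned operands; this is NOT the APT-LLM kernel or its bipolar-INT format.

def bit_planes(matrix, bits):
    """Split a matrix of unsigned ints into `bits` binary (0/1) matrices, LSB first."""
    return [[[(v >> b) & 1 for v in row] for row in matrix] for b in range(bits)]

def matmul(a, b):
    """Plain reference MatMul on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def bitwise_matmul(a, w, a_bits, w_bits):
    """Compute a @ w by summing shifted 1-bit MatMuls over all bit-plane pairs."""
    rows, cols = len(a), len(w[0])
    out = [[0] * cols for _ in range(rows)]
    a_planes = bit_planes(a, a_bits)
    w_planes = bit_planes(w, w_bits)
    for i, a_plane in enumerate(a_planes):
        for j, w_plane in enumerate(w_planes):
            # On a GPU, this binary product is the part a 1-bit Tensor Core
            # operation would compute; here it is an ordinary MatMul.
            partial = matmul(a_plane, w_plane)
            for r in range(rows):
                for c in range(cols):
                    out[r][c] += partial[r][c] << (i + j)  # weight by 2^(i+j)
    return out

A = [[5, 3], [7, 1]]  # values fit in 3 bits
W = [[2, 1], [3, 0]]  # values fit in 2 bits
assert bitwise_matmul(A, W, a_bits=3, w_bits=2) == matmul(A, W)
```

Because the two bit widths are independent loop bounds, the same routine covers any precision pair (for example 3-bit activations with 2-bit weights), which is the flexibility the abstract refers to.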

Comments:	To appear in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Cite as:	arXiv:2508.19087 [cs.LG]
	(or arXiv:2508.19087v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2508.19087
Related DOI:	https://doi.org/10.1109/TCAD.2025.3604321

Submission history

From: Shaobo Ma
[v1] Tue, 26 Aug 2025 14:48:29 UTC (1,522 KB)
