Pushing the Envelope of LLM Inference on AI-PC

Georganas, Evangelos; Kalamkar, Dhiraj; Heinecke, Alexander

arXiv:2508.06753 (cs)

[Submitted on 8 Aug 2025]

Title: Pushing the Envelope of LLM Inference on AI-PC

Authors: Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke

Abstract: The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.cpp by up to 2.2x, and deliver up to 7x speedup compared to 16-bit model inference. Our optimized runtime advances the state of LLM inference on AI PCs and edge devices, paving the way for efficient deployment of ultra-low-bit LLM models.
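
To make the idea of 2-bit weights concrete, here is a minimal plain-Python sketch of how ternary weights (-1, 0, +1) could be stored four per byte and used in a dot product. It is an illustration only, not the microkernel described in the abstract: the encoding (code = weight + 1), the bit order within a byte, and the names `pack_2bit`, `unpack_2bit`, and `dot_2bit` are all assumptions made for this example. Optimized CPU kernels would operate on vector registers rather than looping over Python lists.

```python
# Illustrative only: a plain-Python sketch of 2-bit weight packing.
# The encoding (0 -> -1, 1 -> 0, 2 -> +1) and all function names are
# assumptions for this example, not the paper's microkernel layout.


def pack_2bit(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, four weights per byte."""
    if len(weights) % 4 != 0:
        raise ValueError("weight count must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            if w not in (-1, 0, 1):
                raise ValueError("weights must be -1, 0, or +1")
            byte |= (w + 1) << (2 * j)  # 2-bit code stored in bits 2j..2j+1
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed):
    """Recover the ternary weights from the packed bytes."""
    weights = []
    for byte in packed:
        for j in range(4):
            weights.append(((byte >> (2 * j)) & 0b11) - 1)
    return weights


def dot_2bit(packed, activations, scale=1.0):
    """Dot product of packed ternary weights with a list of activations."""
    total = 0.0
    for i, w in enumerate(unpack_2bit(packed)):
        total += w * activations[i]
    return scale * total


weights = [1, -1, 0, 1, 0, 0, -1, 1]
packed = pack_2bit(weights)  # 8 weights occupy 2 bytes
result = dot_2bit(packed, [1, 2, 3, 4, 5, 6, 7, 8])
```

At 2 bits per weight, the packed form needs one eighth of the storage of the same weights held in a 16-bit format, which is the kind of memory saving that motivates ultra-low-bit inference.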

Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Cite as:	arXiv:2508.06753 [cs.AI]
	(or arXiv:2508.06753v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2508.06753

Submission history

From: Evangelos Georganas
[v1] Fri, 8 Aug 2025 23:33:38 UTC (324 KB)
