Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective

Benazir, Afsara; Lin, Felix Xiaozhu

Computer Science > Performance

arXiv:2508.08531 (cs)

[Submitted on 12 Aug 2025]

Authors: Afsara Benazir, Felix Xiaozhu Lin

Abstract: A systematic understanding of Apple Silicon is lacking in the current landscape of hardware efficiency; research largely focuses on accelerating GPUs for large-scale training or inference on CUDA devices. This paper investigates Apple Silicon's unique memory architecture, which offers a unified memory integrating CPU and GPU memory, and its implications for on-device LLM inference. We examine myths about whether Apple Silicon is efficient for on-device inference compared to competitors such as NVIDIA GPUs by directly benchmarking latency and throughput. We explain the performance gap between them by profiling low-level hardware metrics at runtime, including ALU utilization, memory bandwidth, buffer usage, and cache residency. We draw several insights regarding performance bottlenecks such as dequantization overhead, compute throughput, and memory bandwidth. We debunk existing false claims regarding large language model inference, such as the claim that compressing models to lower bit precision is a de facto promise of faster inference across all hardware platforms. We find that the large unified memory enables Apple Silicon to be both cost-effective and efficient against NVIDIA GPUs for ultra-large language models. Our large-scale evaluation spans 5 hardware testbeds (three Apple M-series devices: M2 Ultra, M2 Max, and M4 Pro; and two NVIDIA configurations: a single NVIDIA RTX A6000 and a multi-GPU setup with 2x NVIDIA RTX A6000), 5 model scales ranging from 8B to 405B parameters, and 14 quantization schemes, giving an understanding of how Apple Silicon fits within the paradigm of on-device LLM inference. Our analysis reveals multiple resource interdependencies and unexpected findings, while also quantifying established insights. To the best of our knowledge, this study makes the first attempt to present a thorough characterization and analysis of Apple Silicon for on-device inference.
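To make the abstract's point about memory capacity at large model scales concrete, the following is a minimal back-of-envelope sketch, not the authors' methodology. It assumes weight storage is simply parameters × bits / 8 and ignores the KV cache, activations, and quantization metadata such as scales and zero points. The function names `weight_footprint_gb` and `fits_in_memory` are hypothetical and chosen only for illustration.

```python
# Back-of-envelope sketch: weight-only memory footprint at a given bit width.
# Assumptions (not from the paper): footprint = parameters * bits / 8 bytes;
# KV cache, activations, and quantization metadata are ignored.


def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_in_memory(num_params: float, bits_per_weight: float, capacity_gb: float) -> bool:
    """True if the weights alone fit within the given memory capacity."""
    return weight_footprint_gb(num_params, bits_per_weight) <= capacity_gb


if __name__ == "__main__":
    # The two ends of the model-scale range named in the abstract.
    for params in (8e9, 405e9):
        for bits in (16, 8, 4):
            gb = weight_footprint_gb(params, bits)
            print(f"{params / 1e9:.0f}B params @ {bits}-bit: ~{gb:.1f} GB")
```

Under these assumptions, a 405B-parameter model needs roughly 202.5 GB for weights alone even at 4 bits per weight, which illustrates why available memory capacity can become the deciding factor at that scale. Calling `fits_in_memory` with a device's capacity (a value the reader must supply) gives a quick feasibility check.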

Subjects:	Performance (cs.PF)
Cite as:	arXiv:2508.08531 [cs.PF]
	(or arXiv:2508.08531v1 [cs.PF] for this version)
	https://doi.org/10.48550/arXiv.2508.08531

Submission history

From: Afsara Benazir
[v1] Tue, 12 Aug 2025 00:06:34 UTC (5,606 KB)
