Quantized Neural Network Inference with Precision Batching

Lam, Maximilian; Yedidia, Zachary; Banbury, Colby; Reddi, Vijay Janapa


arXiv:2003.00822 (cs)

[Submitted on 26 Feb 2020]


Abstract: We present PrecisionBatching, a quantized inference algorithm that speeds up neural network execution on traditional hardware platforms at low bitwidths without retraining or recalibration. PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while keeping activations in full precision. Beyond enabling quantized inference at low bitwidths (< 8 bits) without retraining or recalibration, PrecisionBatching 1) lets traditional hardware platforms realize inference speedups at a finer granularity of quantization (e.g., 1-16 bit execution) and 2) allows accuracy and speedup to be traded off at runtime by exposing the number of bitlayers to accumulate as a tunable parameter. Across a variety of applications (MNIST, language modeling, natural language inference) and neural network architectures (fully connected, RNN, LSTM), PrecisionBatching yields end-to-end speedups of over 8x on a GPU within a < 1% error margin of the full-precision baseline, outperforming traditional 8-bit quantized inference by over 1.5x-2x at the same error tolerance.
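The bitlayer idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`decompose_bitlayers`, `bitlayer_matmul`), the unsigned min/max quantization scheme, and the use of floating-point matrix products in place of fast 1-bit GPU kernels are all assumptions made for illustration. It only shows how a weight matrix can be split into binary bit planes and how the number of accumulated planes becomes a runtime accuracy knob while activations stay in full precision.

```python
import numpy as np


def decompose_bitlayers(weights, num_bits=8):
    """Quantize `weights` to unsigned `num_bits` integers and split them into
    binary bit planes, most significant bit first (hypothetical helper).

    Returns (bitlayers, scale, offset) such that
    weights ~= scale * sum_i 2**(num_bits - 1 - i) * bitlayers[i] + offset.
    """
    w_min = float(weights.min())
    w_max = float(weights.max())
    levels = 2 ** num_bits - 1
    # Guard against a constant matrix, where the range would be zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    q = np.rint((weights - w_min) / scale).astype(np.int64)
    q = np.clip(q, 0, levels)
    # Each bit plane holds only 0s and 1s.
    bitlayers = [((q >> b) & 1).astype(np.float64)
                 for b in range(num_bits - 1, -1, -1)]
    return bitlayers, scale, w_min


def bitlayer_matmul(activations, bitlayers, scale, offset, num_layers=None):
    """Approximate activations @ weights.T by accumulating the `num_layers`
    most significant bit planes. Activations are kept in full precision."""
    num_bits = len(bitlayers)
    num_layers = num_bits if num_layers is None else min(num_layers, num_bits)
    out = np.zeros((activations.shape[0], bitlayers[0].shape[0]))
    for i in range(num_layers):
        # A float matmul stands in for a fast 1-bit kernel here.
        out += 2 ** (num_bits - 1 - i) * (activations @ bitlayers[i].T)
    # The offset term restores the shift removed during quantization.
    return scale * out + offset * activations.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 16))
    x = rng.standard_normal((3, 16))
    layers, scale, offset = decompose_bitlayers(W, num_bits=8)
    for k in (2, 4, 8):
        approx = bitlayer_matmul(x, layers, scale, offset, num_layers=k)
        print(k, float(np.max(np.abs(approx - x @ W.T))))
```

In this sketch, dropping low-order bit planes truncates the quantized weights, so the worst-case per-weight error grows as fewer planes are accumulated; that is the accuracy/speed tradeoff the abstract describes, in simplified form.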

Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Cite as:	arXiv:2003.00822 [cs.LG]
	(or arXiv:2003.00822v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2003.00822

Submission history

From: Maximilian Lam
[v1] Wed, 26 Feb 2020 19:34:11 UTC (107 KB)
