
Computer Science > Computer Vision and Pattern Recognition

arXiv:2508.03351 (cs)
[Submitted on 5 Aug 2025]

Title: VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation

Authors: Yufei Xue, Yushi Huang, Jiawei Shao, Jun Zhang
Abstract: Post-training quantization (PTQ) has emerged as an effective approach for compressing large models and accelerating their inference without retraining. While PTQ has been extensively studied in the context of large language models (LLMs), its applicability to vision-language models (VLMs) remains underexplored. In this paper, we identify a modality discrepancy (\emph{i.e.}, limited text tokens \emph{vs.} excessive and redundant vision tokens) of VLMs. However, existing Hessian-based LLM PTQ methods treat all tokens equally during quantization, resulting in severe performance drops when applied to VLMs. Motivated by this observation, we propose a novel importance-aware PTQ framework tailored for VLMs, dubbed VLMQ. Specifically, to address vision token redundancy, VLMQ 1) optimizes an importance-aware objective that yields an enhanced Hessian with token-level importance factors, while retaining compatibility with parallelized weight updates, and 2) ensures efficiency and effectiveness by computing these factors via a single lightweight block-wise backward pass, guided by a theoretical connection to token-level perturbations. Extensive evaluations on 8 benchmarks across 0.5B$\sim$32B VLMs demonstrate the state-of-the-art (SOTA) performance of our VLMQ, particularly under low-bit settings. For example, it achieves a substantial \textbf{16.45\%} improvement on MME-RealWorld under 2-bit quantization.
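The abstract describes the core mechanism only at a high level: an importance-aware objective that augments the usual Hessian proxy used in GPTQ-style weight updates with token-level importance factors. The snippet below is a minimal illustrative sketch of that idea, assuming the enhanced Hessian takes the form H = Xᵀ diag(w) X with non-negative per-token weights w; the function names, shapes, normalization, and damping constant are hypothetical and are not taken from the paper's implementation.

```python
import torch

def importance_weighted_hessian(X: torch.Tensor, token_importance: torch.Tensor) -> torch.Tensor:
    """Build a token-importance-weighted Hessian proxy for PTQ weight updates.

    X:                (num_tokens, hidden_dim) calibration activations for one linear layer.
    token_importance: (num_tokens,) non-negative per-token weights (hypothetical; the paper
                      derives such factors from a single block-wise backward pass linked to
                      token-level perturbations).

    Standard Hessian proxy in GPTQ-style PTQ:   H = X^T X      (all tokens weighted equally).
    Importance-aware variant sketched here:     H = X^T diag(w) X.
    """
    w = token_importance / token_importance.sum()   # normalize so the scale stays comparable
    return (X * w.unsqueeze(1)).T @ X               # X^T diag(w) X, shape (hidden_dim, hidden_dim)


# Illustrative usage with random tensors (not real calibration data):
X = torch.randn(4096, 768)                          # e.g., mixed vision + text tokens for one layer
imp = torch.rand(4096)                              # placeholder per-token importance scores
H = importance_weighted_hessian(X, imp)
H += 1e-2 * torch.eye(H.shape[0])                   # damping, as commonly done in PTQ pipelines
```

Because the weighting only rescales rows of X before forming the Gram matrix, the resulting H can be plugged into existing Hessian-based, parallelized weight-update routines unchanged, which is consistent with the compatibility claim in the abstract.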
Comments: 13 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2508.03351 [cs.CV]
  (or arXiv:2508.03351v1 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2508.03351
arXiv-issued DOI via DataCite

Submission history

From: Yufei Xue
[v1] Tue, 5 Aug 2025 11:57:03 UTC (4,675 KB)