
Computer Science > Computer Vision and Pattern Recognition

arXiv:2508.03351 (cs)
[Submitted on 5 Aug 2025]

Title: VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation

Authors: Yufei Xue, Yushi Huang, Jiawei Shao, Jun Zhang
Abstract: Post-training quantization (PTQ) has emerged as an effective approach for compressing large models and accelerating their inference without retraining. While PTQ has been extensively studied in the context of large language models (LLMs), its applicability to vision-language models (VLMs) remains underexplored. In this paper, we identify a modality discrepancy (\emph{i.e.}, limited text tokens \emph{vs.} excessive and redundant vision tokens) of VLMs. However, existing Hessian-based LLM PTQ methods treat all tokens equally during quantization, resulting in severe performance drops when applied to VLMs. Motivated by this observation, we propose a novel importance-aware PTQ framework tailored for VLMs, dubbed VLMQ. Specifically, to address vision token redundancy, VLMQ 1) optimizes an importance-aware objective that yields an enhanced Hessian with token-level importance factors, while retaining compatibility with parallelized weight updates, and 2) ensures efficiency and effectiveness by computing these factors via a single lightweight block-wise backward pass, guided by a theoretical connection to token-level perturbations. Extensive evaluations on 8 benchmarks across 0.5B$\sim$32B VLMs demonstrate the state-of-the-art (SOTA) performance of our VLMQ, particularly under low-bit settings. For example, it achieves a substantial \textbf{16.45\%} improvement on MME-RealWorld under 2-bit quantization.
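The abstract describes the core mechanism only at a high level: an importance-aware objective that augments the usual Hessian proxy used in GPTQ-style weight updates with token-level importance factors. The snippet below is a minimal illustrative sketch of that idea, assuming the enhanced Hessian takes the form H = Xᵀ diag(w) X with non-negative per-token weights w; the function names, shapes, normalization, and damping constant are hypothetical and are not taken from the paper's implementation.

```python
import torch

def importance_weighted_hessian(X: torch.Tensor, token_importance: torch.Tensor) -> torch.Tensor:
    """Build a token-importance-weighted Hessian proxy for PTQ weight updates.

    X:                (num_tokens, hidden_dim) calibration activations for one linear layer.
    token_importance: (num_tokens,) non-negative per-token weights (hypothetical; the paper
                      derives such factors from a single block-wise backward pass linked to
                      token-level perturbations).

    Standard Hessian proxy in GPTQ-style PTQ:   H = X^T X      (all tokens weighted equally).
    Importance-aware variant sketched here:     H = X^T diag(w) X.
    """
    w = token_importance / token_importance.sum()   # normalize so the scale stays comparable
    return (X * w.unsqueeze(1)).T @ X               # X^T diag(w) X, shape (hidden_dim, hidden_dim)


# Illustrative usage with random tensors (not real calibration data):
X = torch.randn(4096, 768)                          # e.g., mixed vision + text tokens for one layer
imp = torch.rand(4096)                              # placeholder per-token importance scores
H = importance_weighted_hessian(X, imp)
H += 1e-2 * torch.eye(H.shape[0])                   # damping, as commonly done in PTQ pipelines
```

Because the weighting only rescales rows of X before forming the Gram matrix, the resulting H can be plugged into existing Hessian-based, parallelized weight-update routines unchanged, which is consistent with the compatibility claim in the abstract.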
Comments: 13 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2508.03351 [cs.CV]
  (or arXiv:2508.03351v1 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2508.03351
arXiv-issued DOI via DataCite

Submission history

From: Yufei Xue
[v1] Tue, 5 Aug 2025 11:57:03 UTC (4,675 KB)