ProxyThinker: Test-Time Guidance through Small Visual Reasoners

Xiao, Zilin; Koo, Jaywon; Ouyang, Siru; Hernandez, Jefferson; Meng, Yu; Ordonez, Vicente


arXiv:2505.24872 (cs)

[Submitted on 30 May 2025]


Authors: Zilin Xiao, Jaywon Koo, Siru Ouyang, Jefferson Hernandez, Yu Meng, Vicente Ordonez

Abstract: Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities in large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits the slow-thinking reasoning demonstrated by the emerged sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks on spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with the performance of their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38$\times$ faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://github.com/MrZilinXiao/ProxyThinker.
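The distribution subtraction described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the difference between a small RFT reasoner and its small base model is added to the large base model's next-token logits, that all three models share one vocabulary, and that a scaling weight `alpha` controls the strength of the guidance. The function names, `alpha`, and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def guided_logits(large_base, small_rft, small_base, alpha=1.0):
    """Shift the large base model's logits by the (RFT - base) difference
    measured on the small model pair. `alpha` is an assumed scaling weight."""
    if not (len(large_base) == len(small_rft) == len(small_base)):
        raise ValueError("all models must share one vocabulary")
    return [
        big + alpha * (rft - base)
        for big, rft, base in zip(large_base, small_rft, small_base)
    ]


def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy next-token logits over a 4-token vocabulary (made-up numbers).
large_base = [2.0, 1.0, 0.5, 0.0]
small_base = [1.5, 1.0, 0.0, 0.0]
small_rft = [0.5, 1.0, 2.5, 0.0]

combined = guided_logits(large_base, small_rft, small_base)
combined_log_probs = log_softmax(combined)
```

In this toy case the large base model alone would pick token 0, while the guided logits favor token 2, the token the small RFT reasoner promotes relative to its own base. In a real decoding loop the three forward passes would run at every generation step, which is why the abstract emphasizes coordinating multiple models efficiently.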

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2505.24872 [cs.CV]
	(or arXiv:2505.24872v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2505.24872

Submission history

From: Zilin Xiao
[v1] Fri, 30 May 2025 17:59:43 UTC (3,078 KB)
