Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

Qin, Yulei; Li, Gang; Li, Zongyi; Xu, Zihan; Shi, Yuchen; Lin, Zhekai; Cui, Xiao; Li, Ke; Sun, Xing

arXiv:2506.01413 (cs)

[Submitted on 2 Jun 2025 (v1), last revised 18 Jun 2025 (this version, v4)]

Abstract: Existing large language models (LLMs) struggle to follow complex instructions, especially when multiple constraints are present and organized in parallel, chaining, and branching structures. One intuitive solution, chain-of-thought (CoT), is expected to universally improve the capabilities of LLMs. However, we find that vanilla CoT has a negative impact on performance because of its superficial reasoning pattern of simply paraphrasing the instructions. It fails to unpack the composition of constraints and so cannot identify their relationships across hierarchies of types and dimensions. To this end, we propose a systematic method to boost LLMs in dealing with complex instructions by incentivizing reasoning for test-time compute scaling. First, we start from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable, rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate a steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method: a 1.5B LLM achieves gains of 11.74%, with performance comparable to an 8B LLM. Code and data will be available later (under review). Keywords: reinforcement learning with verifiable rewards (RLVR), instruction following, complex instructions
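To make the idea of a verifiable, rule-centric reward concrete, the following is a minimal illustrative sketch and not the authors' implementation (their code is not yet released). It assumes that each constraint can be written as a boolean check on text and that the reward is the fraction of satisfied checks. All names here (`max_words`, `contains`, `ends_with`, `parallel_reward`, `chain_reward`, `branch_reward`) are hypothetical, as is the way the three composition structures are scored.

```python
from typing import Callable, Sequence

# Hypothetical representation: a constraint is a rule that maps text to pass/fail.
Check = Callable[[str], bool]


def max_words(limit: int) -> Check:
    """Length constraint: at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def contains(keyword: str) -> Check:
    """Content constraint: the keyword appears (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()


def ends_with(suffix: str) -> Check:
    """Format constraint: the text ends with the given suffix."""
    return lambda text: text.rstrip().endswith(suffix)


def parallel_reward(response: str, checks: Sequence[Check]) -> float:
    """Parallel structure (assumed scoring): fraction of independent checks passed."""
    if not checks:
        return 0.0
    return sum(1 for check in checks if check(response)) / len(checks)


def chain_reward(steps: Sequence[str], checks: Sequence[Check]) -> float:
    """Chain structure (assumed scoring): step i must satisfy check i, and
    credit stops at the first failed step because later steps depend on it."""
    if not checks:
        return 0.0
    passed = 0
    for step, check in zip(steps, checks):
        if not check(step):
            break
        passed += 1
    return passed / len(checks)


def branch_reward(
    condition: bool,
    response: str,
    if_true: Sequence[Check],
    if_false: Sequence[Check],
) -> float:
    """Branch structure (assumed scoring): only the selected branch is scored."""
    return parallel_reward(response, if_true if condition else if_false)


if __name__ == "__main__":
    response = "Paris is the capital of France. Done"
    print(parallel_reward(response, [max_words(10), contains("paris"), ends_with("Done")]))
    print(chain_reward(["a b", "c"], [max_words(2), contains("x")]))
    print(branch_reward(False, response, [contains("paris")], [contains("berlin")]))
```

Because every check is a deterministic rule, a reward of this kind can be recomputed and audited without a learned judge, which is the property that the term "verifiable" refers to in RLVR.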

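Comments:	13 pages of main body, 3 tables, 5 figures, 45 pages of appendix
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2506.01413 [cs.CV]
	(or arXiv:2506.01413v4 [cs.CV] for this version)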
	https://doi.org/10.48550/arXiv.2506.01413

Submission history

From: Yulei Qin
[v1] Mon, 2 Jun 2025 08:11:44 UTC (35,059 KB)
[v2] Thu, 12 Jun 2025 13:57:57 UTC (36,507 KB)
[v3] Mon, 16 Jun 2025 07:40:34 UTC (36,509 KB)
[v4] Wed, 18 Jun 2025 00:33:20 UTC (36,509 KB)
