Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples

Yu, Fangxu; Jiang, Lai; Kang, Haoqiang; Hao, Shibo; Qin, Lianhui

arXiv:2406.05673v6 (cs)

[Submitted on 9 Jun 2024 (v1), last revised 27 May 2025 (this version, v6)]

Title: Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples

Authors: Fangxu Yu, Lai Jiang, Haoqiang Kang, Shibo Hao, Lianhui Qin

Abstract: The ability to generate diverse solutions to a given problem is a hallmark of human creativity. This divergent reasoning is also crucial for machines, enhancing their robustness and enabling them to assist humans in many applications such as scientific discovery. However, existing approaches to multi-step reasoning with large language models (LLMs) have mostly focused only on reasoning accuracy, without further discovering more diverse valid solutions. For example, supervised fine-tuning improves reasoning quality but requires vast labeled data, while reward-maximizing reinforcement learning finds top-reward solutions while neglecting the solution diversity. To fill this gap, we propose Flow of Reasoning (FoR), an efficient diversity-seeking LLM finetuning method aimed at improving reasoning quality and diversity with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow on a DAG-structured reasoning graph. This formulation allows us to incorporate and adapt principled GFlowNet approaches, for finetuning LLMs to sample divergent paths with probabilities proportional to the (unnormalized) reward of target problems. Extensive experiments show that, with limited training examples (e.g., 15 examples), FoR enables the discovery of diverse, creative, high-quality solutions, greatly outperforming a wide range of existing inference and training methods across six challenging reasoning tasks, including BlocksWorld (embodied reasoning), Game24 (math puzzle solving), Rubik's Cube (spatial reasoning), 1D-ARC (abstraction reasoning), GSM8k (math reasoning), and ProntoQA (logical reasoning). Code is available at https://github.com/Yu-Fangxu/FoR.
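The sampling target named in the abstract (terminal states reached with probability proportional to their unnormalized reward) can be illustrated on a toy DAG. The sketch below is not the authors' code or training procedure: the graph, state names, rewards, and the uniform backward policy are all hypothetical, and the flows are computed exactly by enumeration rather than learned by fine-tuning an LLM.

```python
from collections import defaultdict

# Hypothetical toy reasoning DAG: each state maps to its child states.
# "y" has two parents, so it is reachable along two distinct paths.
CHILDREN = {
    "s0": ["a", "b"],
    "a": ["x", "y"],
    "b": ["y", "z"],
    "x": [], "y": [], "z": [],
}
# Hypothetical unnormalized rewards of the terminal states.
REWARD = {"x": 1.0, "y": 2.0, "z": 1.0}


def parents_of(children):
    """Invert the child map so each state lists its parents."""
    parents = defaultdict(list)
    for state, kids in children.items():
        for kid in kids:
            parents[kid].append(state)
    return parents


def terminal_distribution(children, reward, root="s0"):
    """Distribution over terminal states induced by a flow-matched policy.

    Assumes a uniform backward policy: a state's flow is split evenly
    among its parents. The forward policy is edge flow / state flow.
    """
    parents = parents_of(children)
    flow = {}

    def state_flow(s):
        if s not in flow:
            if children[s]:
                flow[s] = sum(edge_flow(s, c) for c in children[s])
            else:
                flow[s] = reward[s]  # terminal flow equals the reward
        return flow[s]

    def edge_flow(s, c):
        return state_flow(c) / len(parents[c])

    dist = defaultdict(float)

    def walk(s, prob):
        # Enumerate every path from s, accumulating its probability.
        if not children[s]:
            dist[s] += prob
            return
        for c in children[s]:
            walk(c, prob * edge_flow(s, c) / state_flow(s))

    walk(root, 1.0)
    return dict(dist)


if __name__ == "__main__":
    print(terminal_distribution(CHILDREN, REWARD))
```

In this toy case the total reward is 4, so the policy reaches `y` half of the time and `x` and `z` a quarter of the time each, instead of always choosing the single highest-reward state as a pure reward maximizer would.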

Comments:	Accepted by ICML 2025
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2406.05673 [cs.AI]
	(or arXiv:2406.05673v6 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2406.05673

Submission history

From: Fangxu Yu
[v1] Sun, 9 Jun 2024 07:06:58 UTC (313 KB)
[v2] Mon, 24 Jun 2024 15:49:09 UTC (313 KB)
[v3] Fri, 4 Oct 2024 15:14:55 UTC (314 KB)
[v4] Fri, 21 Feb 2025 16:17:17 UTC (675 KB)
[v5] Sat, 8 Mar 2025 13:10:25 UTC (690 KB)
[v6] Tue, 27 May 2025 03:51:13 UTC (469 KB)