Computer Science > Artificial Intelligence

arXiv:2502.00691 (cs)
[Submitted on 2 Feb 2025 (v1), last revised 18 Jul 2025 (this version, v4)]

Title: To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization


Authors: Haozhe Wang, Long Li, Chao Qu, Fengming Zhu, Weidi Xu, Wei Chu, Fangzhen Lin
Abstract: Recent advances in mathematical problem-solving with language models (LMs) integrate chain-of-thought (CoT) reasoning and code execution to harness their complementary strengths. However, existing hybrid frameworks exhibit a critical limitation: they depend on externally dictated instructions or rigid code-integration templates, lacking metacognitive awareness -- the capacity to dynamically evaluate intrinsic capabilities and autonomously determine when and how to integrate tools. This rigidity motivates our study of autonomous code integration, enabling models to adapt tool-usage strategies as their reasoning abilities evolve during training. While reinforcement learning (RL) shows promise for boosting LLM reasoning at scale (e.g., DeepSeek-R1), we demonstrate its inefficiency in learning autonomous code integration due to inadequate exploration of the vast combinatorial space of CoT-code interleaving patterns. To address this challenge, we propose a novel Expectation-Maximization (EM) framework that synergizes structured exploration (E-step) with off-policy RL optimization (M-step), creating a self-reinforcing cycle between metacognitive tool-use decisions and evolving capabilities. Experiments show that our method achieves superior results through improved exploration. Notably, our 7B model improves by over 11% on MATH500 and 9.4% on AIME without o1-like CoT.
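The EM framework described above can be pictured as an alternation between structured exploration (E-step) and an off-policy policy update (M-step). The toy Python sketch below illustrates that loop under stated assumptions only: the ToyPolicy class, its rollout and finetune methods, the 0/1 answer-match reward, and the per-mode preference counter are hypothetical stand-ins for the paper's language-model policy, not the authors' implementation.

import random

class ToyPolicy:
    """Hypothetical stand-in for the LM policy. A per-mode preference
    ("cot" vs. "code") rises when that mode succeeds, mimicking how
    tool-use decisions evolve with the model's capabilities."""
    def __init__(self):
        self.mode_pref = {"cot": 0.5, "code": 0.5}

    def rollout(self, problem, mode):
        # Toy dynamics: code execution helps when the problem needs a tool.
        p_correct = 0.8 if (mode == "code") == problem["needs_tool"] else 0.3
        prediction = problem["answer"] if random.random() < p_correct else None
        return {"mode": mode, "prediction": prediction}

    def finetune(self, successes):
        # Reward-weighted update: shift preference toward modes that won.
        for traj in successes:
            self.mode_pref[traj["mode"]] += 0.1

def e_step(policy, problem, n_samples=8):
    """Structured exploration: sample trajectories under both tool-use
    modes so the search covers pure-CoT and code-integrated solutions,
    rather than relying on naive on-policy sampling."""
    return [policy.rollout(problem, mode=m)
            for m in ("cot", "code") for _ in range(n_samples // 2)]

def m_step(policy, trajectories, answer):
    """Off-policy optimization: keep reward-1 trajectories and fine-tune
    on them (a reward-weighted surrogate for the RL objective)."""
    successes = [t for t in trajectories if t["prediction"] == answer]
    policy.finetune(successes)
    return len(successes)

if __name__ == "__main__":
    random.seed(0)
    policy = ToyPolicy()
    problem = {"needs_tool": True, "answer": 42}
    for it in range(5):  # alternate E- and M-steps
        trajs = e_step(policy, problem)
        n_ok = m_step(policy, trajs, problem["answer"])
        print(f"iter {it}: {n_ok}/8 correct, prefs={policy.mode_pref}")

The reward-weighted fine-tuning in m_step is a common surrogate for an off-policy RL update and is an assumption of this sketch; the self-reinforcing cycle the abstract describes corresponds to the loop in which exploration data improves the policy, which in turn changes future tool-use decisions.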
Comments: Accepted to ACL 2025
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2502.00691 [cs.AI]
  (or arXiv:2502.00691v4 [cs.AI] for this version)
  https://doi.org/10.48550/arXiv.2502.00691
arXiv-issued DOI via DataCite

Submission history

From: Haozhe Wang
[v1] Sun, 2 Feb 2025 06:32:23 UTC (3,374 KB)
[v2] Sun, 16 Feb 2025 07:18:23 UTC (627 KB)
[v3] Thu, 22 May 2025 05:17:03 UTC (627 KB)
[v4] Fri, 18 Jul 2025 07:40:22 UTC (620 KB)