FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming

Beniamini, Gal; Dor, Yuval; Vinnikov, Alon; Peled, Shir Granot; Weinstein, Or; Sharir, Or; Wies, Noam; Nussbaum, Tomer; Shaul, Ido Ben; Zekharya, Tomer; Levine, Yoav; Shalev-Shwartz, Shai; Shashua, Amnon

arXiv:2507.13337 (cs)

[Submitted on 17 Jul 2025]

Title: FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming

Authors: Gal Beniamini, Yuval Dor, Alon Vinnikov, Shir Granot Peled, Or Weinstein, Or Sharir, Noam Wies, Tomer Nussbaum, Ido Ben Shaul, Tomer Zekharya, Yoav Levine, Shai Shalev-Shwartz, Amnon Shashua

Abstract: Frontier AI models demonstrate formidable breadth of knowledge. But how close are they to true human -- or superhuman -- expertise? Genuine experts can tackle the hardest problems and push the boundaries of scientific understanding. To illuminate the limits of frontier model capabilities, we turn away from contrived competitive programming puzzles, and instead focus on real-life research problems. We construct FormulaOne, a benchmark that lies at the intersection of graph theory, logic, and algorithms, all well within the training distribution of frontier models. Our problems are incredibly demanding, requiring an array of reasoning steps. The dataset has three key properties. First, it is of commercial interest and relates to practical large-scale optimisation problems, such as those arising in routing, scheduling, and network design. Second, it is generated from the highly expressive framework of Monadic Second-Order (MSO) logic on graphs, paving the way toward automatic problem generation at scale, which is ideal for building RL environments. Third, many of our problems are intimately related to the frontier of theoretical computer science, and to central conjectures therein, such as the Strong Exponential Time Hypothesis (SETH). As such, any significant algorithmic progress on our dataset, beyond known results, could carry profound theoretical implications. Remarkably, state-of-the-art models like OpenAI's o3 fail entirely on FormulaOne, solving less than 1% of the questions, even when given 10 attempts and explanatory few-shot examples -- highlighting how far they remain from expert-level understanding in some domains. To support further research, we additionally curate FormulaOne-Warmup, offering a set of simpler tasks from the same distribution. We release the full corpus along with a comprehensive evaluation framework.

Subjects:	Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Logic (math.LO)
Cite as:	arXiv:2507.13337 [cs.AI]
	(or arXiv:2507.13337v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2507.13337

Submission history

From: Gal Beniamini
[v1] Thu, 17 Jul 2025 17:53:55 UTC (387 KB)
