Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents

Hariharan, Kaivalya; Girit, Uzay; Wang, Atticus; Andreas, Jacob


arXiv:2506.00172 (cs)

[Submitted on 30 May 2025]

Title: Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents

Authors: Kaivalya Hariharan, Uzay Girit, Atticus Wang, Jacob Andreas

Abstract: Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g., SWE-bench) rely on manually curated issues, so expanding or tuning difficulty demands expensive human effort, and evaluations quickly saturate. However, many real-world tasks, such as software engineering or scientific research, require agents to rapidly comprehend and manipulate novel, complex structures dynamically; evaluating these capabilities requires the ability to construct large and varied sets of problems for agents to solve. We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions within real-world software repositories. Breakpoint systematically controls task difficulty along two clear dimensions: local reasoning (characterized by code complexity metrics such as cyclomatic complexity) and system-level reasoning (characterized by call-graph centrality and the number of simultaneously corrupted interdependent functions). In experiments across more than 900 generated tasks, we demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
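The abstract names two kinds of difficulty signal: a local one (cyclomatic complexity) and a system-level one (call-graph centrality). As a rough illustration only, and not the authors' implementation, the sketch below estimates both for Python source using the standard-library `ast` module. The function names `cyclomatic_complexity`, `call_graph`, and `in_degree`, the `SAMPLE` snippet, and the choice of in-degree as a stand-in for centrality are all assumptions made for this example; the paper may define and compute these quantities differently.

```python
import ast

# Node types counted as decision points in this simplified estimate (an assumption).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def cyclomatic_complexity(func: ast.FunctionDef) -> int:
    """Approximate McCabe complexity: 1 plus the number of decision points."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` contributes two extra short-circuit paths.
            score += len(node.values) - 1
    return score


def call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the top-level functions it calls by bare name."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph: dict[str, set[str]] = {}
    for name, node in funcs.items():
        callees = set()
        for sub in ast.walk(node):
            if (isinstance(sub, ast.Call)
                    and isinstance(sub.func, ast.Name)
                    and sub.func.id in funcs):
                callees.add(sub.func.id)
        graph[name] = callees
    return graph


def in_degree(graph: dict[str, set[str]]) -> dict[str, int]:
    """Count distinct callers per function; a simple proxy for centrality."""
    counts = {name: 0 for name in graph}
    for caller, callees in graph.items():
        for callee in callees:
            if callee != caller:  # ignore direct recursion
                counts[callee] += 1
    return counts


# Hypothetical sample module used only to exercise the helpers above.
SAMPLE = '''
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def scale(x):
    return clamp(x * 2, 0, 10)

def run(values):
    return [scale(v) for v in values if v is not None and v >= 0]
'''

if __name__ == "__main__":
    tree = ast.parse(SAMPLE)
    degrees = in_degree(call_graph(SAMPLE))
    for fn in tree.body:
        if isinstance(fn, ast.FunctionDef):
            print(fn.name, cyclomatic_complexity(fn), degrees[fn.name])
```

Under this sketch, a function with many branches scores high on the local axis, while a function with many callers scores high on the system-level axis. Resolving calls by bare name misses methods, imports, and dynamic dispatch, so a real call graph would need a more careful analysis.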

Comments:	21 pages, 14 figures
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2506.00172 [cs.LG]
	(or arXiv:2506.00172v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2506.00172

Submission history

From: Kaivalya Hariharan
[v1] Fri, 30 May 2025 19:23:51 UTC (4,150 KB)
