SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments

Arora, Avi; Jang, Jinu; Moghaddam, Roshanak Zilouchian

Computer Science > Software Engineering

arXiv:2507.09063v1 (cs)

[Submitted on 11 Jul 2025]

Authors: Avi Arora, Jinu Jang, Roshanak Zilouchian Moghaddam

Abstract: Modern Large Language Model (LLM) agents promise end-to-end assistance with real-world software tasks, yet existing benchmarks evaluate LLM agents almost exclusively in pre-baked environments where every dependency is pre-installed. To fill this gap, we introduce SetupBench, a 93-instance benchmark that isolates the environment-bootstrap skill: starting from a bare Linux sandbox, an agent must install packages, resolve dependency conflicts, initialize databases, and configure background services. Our tasks span seven language ecosystems, five database engines, and multi-service orchestration scenarios, each accompanied by a natural language problem statement and a deterministic success command. Through evaluation of OpenHands, a state-of-the-art coding agent, we find low success rates across task categories, with particular challenges in repository setup (38.9-57.4%) and local database configuration (20.0-53.3%). Our analysis reveals systematic failure modes, including incomplete development tooling installation, hallucinated task constraints, and non-persistent environment modifications that break agent-human collaboration workflows. We identify substantial inefficiencies in agent exploration strategies, with 38-89% of actions being unnecessary compared to optimal human behavior. These findings highlight gaps in current agents' practical environment-bootstrap capabilities. By targeting this critical yet under-evaluated capability, SetupBench provides a rigorous yardstick for the next generation of software developer agents aiming to solve end-to-end real-world tasks.
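
The pairing of a natural language problem statement with a deterministic success command can be illustrated with a minimal sketch. It is not taken from the paper: the `SetupTask` record, its field names, and the helpers `run_success_command` and `success_rate` are hypothetical, and the sketch assumes only that a task counts as solved when its success command exits with status 0.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupTask:
    """Hypothetical task record; the field names are illustrative assumptions,
    not SetupBench's actual schema."""
    task_id: str
    category: str
    problem_statement: str
    success_command: str


def run_success_command(task: SetupTask, timeout_s: float = 60.0) -> bool:
    """Run the task's check in a shell; treat exit status 0 as success (assumption)."""
    try:
        result = subprocess.run(
            task.success_command,
            shell=True,            # the check is given as a single shell string
            capture_output=True,   # keep the check's output off the console
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A check that hangs is counted as a failure in this sketch.
        return False
    return result.returncode == 0


def success_rate(outcomes: list[bool]) -> float:
    """Percentage of passing tasks; 0.0 when there are no outcomes."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)
```

In this sketch a harness would call `run_success_command` once per task after the agent finishes and feed the resulting booleans to `success_rate` to obtain a per-category percentage. The check is deterministic only to the extent that the command itself is.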

Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2507.09063 [cs.SE]
	(or arXiv:2507.09063v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2507.09063

Submission history

From: Roshanak Zilouchian Moghaddam
[v1] Fri, 11 Jul 2025 22:45:07 UTC (38 KB)
