Benchmarking LLMs for Unit Test Generation from Real-World Functions

Huang, Dong; Zhang, Jie M.; Harman, Mark; Zhang, Qianru; Du, Mingzhe; Ng, See-Kiong

Computer Science > Software Engineering

arXiv:2508.00408v1 (cs)

[Submitted on 1 Aug 2025]

Title: Benchmarking LLMs for Unit Test Generation from Real-World Functions

Authors: Dong Huang, Jie M. Zhang, Mark Harman, Qianru Zhang, Mingzhe Du, See-Kiong Ng

Abstract: Large language models (LLMs) have recently shown great promise in automating unit test generation, significantly reducing the manual effort required of developers. Evaluating their capabilities in this domain requires a well-designed benchmark that accurately reflects real-world scenarios and mitigates common pitfalls. Existing LLM test generation benchmarks suffer from two critical drawbacks: data contamination and structurally simple function code. As a result, we often cannot rely on the validity of scientific conclusions drawn from empirical studies that use these benchmarks: the evidence may be biased by contamination and, because of structural simplicity, may fail to generalize beyond toy programs. To address these problems, we introduce ULT (UnLeakedTestbench), a new benchmark specifically designed for function-level unit test generation from real-world Python functions. ULT is constructed through a multi-stage curation process that ensures high cyclomatic complexity and mitigates test case contamination. With 3,909 carefully selected function-level tasks, ULT provides a more realistic and challenging evaluation of LLMs' test generation capabilities. We also provide PLT (PreLeakedTestbench), a benchmark paired with ULT that contains leaked tests, designed to enable a controlled analysis of memorization versus reasoning in test generation. Our evaluation results demonstrate that ULT is significantly more challenging. For example, averaged over all evaluated LLMs, the generated test cases achieve only 41.32%, 45.10%, 30.22%, and 40.21% for accuracy, statement coverage, branch coverage, and mutation score, respectively. These results are substantially lower than the corresponding metrics on TestEval (91.79%, 92.18%, 82.04%, and 49.69%) and PLT (47.07%, 55.13%, 40.07%, and 50.80%).

Comments:	Under review
Subjects:	Software Engineering (cs.SE); Computation and Language (cs.CL)
Cite as:	arXiv:2508.00408 [cs.SE]
 	(or arXiv:2508.00408v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2508.00408

Submission history

From: Huang Dong
[v1] Fri, 1 Aug 2025 08:08:26 UTC (140 KB)
