Evaluating LLMs on Sequential API Call Through Automated Test Generation

Huang, Yuheng; Song, Da; Ji, Zhenlan; Wang, Shuai; Ma, Lei

arXiv:2507.09481 (cs)

[Submitted on 13 Jul 2025]

Title: Evaluating LLMs on Sequential API Call Through Automated Test Generation

Authors: Yuheng Huang, Da Song, Zhenlan Ji, Shuai Wang, Lei Ma

Abstract: By integrating tools from external APIs, Large Language Models (LLMs) have expanded their promising capabilities in a diverse spectrum of complex real-world tasks. However, testing, evaluation, and analysis of LLM tool use remain in their early stages. Most existing benchmarks rely on manually collected test cases, many of which cannot be automatically checked for semantic correctness and instead depend on static methods such as string matching. Additionally, these benchmarks often overlook the complex interactions that occur between sequential API calls, which are common in real-world applications. To fill this gap, we introduce StateGen, an automated framework designed to generate diverse coding tasks involving sequential API interactions. StateGen combines state-machine-based API constraint solving and validation, energy-based sampling, and control-flow injection to generate executable programs. These programs are then translated into human-like natural language task descriptions through a collaboration of two LLM agents. Utilizing StateGen, we construct StateEval, a benchmark encompassing 120 verified test cases spanning three representative scenarios: Session Service, Tensor Operation, and ElevenLabs MCP. Experimental results confirm that StateGen can effectively generate challenging and realistic API-oriented tasks, highlighting areas for improvement in current LLMs incorporating APIs.

Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2507.09481 [cs.SE]
	(or arXiv:2507.09481v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2507.09481

Submission history

From: Yuheng Huang
[v1] Sun, 13 Jul 2025 03:52:51 UTC (454 KB)
