DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

Li, Yinsheng; Dong, Zhen; Shao, Yi

arXiv:2507.11527 (cs)

[Submitted on 15 July 2025]

Abstract: Large Language Model (LLM) agents have shown great potential for solving real-world problems and promise to be a solution for task automation in industry. However, more benchmarks are needed to systematically evaluate automation agents from an industrial perspective, for example in civil engineering. Therefore, we propose DrafterBench for the comprehensive evaluation of LLM agents in the context of technical drawing revision, a representative task in civil engineering. DrafterBench contains twelve types of tasks summarized from real-world drawing files, with 46 customized functions/tools and 1920 tasks in total. DrafterBench is an open-source benchmark that rigorously tests AI agents' proficiency in interpreting intricate and long-context instructions, leveraging prior knowledge, and adapting to dynamic instruction quality via implicit policy awareness. The toolkit comprehensively assesses distinct capabilities in structured data comprehension, function execution, instruction following, and critical reasoning. DrafterBench offers detailed analysis of task accuracy and error statistics, aiming to provide deeper insight into agent capabilities and identify improvement targets for integrating LLMs in engineering applications. Our benchmark is available at https://github.com/Eason-Li-AIS/DrafterBench, with the test set hosted at https://huggingface.co/datasets/Eason666/DrafterBench.

Comments:	Project page: https://github.com/Eason-Li-AIS/DrafterBench
Subjects:	Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Cite as:	arXiv:2507.11527 [cs.AI]
	(or arXiv:2507.11527v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2507.11527

Submission history

From: Yinsheng Li
[v1] Tue, 15 Jul 2025 17:56:04 UTC (19,673 KB)
