A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback

Duan, Guoliang; Liu, Mingwei; Wang, Yanlin; Wang, Chong; Peng, Xin; Zheng, Zibin

arXiv:2507.00699 (cs)

[Submitted on 1 Jul 2025]

Abstract: Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. Existing benchmarks often prioritize functional correctness, overlooking the nuanced requirements found in real-world development. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluate instruction-following in code generation across multiple dimensions: constraint type, hierarchical levels, and iterative refinement. Built upon a structured taxonomy of 9 categories and 27 constraint types, MultiCodeIF enables granular assessment of both functional and non-functional instruction adherence. Using an automated pipeline, ConstraGen, we synthesize and evolve 2,021 code tasks sourced from 14 programming languages, supporting multi-turn evaluation through feedback-driven task variants. Empirical evaluation of six state-of-the-art LLMs uncovers substantial performance disparities. The top-performing model, Claude-3-7-Sonnet, achieves 63.0% average constraint satisfaction, while smaller models like Qwen3-1.7B fall to 44.8%. Models perform well on explicit constraints, but struggle with implicit or abstract constraints. Tasks with multiple hierarchical constraints significantly reduce model success rates, from 54.5% in single-level to just 18.8% in multi-level scenarios. However, structured feedback enables progressive improvement: average constraint satisfaction rises from 63.0% to 83.4% over four iterative refinement rounds. MultiCodeIF provides a scalable, constraint-aware, and feedback-sensitive framework to benchmark LLMs under realistic code generation scenarios, bridging the gap between synthetic evaluations and real-world instruction complexity. The full benchmark dataset, evaluation pipeline, and source code are available at https://github.com/SYSUSELab/MultiCodeIF.
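
To make the idea of constraint satisfaction under multi-turn feedback concrete, here is a minimal sketch in Python. It is not the MultiCodeIF or ConstraGen implementation: the `Constraint` type, the `satisfaction` and `refine` functions, the substring-based checkers, and the `toy_generate` stand-in for a model are all hypothetical names introduced for illustration. It assumes a constraint can be checked by a predicate over the generated code and that feedback is simply the list of violated constraint names.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: a named constraint with a predicate over generated code.
@dataclass
class Constraint:
    name: str
    check: Callable[[str], bool]

def satisfaction(code: str, constraints: List[Constraint]) -> float:
    """Fraction of constraints that the code satisfies (1.0 if there are none)."""
    if not constraints:
        return 1.0
    return sum(c.check(code) for c in constraints) / len(constraints)

def refine(generate: Callable[[List[str]], str],
           constraints: List[Constraint],
           max_rounds: int = 4) -> List[float]:
    """Run up to max_rounds feedback rounds and return the score of each round."""
    feedback: List[str] = []  # names of constraints violated in the previous round
    scores: List[float] = []
    for _ in range(max_rounds):
        code = generate(feedback)
        scores.append(satisfaction(code, constraints))
        feedback = [c.name for c in constraints if not c.check(code)]
        if not feedback:  # every constraint satisfied, stop early
            break
    return scores

# Toy stand-in for a model: it adds a docstring only when told it was missing.
def toy_generate(feedback: List[str]) -> str:
    if "has_docstring" in feedback:
        return 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
    return "def add(a, b):\n    return a + b\n"

constraints = [
    Constraint("defines_add", lambda s: "def add(" in s),
    Constraint("has_docstring", lambda s: '"""' in s),
]
scores = refine(toy_generate, constraints)
print(scores)
```

In this toy run the first round satisfies one of two constraints and the second round, after feedback, satisfies both. Averaging such per-round scores over many tasks would give a curve of the kind the abstract reports, though the actual benchmark's checkers and feedback format may differ.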

Subjects:	Software Engineering (cs.SE)
Cite as:	arXiv:2507.00699 [cs.SE]
	(or arXiv:2507.00699v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2507.00699

Submission history

From: Guoliang Duan
[v1] Tue, 1 Jul 2025 11:51:40 UTC (3,412 KB)
