On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization

Crupi, Giuseppe; Tufano, Rosalia; Velasco, Alejandro; Mastropaolo, Antonio; Poshyvanyk, Denys; Bavota, Gabriele


arXiv:2507.16587v1 (cs)

[Submitted on 22 July 2025]



Abstract: Large Language Models (LLMs) have recently been exploited as judges for complex natural language processing tasks, such as Q&A. The basic idea is to delegate to an LLM the assessment of the "quality" of the output provided by an automated technique for tasks for which (i) quantitative metrics would only tell part of the story, and (ii) a large-scale human-based evaluation would be too expensive. LLMs-as-a-judge, if proven effective for a specific task, can also unlock new possibilities for automation, with several LLMs proposing a solution for a given instance of the task and others judging and deciding which output is the best one to show the user. We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization. The rationale for choosing these tasks is two-fold. First, quantitative metrics are usually not enough for the assessment of code summarizers/generators. For example, it is well documented that metrics such as BLEU are quite weak proxies for the quality of the generated summaries. Second, even state-of-the-art techniques still struggle with complex instances of these tasks, making them good candidates for benefiting from more advanced solutions envisioning collaboration among LLMs. For code generation, we check whether eight LLMs are able to judge the correctness of 1,405 Java methods and 1,281 Python functions generated by the same LLMs or implemented by humans. For code summarization, we compare the judgments of five LLMs to those provided by nine humans for ~1.2k summaries, related to both Java and Python functions. Our findings show that GPT-4-turbo is the best LLM in terms of judging capabilities for both tasks, with "smaller" LLMs featuring tens of billions of parameters not being able to cope with judging tasks. However, even the best-performing LLM frequently misjudges the correctness of the code and the quality of the summaries.

Comments:	Accepted at TSE. IEEE Transactions on Software Engineering
Subjects:	Software Engineering (cs.SE)
Cite as:	arXiv:2507.16587 [cs.SE]
	(or arXiv:2507.16587v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2507.16587

Submission history

From: Giuseppe Crupi
[v1] Tue, 22 Jul 2025 13:40:26 UTC (389 KB)
