Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Gao, Mingqi; Liu, Yixin; Hu, Xinyu; Wan, Xiaojun; Bragg, Jonathan; Cohan, Arman

arXiv:2501.00560 (cs)

[Submitted on 31 Dec 2024 (v1), last revised 11 Feb 2025 (this version, v2)]

Authors: Mingqi Gao, Yixin Liu, Xinyu Hu, Xiaojun Wan, Jonathan Bragg, Arman Cohan

Abstract: Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human evaluations, an automatic LLM bencher (i.e., an automatic evaluation framework that aims to rank LLMs based on their alignment with human preferences) is indispensable. An automatic LLM bencher consists of four components: the input set (e.g., a user instruction), the evaluation model (e.g., an LLM), the evaluation type (e.g., pairwise comparison), and the aggregation method (e.g., the ELO rating system). However, previous work has not thoroughly explored how to select these components or how their different combinations influence the results. In this work, through controlled experiments, we provide a series of recommendations on how to choose each component to better automate the evaluation of LLMs. Furthermore, we discovered that when evaluating LLMs with similar performance, the performance of the automatic LLM bencher declines sharply, underscoring the limitations of current benchers and calling for future work. Lastly, we found that the evaluation models' performance at the instance level (e.g., the accuracy of selecting the best output) does not always align with their effectiveness when used as a component of a bencher, highlighting the importance of dedicated system-level evaluation of benchers.
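
To make the four-component description concrete, here is a minimal sketch of how pairwise judgments might be aggregated into a system ranking and then checked against a human ranking at the system level. It is an illustration under stated assumptions, not the paper's implementation: the system names, the judgment data, the function names `elo_rank` and `kendall_tau`, the K-factor of 32, and the use of Kendall's tau as the agreement measure are all hypothetical choices made for this example.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def elo_rank(
    judgments: Iterable[Tuple[str, str, float]],
    k: float = 32.0,            # assumed K-factor, not taken from the paper
    base_rating: float = 1000.0,
) -> List[Tuple[str, float]]:
    """Aggregate pairwise outcomes into ratings with sequential Elo updates.

    Each judgment is (system_a, system_b, score_a), where score_a is 1.0 if
    the evaluation model preferred system_a, 0.0 if it preferred system_b,
    and 0.5 for a tie.
    """
    ratings: Dict[str, float] = {}
    for a, b, score_a in judgments:
        ra = ratings.setdefault(a, base_rating)
        rb = ratings.setdefault(b, base_rating)
        # Expected score of system_a under the standard Elo logistic curve.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta  # zero-sum update
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


def kendall_tau(rank_a: List[str], rank_b: List[str]) -> float:
    """Kendall's tau between two tie-free rankings of the same systems."""
    pos_b = {name: i for i, name in enumerate(rank_b)}
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # rank_a places item i ahead of item j; check whether rank_b agrees.
        if pos_b[rank_a[i]] < pos_b[rank_a[j]]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical pairwise verdicts from an evaluation model on one input set.
judgments = [
    ("sys_a", "sys_b", 1.0),
    ("sys_a", "sys_c", 1.0),
    ("sys_b", "sys_c", 1.0),
]
auto_ranking = [name for name, _ in elo_rank(judgments)]
human_ranking = ["sys_a", "sys_b", "sys_c"]  # hypothetical human preference order
agreement = kendall_tau(auto_ranking, human_ranking)
```

Sequential Elo updates depend on the order in which judgments are processed, so a practical setup would likely shuffle or repeat the comparisons; that detail is left out to keep the sketch short.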

Comments:	Findings of NAACL 2025
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2501.00560 [cs.CL]
	(or arXiv:2501.00560v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.00560

Submission history

From: Mingqi Gao
[v1] Tue, 31 Dec 2024 17:46:51 UTC (401 KB)
[v2] Tue, 11 Feb 2025 10:02:55 UTC (406 KB)
