LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

Hashemi, Helia; Eisner, Jason; Rosset, Corby; Van Durme, Benjamin; Kedzie, Chris

doi:10.18653/v1/2024.acl-long.745

arXiv:2501.00274 (cs)

[Submitted on 31 Dec 2024]

Authors: Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, Chris Kedzie

Abstract: This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges -- indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be $\textit{combined}$ to $\textit{predict}$ each human judge's annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9 questions (assessing dimensions such as naturalness, conciseness, and citation quality) predicts human judges' assessment of overall user satisfaction, on a scale of 1--4, with RMS error $< 0.5$, a $2\times$ improvement over the uncalibrated baseline.
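The calibration step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the layer sizes, the tanh activation, the additive combination of shared and per-judge weights, and all names (`CalibrationNet`, `expected_rating`, `rmse`, `judge_a`) are assumptions made for illustration. Training is omitted, so the network below only shows the forward pass with random weights.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class CalibrationNet:
    """Hypothetical one-hidden-layer network with judge-independent
    (shared) weights plus a separate weight matrix per judge."""

    def __init__(self, n_questions, n_options, n_hidden, judges, seed=0):
        rng = random.Random(seed)
        n_in = n_questions * n_options
        self.shared = init_matrix(n_hidden, n_in, rng)
        self.per_judge = {j: init_matrix(n_hidden, n_in, rng) for j in judges}
        self.out = init_matrix(n_options, n_hidden, rng)

    def predict(self, llm_dists, judge):
        """Map the LLM's per-question distributions to a predicted
        distribution over one judge's answer to the summary question."""
        x = [p for dist in llm_dists for p in dist]  # flatten the inputs
        shared_pre = matvec(self.shared, x)
        judge_pre = matvec(self.per_judge[judge], x)
        hidden = [math.tanh(s + j) for s, j in zip(shared_pre, judge_pre)]
        return softmax(matvec(self.out, hidden))


def expected_rating(dist):
    """Expected value of a distribution over ratings 1..len(dist)."""
    return sum((k + 1) * p for k, p in enumerate(dist))


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Usage: 9 rubric questions, each answered on a 1-4 scale.
net = CalibrationNet(n_questions=9, n_options=4, n_hidden=8,
                     judges=["judge_a", "judge_b"])
llm_dists = [[0.25, 0.25, 0.25, 0.25] for _ in range(9)]  # placeholder inputs
dist = net.predict(llm_dists, "judge_a")
score = expected_rating(dist)
```

In a trained version of such a model, the predicted expected ratings would be compared against each judge's actual ratings with `rmse`, which is the kind of error figure the abstract reports.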

Comments:	Updated version of 17 June 2024
Subjects:	Computation and Language (cs.CL)
ACM classes:	I.2.1; I.2.6; I.2.7
Cite as:	arXiv:2501.00274 [cs.CL]
	(or arXiv:2501.00274v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.00274
Journal reference:	Proceedings of ACL 2024 (Volume 1: Long Papers), pp. 13806-13834
Related DOI:	https://doi.org/10.18653/v1/2024.acl-long.745

Submission history

From: Jason Eisner
[v1] Tue, 31 Dec 2024 04:57:01 UTC (606 KB)
