EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta

Bernard, Raymond; Raza, Shaina; Das, Subhabrata; Murugan, Rahul

arXiv:2501.00257v1 (cs)

[Submitted on 31 Dec 2024]

Title: EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta

Authors: Raymond Bernard, Shaina Raza (PhD), Subhabrata Das (PhD), Rahul Murugan

Abstract: Despite the remarkable coherence of Large Language Models (LLMs), existing evaluation methods often suffer from fluency bias and rely heavily on multiple-choice formats, making it difficult to assess factual accuracy and complex reasoning effectively. LLMs thus frequently generate factually inaccurate responses, especially in complex reasoning tasks, highlighting two prominent challenges: (1) the inadequacy of existing methods to evaluate reasoning and factual accuracy effectively, and (2) the reliance on human evaluators for nuanced judgment, as illustrated by Williams and Huckle (2024)[1], who found manual grading indispensable despite advances in automated grading. To address evaluation gaps in open-ended reasoning tasks, we introduce the EQUATOR Evaluator (Evaluation of Question Answering Thoroughness in Open-ended Reasoning). This framework combines deterministic scoring with a focus on factual accuracy and robust reasoning assessment. Using a vector database, EQUATOR pairs open-ended questions with human-evaluated answers, enabling more precise and scalable evaluations. In practice, EQUATOR significantly reduces reliance on human evaluators for scoring and improves scalability compared to the methods of Williams and Huckle (2024)[1]. Our results demonstrate that this framework significantly outperforms traditional multiple-choice evaluations while maintaining high accuracy standards. Additionally, we introduce an automated evaluation process leveraging smaller, locally hosted LLMs. We used LLaMA 3.2B, running on the Ollama binaries, to streamline our assessments. This work establishes a new paradigm for evaluating LLM performance, emphasizing factual accuracy and reasoning ability, and provides a robust methodological foundation for future research.
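The abstract does not give the storage schema, the embedding model, or the exact scoring rule, so the following is only a minimal sketch of the general idea it describes: open-ended questions are paired with human-evaluated answers in a vector store, the closest stored question is retrieved, and the candidate response is scored deterministically against its answer key. Every name here (`embed`, `AnswerKeyStore`, `score`) is hypothetical, the bag-of-words embedding stands in for a real embedding model, and the term-matching rule stands in for the locally hosted LLM evaluator mentioned in the abstract.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: a real setup would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AnswerKeyStore:
    """Hypothetical in-memory stand-in for a vector database of human-evaluated answers."""

    def __init__(self):
        self._rows = []  # each row: (question embedding, question text, answer key)

    def add(self, question, answer_key):
        self._rows.append((embed(question), question, answer_key))

    def lookup(self, question):
        """Return (stored question, answer key) for the most similar stored question."""
        query = embed(question)
        # max() keeps the first maximal row, so ties resolve by insertion order.
        _, matched, key = max(self._rows, key=lambda row: cosine(query, row[0]))
        return matched, key


def score(candidate, answer_key):
    """Deterministic binary score: 1 if every answer-key term occurs in the candidate."""
    candidate_terms = set(embed(candidate))
    return int(all(term in candidate_terms for term in embed(answer_key)))


if __name__ == "__main__":
    store = AnswerKeyStore()
    store.add("How many legs does a spider have?", "eight")
    store.add("What is the boiling point of water at sea level in Celsius?", "100")
    _, key = store.lookup("At sea level, what is the boiling point of water?")
    print(score("Water boils at 100 degrees Celsius.", key))
```

Because both the retrieval and the scoring rule are pure functions of their inputs, repeated runs on the same response give the same score, which is the property the word "deterministic" points at. A term-matching rule is far cruder than an LLM-based grader, so this shows only the data flow, not the evaluation quality.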

Subjects:	Computation and Language (cs.CL)
MSC classes:	68T20
ACM classes:	I.2.7; I.2.6; H.3.3
Cite as:	arXiv:2501.00257 [cs.CL]
	(or arXiv:2501.00257v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.00257

Submission history

From: Raymond Bernard
[v1] Tue, 31 Dec 2024 03:56:17 UTC (1,404 KB)
