Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

Sculley, D.; Cukierski, Will; Culliton, Phil; Dane, Sohier; Demkin, Maggie; Holbrook, Ryan; Howard, Addison; Mooney, Paul; Reade, Walter; Risdal, Megan; Keating, Nate

arXiv:2505.00612 (cs)

[Submitted on 1 May 2025 (v1), last revised 29 May 2025 (this version, v2)]

Authors: D. Sculley, Will Cukierski, Phil Culliton, Sohier Dane, Maggie Demkin, Ryan Holbrook, Addison Howard, Paul Mooney, Walter Reade, Megan Risdal, Nate Keating

Abstract: In this position paper, we observe that empirical evaluation in Generative AI is at a crisis point since traditional ML evaluation and benchmarking strategies are insufficient to meet the needs of evaluating modern GenAI models and systems. There are many reasons for this, including the fact that these models typically have nearly unbounded input and output spaces, typically do not have a well-defined ground truth target, and typically exhibit strong feedback loops and prediction dependence based on the context of previous model outputs. On top of these critical issues, we argue that the problems of leakage and contamination are in fact the most important and difficult issues to address for GenAI evaluations. Interestingly, the field of AI Competitions has developed effective measures and practices to combat leakage for the purpose of counteracting cheating by bad actors within a competition setting. This makes AI Competitions an especially valuable (but underutilized) resource. Now is the time for the field to view AI Competitions as the gold standard for empirical rigor in GenAI evaluation, and to harness and harvest their results with according value.

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2505.00612 [cs.AI]
	(or arXiv:2505.00612v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2505.00612

Submission history

From: Megan Risdal
[v1] Thu, 1 May 2025 15:43:51 UTC (127 KB)
[v2] Thu, 29 May 2025 01:48:23 UTC (761 KB)
