GenTREC: The First Test Collection Generated by Large Language Models for Evaluating Information Retrieval Systems

Türkmen, Mehmet Deniz; Kutlu, Mucahid; Altun, Bahadir; Cosgun, Gokalp

Computer Science > Information Retrieval

arXiv:2501.02408 (cs)

[Submitted on 5 Jan 2025]

Title: GenTREC: The First Test Collection Generated by Large Language Models for Evaluating Information Retrieval Systems

Authors: Mehmet Deniz Türkmen, Mucahid Kutlu, Bahadir Altun, Gokalp Cosgun

Abstract: Building test collections for Information Retrieval evaluation has traditionally been a resource-intensive and time-consuming task, primarily due to the dependence on manual relevance judgments. While various cost-effective strategies have been explored, the development of such collections remains a significant challenge. In this paper, we present GenTREC, the first test collection constructed entirely from documents generated by a Large Language Model (LLM), eliminating the need for manual relevance judgments. Our approach is based on the assumption that documents generated by an LLM are inherently relevant to the prompts used for their generation. Based on this heuristic, we utilized existing TREC search topics to generate documents. We consider a document relevant only to the prompt that generated it, while other document-topic pairs are treated as non-relevant. To introduce realistic retrieval challenges, we also generated non-relevant documents, ensuring that IR systems are tested against a diverse and robust set of materials. The resulting GenTREC collection comprises 96,196 documents, 300 topics, and 18,964 relevance "judgments". We conducted extensive experiments to evaluate GenTREC in terms of document quality, relevance judgment accuracy, and evaluation reliability. Notably, our findings indicate that the ranking of IR systems using GenTREC is compatible with the evaluations conducted using traditional TREC test collections, particularly for P@100, MAP, and RPrec metrics. Overall, our results show that our proposed approach offers a promising, low-cost alternative for IR evaluation, significantly reducing the burden of building and maintaining future IR evaluation resources.
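The relevance heuristic described in the abstract can be illustrated with a minimal Python sketch. It assumes each generated document is stored together with the ID of the topic whose prompt produced it (or `None` for a deliberately non-relevant document). The function names, the data layout, and the toy IDs are hypothetical and are not taken from the paper; the metric functions follow the standard textbook definitions of P@k, average precision, and R-precision.

```python
from collections import defaultdict


def build_qrels(generated_docs):
    """Derive relevance 'judgments' from generation provenance.

    generated_docs: iterable of (doc_id, topic_id) pairs, where topic_id is
    the topic whose prompt produced the document, or None for a document
    generated as a non-relevant distractor (assumed layout, for illustration).
    """
    qrels = defaultdict(set)
    for doc_id, topic_id in generated_docs:
        if topic_id is not None:
            # A document is relevant only to the topic that generated it.
            qrels[topic_id].add(doc_id)
    return dict(qrels)


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranking, relevant, r) if r else 0.0


# Toy example with made-up IDs.
docs = [("d1", "t1"), ("d2", "t1"), ("d3", "t2"), ("d4", None)]
qrels = build_qrels(docs)
run_t1 = ["d1", "d4", "d2", "d3"]  # hypothetical system ranking for topic t1

p_at_2 = precision_at_k(run_t1, qrels["t1"], 2)
ap = average_precision(run_t1, qrels["t1"])
rprec = r_precision(run_t1, qrels["t1"])
```

MAP is then the mean of `average_precision` over all topics. Every document-topic pair absent from `qrels` is treated as non-relevant, which mirrors the assumption stated in the abstract.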

Subjects:	Information Retrieval (cs.IR)
Cite as:	arXiv:2501.02408 [cs.IR]
	(or arXiv:2501.02408v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2501.02408

Submission history

From: Mucahid Kutlu
[v1] Sun, 5 Jan 2025 00:27:36 UTC (3,762 KB)
