HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection

Emery, Deanna; Goitia, Michael; Vargus, Freddie; Neagu, Iulia


arXiv:2505.00506v1 (cs)

[Submitted on 1 May 2025]



Abstract: As large language models (LLMs) are increasingly deployed in high-stakes domains, detecting hallucinated content (text that is not grounded in supporting evidence) has become a critical challenge. Existing benchmarks for hallucination detection are often synthetically generated, narrowly focused on extractive question answering, and fail to capture the complexity of real-world scenarios involving multi-document contexts and full-sentence outputs. We introduce the HalluMix Benchmark, a diverse, task-agnostic dataset that includes examples from a range of domains and formats. Using this benchmark, we evaluate seven hallucination detection systems, both open and closed source, highlighting differences in performance across tasks, document lengths, and input representations. Our analysis highlights substantial performance disparities between short and long contexts, with critical implications for real-world Retrieval Augmented Generation (RAG) implementations. Quotient Detections achieves the best overall performance, with an accuracy of 0.82 and an F1 score of 0.84.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2505.00506 [cs.CL]
	(or arXiv:2505.00506v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2505.00506

Submission history

From: Deanna Emery
[v1] Thu, 1 May 2025 13:22:45 UTC (657 KB)
