RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions

Liu, Wanlong; Chen, Junying; Ji, Ke; Zhou, Li; Chen, Wenyu; Wang, Benyou

arXiv:2501.00353 (cs)

[Submitted on 31 December 2024]

Title: RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions

Authors: Wanlong Liu, Junying Chen, Ke Ji, Li Zhou, Wenyu Chen, Benyou Wang

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they cover only a limited range of RAG scenarios, and (2) they suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at https://github.com/FreedomIntelligence/RAG-Instruct.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2501.00353 [cs.CL]
	(or arXiv:2501.00353v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.00353

Submission history

From: Junying Chen
[v1] Tue, 31 Dec 2024 09:00:51 UTC (5,206 KB)
