RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

Bi, Baolong; Liu, Shenghua; Ren, Xingzhang; Liu, Dayiheng; Lin, Junyang; Wang, Yiwei; Mei, Lingrui; Fang, Junfeng; Guo, Jiafeng; Cheng, Xueqi

arXiv:2507.03253 (cs)

[Submitted on 4 Jul 2025 (v1), last revised 8 Jul 2025 (this version, v2)]

Authors: Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng

Abstract: The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose $\textbf{RefineX}$, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks, and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and ProX-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.
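To make the phrase "distilling end-to-end refinement results into minimal edit-based deletion programs" concrete, here is a minimal sketch of one possible reading. It is not the authors' implementation. It assumes line-level granularity, uses Python's standard difflib to compare a raw document with an expert-refined version, and keeps only the deletions while ignoring any insertions or replacements. The function names distill_deletion_program and apply_deletion_program are hypothetical.

```python
import difflib


def distill_deletion_program(raw_lines, refined_lines):
    """Diff raw vs. refined text and keep only the deletions.

    Returns a list of (start, end) half-open index ranges into raw_lines.
    Insert and replace operations are ignored (an assumption of this sketch),
    so the resulting program can only remove text that exists in the raw input.
    """
    matcher = difflib.SequenceMatcher(a=raw_lines, b=refined_lines, autojunk=False)
    program = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            program.append((i1, i2))
    return program


def apply_deletion_program(raw_lines, program):
    """Apply a deletion-only program to the raw lines."""
    drop = set()
    for start, end in program:
        drop.update(range(start, end))
    return [line for idx, line in enumerate(raw_lines) if idx not in drop]


if __name__ == "__main__":
    raw = [
        "Home | Login | Share",
        "LLMs depend on data quality.",
        "Click here to subscribe!",
        "Filtering works at the document level.",
    ]
    # Hypothetical expert output: removes noise but also adds a new sentence.
    refined = [
        "LLMs depend on data quality.",
        "Filtering works at the document level.",
        "An invented summary sentence.",
    ]
    program = distill_deletion_program(raw, refined)
    print(program)
    print(apply_deletion_program(raw, program))
```

Because the program contains only deletions, every surviving line comes verbatim from the raw document, which matches the abstract's stated goal of preserving the diversity and naturalness of raw text. The sentence added by the hypothetical expert output is dropped rather than copied into the corpus.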

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2507.03253 [cs.CL]
	(or arXiv:2507.03253v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2507.03253

Submission history

From: Baolong Bi
[v1] Fri, 4 Jul 2025 02:19:58 UTC (1,302 KB)
[v2] Tue, 8 Jul 2025 18:15:09 UTC (1,302 KB)
