R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training

Ge, Albert; Huang, Tzu-Heng; Cooper, John; Trost, Avi; Chu, Ziyi; GNVV, Satya Sai Srinath Namburi; Cai, Ziyang; Park, Kendall; Roberts, Nicholas; Sala, Frederic

Computer Science > Machine Learning

arXiv:2505.00358 (cs)

[Submitted on 1 May 2025]

Title: R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training

Authors: Albert Ge, Tzu-Heng Huang, John Cooper, Avi Trost, Ziyi Chu, Satya Sai Srinath Namburi GNVV, Ziyang Cai, Kendall Park, Nicholas Roberts, Frederic Sala

Abstract: Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity (Regroup) to create finer-grained domains, and efficiently optimizes the data composition (Balance) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify R&B's effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of R&B on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01% additional compute overhead, R&B matches or exceeds the performance of state-of-the-art data mixing strategies.
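The two steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the clustering method or the exact weight update, so everything below is an assumption made for illustration: plain k-means stands in for the Regroup step, a softmax over Gram-matrix row sums stands in for the Balance step, and the names `regroup`, `balance`, and `temperature` are hypothetical rather than taken from the paper.

```python
import numpy as np

def regroup(embeddings, k, iters=20, seed=0):
    """Assumed stand-in for Regroup: k-means over example embeddings.

    Returns one domain id per example, taken from the last assignment step.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct examples (fancy indexing copies).
    centers = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every example to every center, shape (n, k).
        d = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = embeddings[labels == j]
            if len(members):  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return labels

def balance(domain_grads, temperature=1.0):
    """Assumed stand-in for Balance: weights from a gradient Gram matrix.

    domain_grads has shape (k, d), one flattened gradient per domain.
    The softmax-of-row-sums rule is illustrative, not the paper's update.
    """
    G = domain_grads @ domain_grads.T        # (k, k) Gram matrix
    scores = G.sum(axis=1) / temperature     # alignment with all domains
    scores = scores - scores.max()           # numerical stability
    w = np.exp(scores)
    return w / w.sum()                       # mixture weights, sum to 1
```

In this sketch, domains whose gradients align with most other domains receive larger sampling weights. Because the gradients would already be available from the training step, building the Gram matrix adds only a `k × k` product, which is in the spirit of the low overhead the abstract reports.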

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2505.00358 [cs.LG]
	(or arXiv:2505.00358v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2505.00358

Submission history

From: Albert Ge
[v1] Thu, 1 May 2025 07:08:19 UTC (2,893 KB)
