DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster

Qi, Ji; Zhu, WenPeng; Li, Li; Wu, Ming; Wu, YingJun; He, Wu; Gao, Xun; Zeng, Jason; Heinrich, Michael

Computer Science > Machine Learning

arXiv:2506.21263 (cs)

[Submitted on 26 June 2025]

Title: DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster

Authors: Ji Qi, WenPeng Zhu, Li Li, Ming Wu, YingJun Wu, Wu He, Xun Gao, Jason Zeng, Michael Heinrich

Abstract: The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper, we propose DiLoCoX, a low-communication large-scale decentralized cluster training framework. It combines Pipeline Parallelism with Dual Optimizer Policy, One-Step-Delay Overlap of Communication and Local Training, and an Adaptive Gradient Compression Scheme. This combination significantly improves the scale of parameters and the speed of model pre-training. We justify the benefits of one-step-delay overlap of communication and local training, as well as the adaptive gradient compression scheme, through a theoretical analysis of convergence. Empirically, we demonstrate that DiLoCoX is capable of pre-training a 107B foundation model over a 1 Gbps network. Compared to vanilla AllReduce, DiLoCoX can achieve a 357x speedup in distributed training while maintaining negligible degradation in model convergence. To the best of our knowledge, this is the first decentralized training framework successfully applied to models with over 100 billion parameters.
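
To give a feel for why a 1 Gbps link is considered slow at this scale, the following back-of-the-envelope sketch estimates how long it takes to move one full copy of a 107B-parameter model across such a link. It is not taken from the paper: the helper name `transfer_seconds` is hypothetical, the 2 bytes per parameter (16-bit values) is an assumption, and the estimate ignores protocol overhead and the extra traffic an AllReduce incurs.

```python
def transfer_seconds(num_params: float, bytes_per_param: int, link_gbps: float) -> float:
    """Seconds to push one full copy of the parameters through a link once.

    Assumes an ideal link with no protocol overhead (an illustrative simplification).
    """
    total_bits = num_params * bytes_per_param * 8  # payload size in bits
    return total_bits / (link_gbps * 1e9)          # link speed in bits per second


if __name__ == "__main__":
    # 107B parameters and 1 Gbps come from the abstract; 2 bytes/param is assumed.
    secs = transfer_seconds(107e9, 2, 1.0)
    print(f"{secs:.0f} s (~{secs / 60:.1f} min) per full-model transfer")
```

Under these assumptions a single full-model transfer takes roughly 1,712 seconds, or about 28.5 minutes, which suggests why synchronizing on every step is impractical over such a network and why reducing communication volume and overlapping it with local training matter.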

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2506.21263 [cs.LG]
	(or arXiv:2506.21263v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2506.21263

Submission history

From: Wu He
[v1] Thu, 26 Jun 2025 13:45:04 UTC (320 KB)
