Robust Layerwise Scaling Rules by Proper Weight Decay Tuning

Fan, Zhiyuan; Liu, Yifeng; Zhao, Qingyue; Yuan, Angela; Gu, Quanquan

arXiv:2510.15262v1 (cs)

[Submitted on 17 Oct 2025]

Abstract: Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization ($\mu$P) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading $\mu$P transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as $\sqrt{\eta/\lambda}$ with an approximately invariant shape; under width scaling $d$, we observe that the top singular value scales approximately as $\sqrt{\eta/\lambda}\cdot d^{0.75}$. Combining this observation with the $\mu$P learning-rate rule $\eta_2\propto d^{-1}$ for matrix-like parameters implies an empirical weight-decay scaling rule $\lambda_2\propto \sqrt{d}$ that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at $\eta_1=\Theta_d(1)$ and $\lambda_1=0$, this yields zero-shot transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend $\mu$P beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.
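The arithmetic behind the rule can be checked in a few lines. The sketch below is illustrative only and is not the authors' code: the function names, the constant `c`, and the proxy values are hypothetical. It applies $\eta_2\propto d^{-1}$ and $\lambda_2\propto\sqrt{d}$ to matrix-like parameters and evaluates the empirical trend $\sqrt{\eta/\lambda}\cdot d^{0.75}$ from the abstract. Since $\sqrt{d^{-1}/d^{1/2}}=d^{-0.75}$, the predicted top singular value should come out the same at every width.

```python
import math


def transfer_hparams(eta_proxy, lam_proxy, d_proxy, d_target):
    """Scale matrix-like AdamW hyperparameters from a proxy width to a target width.

    Uses the two rules stated in the abstract:
      eta_2    proportional to d^-1     (muP learning-rate rule)
      lambda_2 proportional to sqrt(d)  (empirical weight-decay rule)
    Vector-like parameters are not handled here; the abstract trains them
    with a width-independent learning rate and zero weight decay.
    """
    ratio = d_target / d_proxy
    eta_target = eta_proxy / ratio
    lam_target = lam_proxy * math.sqrt(ratio)
    return eta_target, lam_target


def predicted_top_sv(eta, lam, d, c=1.0):
    """Empirical trend from the abstract: sigma_max ~ c * sqrt(eta / lam) * d^0.75.

    The constant c is a hypothetical placeholder, not a measured value.
    """
    return c * math.sqrt(eta / lam) * d ** 0.75


# Hypothetical proxy settings, chosen only for illustration.
eta_p, lam_p, d_p = 1e-2, 0.1, 256

for d in (256, 1024, 4096):
    eta, lam = transfer_hparams(eta_p, lam_p, d_p, d)
    sv = predicted_top_sv(eta, lam, d)
    print(f"d={d:5d}  eta={eta:.3e}  lambda={lam:.3f}  predicted sigma_max={sv:.4f}")
```

In practice, the diagnostic the abstract describes would compare measured top singular values of trained weight matrices across widths. The sketch only confirms that the stated exponents cancel.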

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as:	arXiv:2510.15262 [cs.LG]
	(or arXiv:2510.15262v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2510.15262

Submission history

From: Zhiyuan Fan
[v1] Fri, 17 Oct 2025 02:58:35 UTC (137 KB)
