High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation

Ba, Jimmy; Erdogdu, Murat A.; Suzuki, Taiji; Wang, Zhichao; Wu, Denny; Yang, Greg


arXiv:2205.01445 (stat)

[Submitted on 3 May 2022]

Abstract: We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer neural network: $f(\boldsymbol{x}) = \frac{1}{\sqrt{N}}\boldsymbol{a}^\top\sigma(\boldsymbol{W}^\top\boldsymbol{x})$, where $\boldsymbol{W}\in\mathbb{R}^{d\times N}, \boldsymbol{a}\in\mathbb{R}^{N}$ are randomly initialized, and the training objective is the empirical MSE loss: $\frac{1}{n}\sum_{i=1}^n (f(\boldsymbol{x}_i)-y_i)^2$. In the proportional asymptotic limit where $n,d,N\to\infty$ at the same rate, and an idealized student-teacher setting, we show that the first gradient update contains a rank-1 "spike", which results in an alignment between the first-layer weights and the linear component of the teacher model $f^*$. To characterize the impact of this alignment, we compute the prediction risk of ridge regression on the conjugate kernel after one gradient step on $\boldsymbol{W}$ with learning rate $\eta$, when $f^*$ is a single-index model. We consider two scalings of the first step learning rate $\eta$. For small $\eta$, we establish a Gaussian equivalence property for the trained feature map, and prove that the learned kernel improves upon the initial random features model, but cannot defeat the best linear model on the input. Whereas for sufficiently large $\eta$, we prove that for certain $f^*$, the same ridge estimator on trained features can go beyond this "linear regime" and outperform a wide range of random features and rotationally invariant kernels. Our results demonstrate that even one gradient step can lead to a considerable advantage over random features, and highlight the role of learning rate scaling in the initial phase of training.
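
The setup in the abstract can be explored numerically at finite size. The following is a minimal NumPy sketch, not the authors' code: it takes one gradient step on $\boldsymbol{W}$ for the stated MSE loss, checks how close the gradient is to rank 1 and how well its top singular direction aligns with the teacher direction, and then fits ridge regression on the features before and after the step. Everything not stated in the abstract is an assumption made for illustration: the `tanh` activation, the ReLU single-index teacher with direction `beta`, the Gaussian inputs, the initialization scales, the `sqrt(N)` factor in the update, the ridge penalty `lam`, and the function name `one_step_experiment`.

```python
import numpy as np


def one_step_experiment(n=4000, d=100, N=200, eta=1.0, lam=1e-2, seed=0):
    """One gradient step on W for f(x) = a^T tanh(W^T x) / sqrt(N), then ridge on features."""
    rng = np.random.default_rng(seed)

    # Hypothetical single-index teacher: y = relu(<x, beta>) plus small label noise.
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)
    X = rng.standard_normal((n, d))
    y = np.maximum(X @ beta, 0.0) + 0.1 * rng.standard_normal(n)
    X_test = rng.standard_normal((n, d))
    y_test = np.maximum(X_test @ beta, 0.0)

    # Random initialization (assumed scales: columns of W0 and the vector a have norm near 1).
    W0 = rng.standard_normal((d, N)) / np.sqrt(d)
    a = rng.standard_normal(N) / np.sqrt(N)

    # Gradient of (1/n) * sum_i (f(x_i) - y_i)^2 with respect to W, by the chain rule.
    pre = X @ W0                                   # pre-activations, shape (n, N)
    resid = np.tanh(pre) @ a / np.sqrt(N) - y      # f(x_i) - y_i, shape (n,)
    dsig = 1.0 - np.tanh(pre) ** 2                 # tanh'(pre)
    G = (2.0 / (n * np.sqrt(N))) * (X.T @ (resid[:, None] * a[None, :] * dsig))

    # One descent step. The sqrt(N) factor is an illustrative scaling choice.
    W1 = W0 - eta * np.sqrt(N) * G

    # Rank-1 "spike": compare the top two singular values of the gradient and
    # measure the overlap of the top left singular vector with the teacher direction.
    U, s, _ = np.linalg.svd(G, full_matrices=False)
    spike_ratio = s[0] / s[1]
    alignment = abs(U[:, 0] @ beta)

    def ridge_risk(W):
        # Ridge regression on the features tanh(X W); returns test mean squared error.
        Phi, Phi_test = np.tanh(X @ W), np.tanh(X_test @ W)
        coef = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(N), Phi.T @ y / n)
        return float(np.mean((Phi_test @ coef - y_test) ** 2))

    return {
        "spike_ratio": float(spike_ratio),
        "alignment": float(alignment),
        "risk_init": ridge_risk(W0),
        "risk_one_step": ridge_risk(W1),
    }


if __name__ == "__main__":
    for eta in (1.0, np.sqrt(200)):  # two illustrative step sizes, small and large
        print(eta, one_step_experiment(eta=eta))
```

Comparing `risk_init` with `risk_one_step` across step sizes gives a rough finite-size view of the two learning-rate scalings the abstract contrasts. The sketch makes no claim about reproducing the paper's asymptotic formulas.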

Comments:	71 pages
Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Cite as:	arXiv:2205.01445 [stat.ML]
	(or arXiv:2205.01445v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2205.01445

Submission history

From: Denny Wu
[v1] Tue, 3 May 2022 12:09:59 UTC (1,403 KB)
