Stacked SVD or SVD stacked? A Random Matrix Theory perspective on data integration

Baharav, Tavor Z.; Nicol, Phillip B.; Irizarry, Rafael A.; Ma, Rong

arXiv:2507.22170 (stat)

[Submitted on 29 Jul 2025]

Title: Stacked SVD or SVD stacked? A Random Matrix Theory perspective on data integration

Authors: Tavor Z. Baharav, Phillip B. Nicol, Rafael A. Irizarry, Rong Ma

Abstract: Modern data analysis increasingly requires identifying shared latent structure across multiple high-dimensional datasets. A commonly used model assumes that the data matrices are noisy observations of low-rank matrices with a shared singular subspace. In this case, two primary methods have emerged for estimating this shared structure, which vary in how they integrate information across datasets. The first approach, termed Stack-SVD, concatenates all the datasets, and then performs a singular value decomposition (SVD). The second approach, termed SVD-Stack, first performs an SVD separately for each dataset, then aggregates the top singular vectors across these datasets, and finally computes a consensus amongst them. While these methods are widely used, they have not been rigorously studied in the proportional asymptotic regime, which is of great practical relevance in today's world of increasing data size and dimensionality. This lack of theoretical understanding has led to uncertainty about which method to choose and limited the ability to fully exploit their potential. To address these challenges, we derive exact expressions for the asymptotic performance and phase transitions of these two methods and develop optimal weighting schemes to further improve both methods. Our analysis reveals that while neither method uniformly dominates the other in the unweighted case, optimally weighted Stack-SVD dominates optimally weighted SVD-Stack. We extend our analysis to accommodate multiple shared components, and provide practical algorithms for estimating optimal weights from data, offering theoretical guidance for method selection in practical data integration problems. Extensive numerical simulations and semi-synthetic experiments on genomic data corroborate our theoretical findings.
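
The two unweighted procedures described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. It assumes, beyond what the abstract states, that each dataset is an n_i x d matrix and that the shared structure lives in the right singular vectors, so that "concatenating" means stacking rows. It also assumes the SVD-Stack consensus step is a second SVD of the pooled top singular vectors. The names `stack_svd`, `svd_stack`, and `simulate`, the rank-one signal-plus-noise model, and the noise scaling are hypothetical choices made for this example. The optimal weighting schemes derived in the paper are not included.

```python
import numpy as np


def stack_svd(datasets, rank=1):
    """Stack-SVD (unweighted): concatenate datasets row-wise, then one SVD.

    Assumes every dataset has the same number of columns d.
    Returns a (d, rank) matrix of estimated shared right singular vectors.
    """
    stacked = np.vstack(datasets)  # shape (sum of n_i, d)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    return vt[:rank].T


def svd_stack(datasets, rank=1):
    """SVD-Stack (unweighted): SVD each dataset, pool top vectors, SVD again.

    The final SVD of the pooled vectors plays the role of the consensus step
    (an assumption of this sketch).
    """
    tops = []
    for x in datasets:
        _, _, vt = np.linalg.svd(x, full_matrices=False)
        tops.append(vt[:rank])  # top right singular vectors of this dataset
    pooled = np.vstack(tops)  # shape (num_datasets * rank, d)
    _, _, vt = np.linalg.svd(pooled, full_matrices=False)
    return vt[:rank].T


def simulate(n_list, d, strengths, seed=0):
    """Hypothetical rank-one model: X_i = theta_i * u_i v^T + Gaussian noise.

    The unit vector v is shared across datasets; each u_i is dataset-specific.
    Noise entries have variance 1/d (an illustrative scaling).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    datasets = []
    for n, theta in zip(n_list, strengths):
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)
        noise = rng.standard_normal((n, d)) / np.sqrt(d)
        datasets.append(theta * np.outer(u, v) + noise)
    return datasets, v


if __name__ == "__main__":
    datasets, v = simulate([300, 400, 500], 200, [4.0, 5.0, 6.0], seed=0)
    for name, estimator in [("Stack-SVD", stack_svd), ("SVD-Stack", svd_stack)]:
        v_hat = estimator(datasets)[:, 0]
        # Singular vectors are defined up to sign, so compare absolute overlap.
        print(name, "overlap with true v:", abs(float(v_hat @ v)))
```

In this sketch the estimators are compared through the absolute inner product between the estimated and true shared vector. Varying `strengths` and the ratios n_i / d is one way to explore numerically the regimes in which one method outperforms the other.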

Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
MSC classes:	15A18, 62H25
Cite as:	arXiv:2507.22170 [stat.ML]
 	(or arXiv:2507.22170v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.2507.22170

Submission history

From: Tavor Baharav
[v1] Tue, 29 Jul 2025 19:03:01 UTC (908 KB)
