Tensor Topic Modeling Via HOSVD

Liu, Yating; Donnat, Claire

arXiv:2501.00535 (math)

[Submitted on 31 Dec 2024]

Title: Tensor Topic Modeling Via HOSVD

Authors: Yating Liu, Claire Donnat

Abstract: By representing documents as mixtures of topics, topic modeling has enabled the successful analysis of datasets across a wide range of applications, from ecology to genetics. An important body of recent work has demonstrated the computational and statistical efficiency of probabilistic Latent Semantic Indexing (pLSI), a type of topic modeling, in estimating both the topic matrix (corresponding to distributions over word frequencies) and the topic assignment matrix. However, these methods are not easily extended to incorporate additional temporal, spatial, or document-specific information, and thus may neglect useful information when analyzing spatial or longitudinal datasets that can be represented as tensors. In this paper, we therefore propose using a modified higher-order singular value decomposition (HOSVD) to estimate topic models based on a Tucker decomposition, thus accommodating the complexity of tensor data. Our method exploits the strength of tensor decomposition in reducing data to lower-dimensional spaces and successfully recovers lower-rank topic and cluster structures, as well as a core tensor that highlights interactions among latent factors. We further characterize explicitly the convergence rate of our method in entry-wise $\ell_1$ norm. Experiments on synthetic data demonstrate the statistical efficiency of our method and its ability to better capture patterns across multiple dimensions. Our approach also performs well when applied to large datasets of research abstracts and to the analysis of vaginal microbiome data.
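
For readers unfamiliar with the building blocks named in the abstract, the following is a minimal sketch of a plain truncated HOSVD, which yields a Tucker-form approximation $\mathcal{X} \approx \mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3$ with core tensor $\mathcal{G}$ and one factor matrix per mode. It is not the modified HOSVD proposed by the authors, whose details are not given in this listing. All function names (`unfold`, `mode_product`, `truncated_hosvd`, `reconstruct`), the tensor shape, and the ranks are assumptions chosen for illustration, and the synthetic tensor is random rather than a real document corpus.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n matricization: rows index `mode`, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Mode-n product: multiply `tensor` by `matrix` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)


def truncated_hosvd(tensor, ranks):
    """Plain truncated HOSVD (illustrative, not the paper's modified variant)."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of each unfolding form the factor matrix.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    # Project the tensor onto the factor subspaces to obtain the core tensor.
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Rebuild the Tucker approximation from the core and factor matrices."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out


# Hypothetical example: a 6 x 5 x 4 tensor built to have multilinear rank (2, 3, 2).
rng = np.random.default_rng(0)
true_core = rng.random((2, 3, 2))
true_factors = [rng.random((6, 2)), rng.random((5, 3)), rng.random((4, 2))]
X = reconstruct(true_core, true_factors)

core, factors = truncated_hosvd(X, ranks=(2, 3, 2))
X_hat = reconstruct(core, factors)
```

Because the synthetic tensor has exact multilinear rank (2, 3, 2), truncating at those ranks reproduces it up to floating-point error. The factors returned here are orthonormal bases and are not themselves probability distributions, so an additional step would be needed to turn them into topic estimates.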

Subjects:	Statistics Theory (math.ST); Applications (stat.AP)
Cite as:	arXiv:2501.00535 [math.ST]
	(or arXiv:2501.00535v1 [math.ST] for this version)
	https://doi.org/10.48550/arXiv.2501.00535

Submission history

From: Yating Liu
[v1] Tue, 31 Dec 2024 16:40:04 UTC (12,960 KB)
