Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics

Crook, Oliver M.; Gatto, Laurent; Kirk, Paul D. W.

arXiv:1810.05450 (stat)

[Submitted on 12 Oct 2018]

Abstract: The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang and Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5,157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel
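
To give a feel for the sequential, greedy allocation idea behind SUGS, the following is a minimal Python sketch and not the authors' implementation or the sugsvarsel API. It assumes a univariate Gaussian likelihood with known variance `sigma2` and a conjugate normal prior `N(mu0, tau2)` on each cluster mean, and it omits variable selection, Bayesian model averaging, and the search over observation orderings. The function name `sugs_like_cluster` and all parameter names and defaults are hypothetical.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def sugs_like_cluster(xs, alpha=1.0, mu0=0.0, tau2=10.0, sigma2=1.0):
    """Greedy one-pass allocation under a DP mixture (illustrative only).

    Assumptions (not taken from the paper): known observation variance
    sigma2, N(mu0, tau2) prior on each cluster mean, concentration alpha.
    """
    counts, sums, labels = [], [], []
    for x in xs:
        scores = []
        # Existing clusters: log CRP weight plus log posterior predictive.
        for n_h, s_h in zip(counts, sums):
            post_var = 1.0 / (1.0 / tau2 + n_h / sigma2)
            post_mean = post_var * (mu0 / tau2 + s_h / sigma2)
            scores.append(
                math.log(n_h) + normal_logpdf(x, post_mean, post_var + sigma2)
            )
        # New cluster: log alpha plus log prior predictive.
        scores.append(math.log(alpha) + normal_logpdf(x, mu0, tau2 + sigma2))
        # The shared normalising constant does not affect the argmax.
        h = max(range(len(scores)), key=scores.__getitem__)
        if h == len(counts):
            counts.append(0)
            sums.append(0.0)
        # Update sufficient statistics of the chosen cluster.
        counts[h] += 1
        sums[h] += x
        labels.append(h)
    return labels


if __name__ == "__main__":
    print(sugs_like_cluster([0.1, -0.2, 8.0, 0.0, 8.3, 7.9]))
```

Each observation is assigned once, to whichever existing or new cluster has the highest posterior allocation probability given the assignments made so far, so no Markov chain Monte Carlo sampling is needed. The result depends on the order in which the observations are presented.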

Comments:	27 pages, 8 figures. Associated R package available from https://github.com/ococrook/sugsvarsel
Subjects:	Methodology (stat.ME); Genomics (q-bio.GN); Applications (stat.AP)
Cite as:	arXiv:1810.05450 [stat.ME]
	(or arXiv:1810.05450v1 [stat.ME] for this version)
	https://doi.org/10.48550/arXiv.1810.05450

Submission history

From: Paul Kirk
[v1] Fri, 12 Oct 2018 11:17:17 UTC (5,030 KB)
