Bayesian Additive Regression Trees using Bayesian Model Averaging

Hernández, Belinda; Raftery, Adrian E.; Pennington, Stephen R.; Parnell, Andrew C.


arXiv:1507.00181v2 (stat)

[Submitted on 1 Jul 2015 (v1), last revised 8 Jul 2015 (this version, v2)]


Authors: Belinda Hernández, Adrian E. Raftery, Stephen R. Pennington, Andrew C. Parnell

Abstract: Bayesian Additive Regression Trees (BART) is a statistical sum-of-trees model. It can be considered a Bayesian version of machine learning tree ensemble methods in which the individual trees are the base learners. However, for data sets where the number of variables $p$ is large (e.g. $p>5,000$), the algorithm can become computationally prohibitive. Another method that is popular for high-dimensional data is random forests, a machine learning algorithm that grows trees using a greedy search for the best split points. However, as it is not a statistical model, it cannot produce probabilistic estimates or predictions. We propose an alternative algorithm for BART called BART-BMA, which uses Bayesian Model Averaging and a greedy search algorithm to produce a model that is much more efficient than BART for data sets with large $p$. BART-BMA incorporates elements of both BART and random forests to offer a model-based algorithm that can deal with high-dimensional data. We have found that BART-BMA can be run in a reasonable time on a standard laptop for the "small $n$ large $p$" scenario, which is common in many areas of bioinformatics. We showcase this method using simulated data and data from two real proteomic experiments: one to distinguish between patients with cardiovascular disease and controls, and another to classify aggressive versus non-aggressive prostate cancer. We compare our results with those of the main competing methods. Open source code written in R and Rcpp to run BART-BMA can be found at: https://github.com/BelindaHernandez/BART-BMA.git

Subjects:	Computation (stat.CO); Methodology (stat.ME)
Cite as:	arXiv:1507.00181 [stat.CO]
	(or arXiv:1507.00181v2 [stat.CO] for this version)
	https://doi.org/10.48550/arXiv.1507.00181

Submission history

From: Belinda Hernandez
[v1] Wed, 1 Jul 2015 10:58:46 UTC (152 KB)
[v2] Wed, 8 Jul 2015 14:08:27 UTC (152 KB)
