Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance

Painsky, Amichai; Rosset, Saharon

arXiv:1512.03444v1 (stat)

[Submitted on 10 Dec 2015]

Authors: Amichai Painsky, Saharon Rosset

Abstract: Recursive partitioning approaches producing tree-like models are a long-standing staple of predictive modeling, in the last decade mostly as "sub-learners" within state-of-the-art ensemble methods like Boosting and Random Forest. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods' inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. Such variables can often be very informative, but current tree methods essentially leave us a choice of either not using them, or exposing our models to severe overfitting. We propose a conceptual framework for splitting that uses leave-one-out (LOO) cross-validation to select the splitting variable, then performs a regular split (in our case, following CART's approach) on the selected variable. The most important consequence of our approach is that categorical variables with many categories can be safely used in tree building and are only chosen if they contribute to predictive power. We demonstrate in extensive simulation and real data analysis that our novel splitting approach significantly improves the performance of both single tree models and ensemble methods that utilize trees. Importantly, we design an algorithm for LOO splitting variable selection which, under reasonable assumptions, does not increase the overall computational complexity compared to CART for two-class classification. For regression tasks, our approach carries an increased computational burden, replacing an O(log(n)) factor in the CART splitting rule search with an O(n) term.
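The idea in the abstract, choosing the splitting variable by LOO error rather than by training error, can be illustrated with a small sketch. This is not the authors' algorithm. It is a naive version for regression with a squared-error criterion that refits the split for every held-out observation, so it has none of the efficiency the abstract describes. The function names (`fit_split`, `loo_error`, `select_variable`), the toy data, and the rule that an unseen category is predicted by the mean of the remaining responses are all assumptions made for illustration.

```python
def avg(ys):
    return sum(ys) / len(ys)

def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = avg(ys)
    return sum((y - m) ** 2 for y in ys)

def fit_split(xs, ys, categorical):
    """Fit the best single binary split on one variable (squared error).
    Returns a function mapping a value x to a predicted response."""
    overall = avg(ys)
    rank = None
    if categorical:
        groups = {}
        for x, y in zip(xs, ys):
            groups.setdefault(x, []).append(y)
        # Order categories by mean response, then split along that order.
        order = sorted(groups, key=lambda c: avg(groups[c]))
        rank = {c: r for r, c in enumerate(order)}
        keys = [rank[x] for x in xs]
    else:
        keys = list(xs)
    pairs = sorted(zip(keys, ys))
    best = None  # (cost, threshold, left_mean, right_mean)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            thr = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cost, thr, avg(left), avg(right))

    def predict(x):
        if best is None:
            return overall
        if categorical:
            if x not in rank:
                return overall  # assumption: unseen category -> overall mean
            x = rank[x]
        return best[2] if x < best[1] else best[3]

    return predict

def training_error(xs, ys, categorical):
    """Mean squared error of the best split, measured on the data it was fit to."""
    predict = fit_split(xs, ys, categorical)
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

def loo_error(xs, ys, categorical):
    """Naive LOO: refit the split without observation i, then predict it."""
    total = 0.0
    for i in range(len(ys)):
        predict = fit_split(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], categorical)
        total += (ys[i] - predict(xs[i])) ** 2
    return total / len(ys)

def select_variable(columns, ys, criterion):
    """Return the name of the variable with the lowest criterion value."""
    return min(columns, key=lambda name: criterion(columns[name][0], ys, columns[name][1]))

# Toy data: x_num carries the signal (with one noisy response at index 2);
# "id" is a categorical variable with one category per observation.
y = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
columns = {
    "x_num": (list(range(12)), False),
    "id": (["u%d" % i for i in range(12)], True),
}
by_training = select_variable(columns, y, training_error)
by_loo = select_variable(columns, y, loo_error)
print(by_training, by_loo)  # training error picks "id"; LOO picks "x_num"
```

On this toy data the one-category-per-observation variable fits the training responses perfectly, so a training-error criterion selects it, while under LOO each held-out category is unseen and the variable does no better than a constant predictor. Once the variable is selected, the actual split would be fit on all observations in the usual CART manner.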

Subjects:	Machine Learning (stat.ML)
Cite as:	arXiv:1512.03444 [stat.ML]
	(or arXiv:1512.03444v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.1512.03444

Submission history

From: Amichai Painsky
[v1] Thu, 10 Dec 2015 21:20:14 UTC (928 KB)
