Regularizing Model Complexity and Label Structure for Multi-Label Text Classification

Wang, Bingyu; Li, Cheng; Pavlu, Virgil; Aslam, Javed

arXiv:1705.00740 (stat)

[Submitted on 1 May 2017]

Title: Regularizing Model Complexity and Label Structure for Multi-Label Text Classification

Authors: Bingyu Wang, Cheng Li, Virgil Pavlu, Javed Aslam

Abstract: Multi-label text classification is a popular machine learning task in which each document is assigned multiple relevant labels. The task is challenging because of high-dimensional features and correlated labels. Multi-label text classifiers must be carefully regularized to prevent severe overfitting in the high-dimensional space, and must also take label dependencies into account in order to make accurate predictions under uncertainty. We demonstrate significant and practical improvements by carefully regularizing model complexity during the training phase and regularizing the label search space during the prediction phase. Specifically, we regularize classifier training with an Elastic-net (L1+L2) penalty to reduce model complexity/size, and employ early stopping to prevent overfitting. At prediction time, we apply support inference to restrict the search space to label sets encountered in the training set, and the F-optimizer GFM to make optimal predictions for the F1 metric. We show that although support inference only provides density estimates on existing label combinations, when combined with the GFM predictor the algorithm can output unseen label combinations. Taken together, our experiments show state-of-the-art results on many benchmark datasets. Beyond performance and practical contributions, we make some interesting observations. Contrary to the prior belief that support inference is purely an approximate inference procedure, we show that it acts as a strong regularizer on the label prediction structure. It allows the classifier to take label dependencies into account during prediction even if the classifiers did not model any label dependencies during training.
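The prediction-time pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the per-label probabilities come from independent binary classifiers (an assumption made here for illustration), and the names `support_inference` and `gfm_predict` are hypothetical. The first function renormalizes the model's probability over the label sets seen in training. The second applies a GFM-style rule: for each prediction size k it scores label i by the sum over s of P(y_i = 1, |y| = s) / (s + k), keeps the k highest-scoring labels, and returns the size with the largest expected F1.

```python
def support_inference(marginals, support):
    """Renormalize an independent-label model over the training label sets.

    marginals: P(y_i = 1 | x) for each label (assumed to come from
               independent binary classifiers; illustrative assumption).
    support:   list of 0/1 tuples, the label sets seen in training.
    Returns a dict mapping each supported label set to its probability.
    """
    scores = []
    for y in support:
        p = 1.0
        for yi, mi in zip(y, marginals):
            p *= mi if yi else (1.0 - mi)
        scores.append(p)
    total = sum(scores)  # assumed > 0 in this sketch
    return {y: s / total for y, s in zip(support, scores)}


def gfm_predict(dist, num_labels):
    """Pick the label set maximizing expected F1 under `dist`.

    dist: dict mapping 0/1 tuples to probabilities (sums to 1).
    Returns (predicted 0/1 tuple, expected F1 of that prediction).
    """
    m = num_labels
    # p[i][s] = P(y_i = 1 and the true label set has exactly s labels)
    p = [[0.0] * (m + 1) for _ in range(m)]
    p_empty = 0.0  # P(true label set is empty)
    for y, prob in dist.items():
        s = sum(y)
        if s == 0:
            p_empty += prob
            continue
        for i, yi in enumerate(y):
            if yi:
                p[i][s] += prob

    # Predicting the empty set scores F1 = 1 only when the truth is empty.
    best_set, best_value = (0,) * m, p_empty
    for k in range(1, m + 1):
        delta = [sum(p[i][s] / (s + k) for s in range(1, m + 1))
                 for i in range(m)]
        top = sorted(range(m), key=lambda i: delta[i], reverse=True)[:k]
        value = 2.0 * sum(delta[i] for i in top)
        if value > best_value:
            chosen = set(top)
            best_set = tuple(1 if i in chosen else 0 for i in range(m))
            best_value = value
    return best_set, best_value


# Toy example: three labels, and only single-label sets seen in training.
support = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
dist = support_inference([0.5, 0.5, 0.1], support)
prediction, expected_f1 = gfm_predict(dist, num_labels=3)
print(prediction, round(expected_f1, 4))
```

In this toy case the first two labels are equally likely and the support contains only single-label sets, yet the F1-optimal prediction is (1, 1, 0), a combination that never occurs in the support. This matches the abstract's remark that support inference combined with an F1-optimal predictor can output unseen label combinations.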

Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
Cite as:	arXiv:1705.00740 [stat.ML]
	(or arXiv:1705.00740v1 [stat.ML] for this version)
	https://doi.org/10.48550/arXiv.1705.00740

Submission history

From: Cheng Li
[v1] Mon, 1 May 2017 23:30:13 UTC (3,428 KB)
