Regularized k-POD: Sparse k-means clustering for high-dimensional missing data

Guan, Xin; Terada, Yoshikazu

Statistics > Methodology

arXiv:2507.11884v1 (stat)

[Submitted on 16 Jul 2025]

Title: Regularized k-POD: Sparse k-means clustering for high-dimensional missing data

Authors: Xin Guan, Yoshikazu Terada

Abstract: Classical k-means clustering relies on distances computed from all data features and therefore cannot be applied directly to incomplete data with missing values. k-POD, a natural extension of k-means to missing data, clusters using only the observed entries and is both computationally efficient and flexible. However, when high-dimensional missing data include features irrelevant to the underlying cluster structure, these irrelevant features bias k-POD's estimates of the cluster centers and thereby degrade its clustering performance. The existing k-POD method nevertheless performs well in low-dimensional cases, which highlights the importance of addressing this bias. To this end, we propose a regularized k-POD clustering method that incorporates feature-wise regularization of the cluster centers into the existing k-POD procedure. This penalty on the cluster centers effectively reduces the bias of k-POD for high-dimensional missing data. To the best of our knowledge, our method is the first to mitigate bias in k-means-type clustering for high-dimensional missing data while retaining computational efficiency and flexibility. Simulation results verify that the proposed method effectively reduces bias and improves clustering performance. Applications to real-world single-cell RNA sequencing data further demonstrate its utility.
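The abstract describes the idea only in words: cluster using the observed entries, and penalize the cluster centers feature by feature. The following is a minimal sketch of how such a procedure might look, not the authors' algorithm. It assumes the usual k-POD-style loop (fill missing entries from the current centers, then take a k-means step) and, purely for illustration, a size-weighted group-lasso penalty on each feature's column of centers, chosen because it gives a closed-form shrinkage step. The function name `regularized_kpod_sketch`, the parameter `lam`, and the penalty form are all hypothetical; the paper's actual penalty and updates may differ.

```python
import numpy as np

def regularized_kpod_sketch(X, k, lam=0.0, n_iter=50, seed=0):
    """Illustrative k-POD-style clustering with feature-wise center shrinkage.

    X   : (n, p) array with np.nan marking missing entries. Features are
          assumed to be centered, so shrinking a center toward zero means
          "this feature carries no cluster signal". Each feature is assumed
          to have at least one observed entry.
    lam : strength of the ASSUMED penalty
          lam * sum_j sqrt(sum_c n_c * centers[c, j]**2).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    n, p = X.shape

    # Initial fill: replace missing entries by observed feature means.
    X_fill = np.where(observed, X, np.nanmean(X, axis=0))
    centers = X_fill[rng.choice(n, size=k, replace=False)].copy()
    labels = np.zeros(n, dtype=int)

    for _ in range(n_iter):
        # Assignment step on the completed data.
        d2 = ((X_fill[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)

        # Unpenalized center update: cluster means (empty clusters keep
        # their previous center).
        counts = np.bincount(labels, minlength=k).astype(float)
        for c in range(k):
            if counts[c] > 0:
                centers[c] = X_fill[labels == c].mean(axis=0)

        # Feature-wise group soft-thresholding: one shrinkage factor per
        # feature, shared by all clusters.
        norms = np.sqrt((counts[:, None] * centers ** 2).sum(axis=0))
        scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
        centers = centers * scale

        # k-POD-style imputation: missing entries take the value of the
        # assigned cluster center; observed entries are never altered.
        X_fill = np.where(observed, X, centers[labels])

    return labels, centers
```

With `lam=0` the shrinkage factor is 1 for every feature and the loop reduces to an unpenalized k-POD-style iteration (using a single k-means step per imputation). Increasing `lam` drives the centers of weak-signal features exactly to zero, which is one way the bias from irrelevant features could be suppressed under the assumptions above.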

Subjects:	Methodology (stat.ME)
Cite as:	arXiv:2507.11884 [stat.ME]
	(or arXiv:2507.11884v1 [stat.ME] for this version)
	https://doi.org/10.48550/arXiv.2507.11884

Submission history

From: Xin Guan
[v1] Wed, 16 Jul 2025 03:50:20 UTC (1,982 KB)
