Computer Science > Data Structures and Algorithms

arXiv:2106.16147v2 (cs)
[Submitted on 30 Jun 2021 (v1), last revised 24 Oct 2021 (this version, v2)]

Title: Nearly-Tight and Oblivious Algorithms for Explainable Clustering

Authors: Buddhima Gamlath, Xinrui Jia, Adam Polak, Ola Svensson
Abstract: We study the problem of explainable clustering in the setting first formalized by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). A $k$-clustering is said to be explainable if it is given by a decision tree where each internal node splits data points with a threshold cut in a single dimension (feature), and each of the $k$ leaves corresponds to a cluster. We give an algorithm that outputs an explainable clustering that loses at most a factor of $O(\log^2 k)$ compared to an optimal (not necessarily explainable) clustering for the $k$-medians objective, and a factor of $O(k \log^2 k)$ for the $k$-means objective. This improves over the previous best upper bounds of $O(k)$ and $O(k^2)$, respectively, and nearly matches the previous $\Omega(\log k)$ lower bound for $k$-medians and our new $\Omega(k)$ lower bound for $k$-means. The algorithm is remarkably simple. In particular, given an initial not necessarily explainable clustering in $\mathbb{R}^d$, it is oblivious to the data points and runs in time $O(dk \log^2 k)$, independent of the number of data points $n$. Our upper and lower bounds also generalize to objectives given by higher $\ell_p$-norms.
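
To make the tree structure concrete, here is a minimal Python sketch of an explainable clustering of the kind the abstract describes: given $k$ reference centers in $\mathbb{R}^d$, axis-aligned threshold cuts are applied recursively until each leaf isolates one center, so the tree has $k$ leaves and $k-1$ internal nodes and is built without ever reading the $n$ data points. This is an illustration only, not the paper's algorithm: the median cut below is a simplifying assumption, whereas the paper samples randomized cuts chosen so that the resulting clustering loses only an $O(\log^2 k)$ factor for $k$-medians. The names build_tree and assign are hypothetical.

```python
import numpy as np

def build_tree(centers, ids=None):
    """Recursively separate the k reference centers with axis-aligned
    threshold cuts until each leaf contains exactly one center.
    Built from the centers alone -- oblivious to the n data points."""
    if ids is None:
        ids = np.arange(len(centers))
    if len(ids) == 1:
        return {"cluster": int(ids[0])}          # leaf: one center left
    sub = centers[ids]
    for dim in range(centers.shape[1]):
        vals = np.unique(sub[:, dim])            # sorted distinct values
        if len(vals) > 1:
            # Median cut, purely for illustration; the paper instead
            # draws random cuts to bound the cost blow-up.
            theta = (vals[len(vals)//2 - 1] + vals[len(vals)//2]) / 2.0
            go_left = sub[:, dim] <= theta       # at least one center per side
            return {"dim": dim, "theta": theta,
                    "left": build_tree(centers, ids[go_left]),
                    "right": build_tree(centers, ids[~go_left])}
    raise ValueError("duplicate centers cannot be separated")

def assign(tree, x):
    """Route one point down the tree: one threshold test per node."""
    while "cluster" not in tree:
        tree = tree["left"] if x[tree["dim"]] <= tree["theta"] else tree["right"]
    return tree["cluster"]

# Example: three centers in the plane; a nearby point is routed to center 1.
centers = np.array([[0., 0.], [1., 0.], [0., 2.]])
tree = build_tree(centers)
print(assign(tree, np.array([0.9, 0.1])))  # -> 1
```

Assigning a point thus takes one coordinate comparison per internal node, and each of the $k$ clusters is described by a conjunction of threshold conditions on single features, which is the sense in which the clustering is explainable.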
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Cite as: arXiv:2106.16147 [cs.DS]
  (or arXiv:2106.16147v2 [cs.DS] for this version)
  https://doi.org/10.48550/arXiv.2106.16147
arXiv-issued DOI via DataCite

Submission history

From: Adam Polak
[v1] Wed, 30 Jun 2021 15:49:41 UTC (40 KB)
[v2] Sun, 24 Oct 2021 22:45:48 UTC (42 KB)