One Permutation Hashing for Efficient Search and Learning

Li, Ping; Owen, Art; Zhang, Cun-Hui

计算机科学 > 机器学习

arXiv:1208.1259 (cs)

[提交于 2012年8月6日 ]

标题：一种排列哈希用于高效搜索和学习

标题： One Permutation Hashing for Efficient Search and Learning

Authors:Ping Li, Art Owen, Cun-Hui Zhang

摘要：最近，b位最小值哈希方法已被应用于大规模线性学习和子线性时间近邻搜索。最小值哈希的主要缺点是昂贵的预处理成本，因为该方法需要在数据上应用（例如）k=200到500次排列。如果新的数据点（例如，新的文档或图像）未被处理，则测试时间也可能很昂贵，这在面向用户的应用程序中可能是一个重大问题。我们基于单排列哈希开发了一个非常简单的解决方案。从概念上讲，给定一个巨大的二进制数据矩阵，我们只对列进行一次排列，并将排列后的列均匀地分成k个桶；对于每个数据向量，我们简单地存储每个桶中最小的非零位置。有趣的概率分析（通过实验验证）表明，我们的单排列方案应该与原始的（k-排列）最小值哈希表现得非常相似。事实上，由于“无放回抽样”的效果，单排列方案甚至可以稍微更准确一些。我们在webspam数据集上使用线性SVM和支持向量机逻辑回归的实验表明，这个单排列哈希方案可以实现与原始k-排列方案相同（甚至略微更好）的准确性。为了测试我们方法的鲁棒性，我们还使用了非常稀疏的小型news20数据集，每个数据向量平均仅有大约500个非零元素。有趣的是，在news20数据集上，当k不太小时，我们的单排列方案明显优于k-排列方案。总之，我们的方法可以在仅为原始预处理成本的1/k的情况下实现至少与原始k-排列方案相同的精度。

摘要： Recently, the method of b-bit minwise hashing has been applied to large-scale linear learning and sublinear time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing cost, as the method requires applying (e.g.,) k=200 to 500 permutations on the data. The testing time can also be expensive if a new data point (e.g., a new document or image) has not been processed, which might be a significant issue in user-facing applications. We develop a very simple solution based on one permutation hashing. Conceptually, given a massive binary data matrix, we permute the columns only once and divide the permuted columns evenly into k bins; and we simply store, for each data vector, the smallest nonzero location in each bin. The interesting probability analysis (which is validated by experiments) reveals that our one permutation scheme should perform very similarly to the original (k-permutation) minwise hashing. In fact, the one permutation scheme can be even slightly more accurate, due to the "sample-without-replacement" effect. Our experiments with training linear SVM and logistic regression on the webspam dataset demonstrate that this one permutation hashing scheme can achieve the same (or even slightly better) accuracies compared to the original k-permutation scheme. To test the robustness of our method, we also experiment with the small news20 dataset which is very sparse and has merely on average 500 nonzeros in each data vector. Interestingly, our one permutation scheme noticeably outperforms the k-permutation scheme when k is not too small on the news20 dataset. In summary, our method can achieve at least the same accuracy as the original k-permutation scheme, at merely 1/k of the original preprocessing cost.

主题：	机器学习 (cs.LG) ; 信息检索 (cs.IR); 信息论 (cs.IT); 计算 (stat.CO); 机器学习 (stat.ML)
引用方式：	arXiv:1208.1259 [cs.LG]
	(或者 arXiv:1208.1259v1 [cs.LG] 对于此版本)
	https://doi.org/10.48550/arXiv.1208.1259

提交历史

来自： Ping Li [查看电子邮件]
[v1] 星期一， 2012 年 8 月 6 日 12:28:06 UTC (194 KB)

计算机科学 > 机器学习

标题：一种排列哈希用于高效搜索和学习

标题： One Permutation Hashing for Efficient Search and Learning

提交历史

获取论文：

参考文献与引用

收藏

文献和引用工具

与本文相关的代码，数据和媒体

演示

推荐器和搜索工具

arXivLabs：与社区合作伙伴的实验项目

计算机科学 > 机器学习

标题： 一种排列哈希用于高效搜索和学习 显示英文标题

标题： One Permutation Hashing for Efficient Search and Learning

提交历史

获取论文：

参考文献与引用

BibTeX 格式的引用

收藏

文献和引用工具

与本文相关的代码，数据和媒体

演示

推荐器和搜索工具

arXivLabs：与社区合作伙伴的实验项目

标题：一种排列哈希用于高效搜索和学习