Sampling-Based Estimation of Jaccard Containment and Similarity

Joshi, Pranav

Statistics > Computation

arXiv:2507.10019 (stat)

[Submitted on 14 Jul 2025 (v1), last revised 17 Jul 2025 (this version, v2)]

Title: Sampling-Based Estimation of Jaccard Containment and Similarity

Authors: Pranav Joshi

Abstract: This paper addresses the problem of estimating the containment and similarity between two sets using only a random sample from each set, without relying on sketches or full data access. The study introduces a binomial model for predicting the overlap between samples and demonstrates that it is both accurate and practical when sample sizes are small compared to the original sets. The paper compares this model to previous approaches and shows that it provides better estimates under the considered conditions. It also analyzes the statistical properties of the estimator, including error bounds and the sample sizes needed to achieve a desired level of accuracy and confidence. The framework is extended to estimate set similarity, and the paper provides guidance for applying these methods in large-scale data systems where only partial or sampled data is available.
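The abstract does not state the estimator itself, so the following is only a minimal sketch of how a binomial overlap model could be turned into an estimate. It assumes uniform samples drawn without replacement, a known size for set B, and the modelling assumption that each element of the sample of A lands in the sample of B with probability C · n_B / |B|, where C = |A ∩ B| / |A| is the containment of A in B. The function names `estimate_containment` and `jaccard_from_containment` are hypothetical and are not taken from the paper.

```python
import random


def estimate_containment(sample_a, sample_b, size_b):
    """Estimate C = |A ∩ B| / |A| from uniform samples of A and B.

    Assumption (not taken from the paper): the overlap count is modelled as
    Binomial(n_a, C * n_b / size_b), where n_a and n_b are the sample sizes.
    """
    sample_a, sample_b = set(sample_a), set(sample_b)
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    overlap = len(sample_a & sample_b)
    # Method-of-moments inversion of E[overlap] = C * n_a * n_b / size_b.
    # The result is not clipped, so it can exceed 1 for small samples.
    return overlap * size_b / (len(sample_a) * len(sample_b))


def jaccard_from_containment(containment, size_a, size_b):
    """Convert containment of A in B to Jaccard similarity, given both set sizes."""
    intersection = containment * size_a
    return intersection / (size_a + size_b - intersection)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    a = range(0, 100_000)          # synthetic set A
    b = range(50_000, 150_000)     # synthetic set B; true containment is 0.5
    s_a = rng.sample(a, 5_000)
    s_b = rng.sample(b, 5_000)
    c_hat = estimate_containment(s_a, s_b, size_b=len(b))
    print(c_hat, jaccard_from_containment(c_hat, len(a), len(b)))
```

Under these assumptions the expected overlap is small (about 125 elements in the synthetic run above), so individual estimates can vary noticeably. That variance is what the error bounds and sample size requirements mentioned in the abstract would quantify.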

Subjects:	Computation (stat.CO); Databases (cs.DB); Machine Learning (stat.ML)
Cite as:	arXiv:2507.10019 [stat.CO]
	(or arXiv:2507.10019v2 [stat.CO] for this version)
	https://doi.org/10.48550/arXiv.2507.10019

Submission history

From: Pranav Joshi
[v1] Mon, 14 Jul 2025 07:56:29 UTC (507 KB)
[v2] Thu, 17 Jul 2025 06:08:24 UTC (546 KB)
