Efficient Direct-Access Ranked Retrieval

Dehghankar, Mohsen; Mittal, Raghav; Shetiya, Suraj; Asudeh, Abolfazl; Das, Gautam

arXiv:2508.01108 (cs)

[Submitted on 1 Aug 2025]

Title: Efficient Direct-Access Ranked Retrieval

Authors: Mohsen Dehghankar, Raghav Mittal, Suraj Shetiya, Abolfazl Asudeh, Gautam Das

Abstract: We study the problem of Direct-Access Ranked Retrieval (DAR) for interactive data tooling, where evolving data exploration practices, combined with large-scale and high-dimensional datasets, create new challenges. DAR concerns enabling efficient access to arbitrary rank positions according to a ranking function, without enumerating all preceding tuples. To address this need, we formalize the DAR problem and propose a theoretically efficient algorithm based on geometric arrangements, achieving logarithmic query time. However, this method suffers from exponential space complexity in high dimensions. Therefore, we develop a second class of algorithms based on $\varepsilon$-sampling, which consume linear space. Since exactly locating the tuple at a specific rank is challenging due to its connection to the range counting problem, we introduce a relaxed variant called Conformal Set Ranked Retrieval (CSR), which returns a small subset guaranteed to contain the target tuple. To solve the CSR problem efficiently, we define an intermediate problem, Stripe Range Retrieval (SRR), and design a hierarchical sampling data structure tailored for narrow-range queries. Our method achieves practical scalability in both data size and dimensionality. We prove near-optimal bounds on the efficiency of our algorithms and validate their performance through extensive experiments on real and synthetic datasets, demonstrating scalability to millions of tuples and hundreds of dimensions.
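To make the DAR and CSR problem statements concrete, the following is a minimal sketch of naive baselines, not the algorithms proposed in the paper. It assumes a linear ranking function (a weight vector applied to each tuple) and ranks tuples by descending score; the function names `dar_baseline` and `csr_baseline` and the `slack` parameter are hypothetical, introduced only for illustration. Both baselines scan every tuple on each query, which is exactly the cost that the indexing structures described in the abstract aim to avoid.

```python
import heapq
from typing import List, Sequence


def score(t: Sequence[float], w: Sequence[float]) -> float:
    """Linear ranking function (an assumption): dot product of tuple and weights."""
    return sum(ti * wi for ti, wi in zip(t, w))


def dar_baseline(data: Sequence[Sequence[float]],
                 weights: Sequence[float], k: int) -> int:
    """Return the index of the tuple at rank k (1-based, highest score first).

    Naive baseline: selects the top k tuples with a heap instead of
    sorting everything, but still touches all n tuples per query.
    """
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must be in [1, n]")
    top_k = heapq.nlargest(k, range(n), key=lambda i: score(data[i], weights))
    return top_k[-1]  # last of the top k is the rank-k tuple


def csr_baseline(data: Sequence[Sequence[float]],
                 weights: Sequence[float], k: int, slack: int) -> List[int]:
    """Return indices of tuples whose rank lies in [k - slack, k + slack].

    Illustrates the CSR relaxation: the result is a small set that is
    guaranteed to contain the rank-k tuple. `slack` is a hypothetical knob.
    """
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must be in [1, n]")
    lo = max(1, k - slack)
    hi = min(n, k + slack)
    top_hi = heapq.nlargest(hi, range(n), key=lambda i: score(data[i], weights))
    return sorted(top_hi[lo - 1:])  # ranks lo..hi, reported as sorted indices


if __name__ == "__main__":
    tuples = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
    w = (1.0, 2.0)
    print(dar_baseline(tuples, w, 2))
    print(csr_baseline(tuples, w, 2, 1))
```

In this sketch the CSR result always contains the index returned by the DAR baseline for the same `k`, which mirrors the guarantee stated in the abstract; ties in score are broken arbitrarily here.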

Subjects:	Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Databases (cs.DB)
Cite as:	arXiv:2508.01108 [cs.DS]
	(or arXiv:2508.01108v1 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.2508.01108

Submission history

From: Mohsen Dehghankar
[v1] Fri, 1 Aug 2025 23:03:42 UTC (3,980 KB)
