Space-Efficient k-Mismatch Text Indexes

Kociumaka, Tomasz; Radoszewski, Jakub

arXiv:2510.26264 (cs)

[Submitted on 30 Oct 2025]

Title: Space-Efficient k-Mismatch Text Indexes

Authors: Tomasz Kociumaka, Jakub Radoszewski

Abstract: A central task in string processing is text indexing, where the goal is to preprocess a text (a string of length $n$) into an efficient index (a data structure) supporting queries about the text. Cole, Gottlieb, and Lewenstein (STOC 2004) proposed $k$-errata trees, a family of text indexes supporting approximate pattern matching queries of several types. In particular, $k$-errata trees yield an elegant solution to $k$-mismatch queries, where we are to report all substrings of the text with Hamming distance at most $k$ to the query pattern. The resulting $k$-mismatch index uses $O(n\log^k n)$ space and answers a query for a length-$m$ pattern in $O(\log^k n \log \log n + m + occ)$ time, where $occ$ is the number of approximate occurrences. In retrospect, $k$-errata trees appear very well optimized: even though a large body of work has adapted $k$-errata trees to various settings throughout the past two decades, the original time-space trade-off for $k$-mismatch indexing has not been improved in the general case. We present the first such improvement, a $k$-mismatch index with $O(n\log^{k-1} n)$ space and the same query time as $k$-errata trees. Previously, due to a result of Chan, Lam, Sung, Tam, and Wong (Algorithmica 2010), such an $O(n\log^{k-1} n)$-size index has been known only for texts over alphabets of constant size. In this setting, however, we obtain an even smaller $k$-mismatch index of size only $O(n \log^{k-2+\varepsilon+\frac{2}{k+2-(k \bmod 2)}} n)\subseteq O(n\log^{k-1.5+\varepsilon} n)$ for $2\le k\le O(1)$ and any constant $\varepsilon>0$. Along the way, we also develop improved indexes for short patterns, offering better trade-offs in this practically relevant special case.
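
To make the $k$-mismatch query semantics concrete, the following is a minimal sketch of a naive baseline that scans every length-$m$ window of the text and keeps those within Hamming distance $k$ of the pattern. It runs in $O(nm)$ time with no preprocessing. It is not the $k$-errata tree and not the index from the paper; the function names `hamming` and `k_mismatch_occurrences` are illustrative assumptions, not taken from the source.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Naive baseline (hypothetical helper, not the paper's index).

    Returns every start position i such that text[i:i+m] has Hamming
    distance at most k to the pattern, where m = len(pattern).
    """
    n, m = len(text), len(pattern)
    # range() is empty when the pattern is longer than the text.
    return [
        i
        for i in range(n - m + 1)
        if hamming(text[i:i + m], pattern) <= k
    ]


if __name__ == "__main__":
    # "ban" differs from "bxn" in one position; "nan" differs in two.
    print(k_mismatch_occurrences("banana", "bxn", 1))  # [0]
    print(k_mismatch_occurrences("banana", "bxn", 2))  # [0, 2]
```

An index in the sense of the abstract answers the same query without scanning the whole text at query time, trading preprocessing space for query speed.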

Comments:	SODA 2026
Subjects:	Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:2510.26264 [cs.DS]
	(or arXiv:2510.26264v1 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.2510.26264

Submission history

From: Jakub Radoszewski
[v1] Thu, 30 Oct 2025 08:45:00 UTC (49 KB)
