Computer Science > Data Structures and Algorithms

arXiv:2506.17613 (cs)
[Submitted on 21 Jun 2025]

Title: Contextual Pattern Mining and Counting


Authors: Ling Li, Daniel Gibney, Sharma V. Thankachan, Solon P. Pissis, Grigorios Loukides
Abstract: Given a string $P$ of length $m$, a longer string $T$ of length $n>m$, and two integers $l\geq 0$ and $r\geq 0$, the context of $P$ in $T$ is the set of all string pairs $(L,R)$, with $|L|=l$ and $|R|=r$, such that the string $LPR$ occurs in $T$. We introduce two problems related to the notion of context: (1) the Contextual Pattern Mining (CPM) problem, which, given $T$, $(m,l,r)$, and an integer $\tau>0$, asks to output the context of each substring $P$ of length $m$ of $T$, provided that the size of the context of $P$ is at least $\tau$; and (2) the Contextual Pattern Counting (CPC) problem, which asks to preprocess $T$ so that the size of the context of a given query string $P$ of length $m$ can be found efficiently. For CPM, we propose a linear-work algorithm that either uses only internal memory, or a bounded amount of internal memory together with external memory, which allows much larger datasets to be handled. For CPC, we propose an $\widetilde{\mathcal{O}}(n)$-space index that can be constructed in $\widetilde{\mathcal{O}}(n)$ time and answers queries in $\mathcal{O}(m)+\widetilde{\mathcal{O}}(1)$ time. We further improve the practical performance of the CPC index by optimizations that exploit the LZ77 factorization of $T$ and an upper bound on the query length. Using billion-letter datasets from different domains, we show that the external memory version of our CPM algorithm can deal with very large datasets using a small amount of internal memory while its runtime is comparable to that of the internal memory version. Interestingly, we also show that our optimized index for CPC outperforms an approach based on the state of the art for the reporting version of CPC [Navarro, SPIRE 2020] in terms of query time, index size, construction time, and construction space, often by more than an order of magnitude.
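To make the definitions concrete, here is a minimal, naive Python sketch of the notions from the abstract — the context of $P$ in $T$, CPM, and CPC. This is purely illustrative of the definitions (quadratic-time brute force), not the paper's linear-work algorithm or $\widetilde{\mathcal{O}}(n)$-space index; all function names are invented for this sketch.

```python
def context(T, P, l, r):
    """Set of pairs (L, R) with |L| = l and |R| = r such that
    the string L + P + R occurs in T."""
    m = len(P)
    ctx = set()
    # An occurrence of P at position i contributes a context pair only
    # if l characters fit to its left and r characters fit to its right.
    for i in range(l, len(T) - m - r + 1):
        if T[i:i + m] == P:
            ctx.add((T[i - l:i], T[i + m:i + m + r]))
    return ctx

def cpm(T, m, l, r, tau):
    """Contextual Pattern Mining: report the context of every length-m
    substring P of T whose context has size at least tau."""
    out = {}
    for i in range(len(T) - m + 1):
        P = T[i:i + m]
        if P not in out:
            c = context(T, P, l, r)
            if len(c) >= tau:
                out[P] = c
    return out

def cpc(T, P, l, r):
    """Contextual Pattern Counting: the size of the context of query P."""
    return len(context(T, P, l, r))
```

For example, in `T = "banana"` with `P = "ana"`, `l = 1`, `r = 0`, the two occurrences of `"ana"` yield the context `{("b", ""), ("n", "")}`, so the CPC answer is 2.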
Comments: 27 pages, 15 figures
Subjects: Data Structures and Algorithms (cs.DS); Databases (cs.DB)
Cite as: arXiv:2506.17613 [cs.DS]
  (or arXiv:2506.17613v1 [cs.DS] for this version)
  https://doi.org/10.48550/arXiv.2506.17613
arXiv-issued DOI via DataCite

Submission history

From: Solon Pissis
[v1] Sat, 21 Jun 2025 06:40:32 UTC (2,900 KB)