an:07300875
Zbl 07300875
Hooshmand, Sahar; Abedin, Paniz; Külekci, M. Oğuzhan; Thankachan, Sharma V.
I/O-efficient data structures for non-overlapping indexing
EN
Theor. Comput. Sci. 857, 1-7 (2021).
00457516
2021
j
68Q
suffix trees; data structure; string algorithms
Summary: The non-overlapping indexing problem is defined as follows: pre-process a given text \(\mathsf{T}[1, n]\) of length \(n\) into a data structure such that whenever a pattern \(P[1, m]\) comes as an input, we can efficiently report the largest set of non-overlapping occurrences of \(P\) in \(\mathsf{T}\). The best-known solution is by Cohen and Porat [ISAAC 2009]. The size of their structure is \(O(n)\) words and the query time is optimal \(O(m + \mathsf{nocc})\), where \(\mathsf{nocc}\) is the output size. Later, Ganguly et al. [CPM 2015 and Algorithmica 2020] proposed a compressed space solution. We study this problem in the cache-oblivious model and present a new data structure of size \(O(n \log n)\) words. It can answer queries in optimal \(O(\frac{m}{B} + \log_B n + \frac{\mathsf{nocc}}{B})\) I/O operations, where \(B\) is the block size. The space can be improved to \(O(n \log_{M/B} n)\) in the cache-aware model, where \(M\) is the size of main memory. Additionally, we study a generalization of this problem with an additional range \([s, e]\) constraint. Here the task is to report the largest set of non-overlapping occurrences of \(P\) in \(\mathsf{T}\) that are within the range \([s, e]\). We present an \(O(n \log^2 n)\) space data structure in the cache-aware model that can answer queries in optimal \(O(\frac{m}{B} + \log_B n + \frac{\mathsf{nocc}_{[s, e]}}{B})\) I/O operations, where \(\mathsf{nocc}_{[s, e]}\) is the output size.
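To make the problem statement concrete, the following is a minimal sketch of what a query must output. It is not the data structure of the paper: it scans the text directly, without any index or I/O-efficient layout. It relies on the standard observation that, since all occurrences of \(P\) have the same length \(m\), greedily taking the leftmost occurrence and then the next occurrence starting at least \(m\) positions later yields a largest non-overlapping set. The function names `occurrences` and `non_overlapping` are hypothetical, positions are 0-based (the summary uses 1-based indexing), and "within the range \([s, e]\)" is assumed here to mean that the whole occurrence lies inside the range.

```python
def occurrences(text, pattern):
    """Return all 0-based start positions of pattern in text, overlaps included."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one character only, so overlapping matches are also found.
        start = text.find(pattern, start + 1)
    return positions


def non_overlapping(text, pattern, s=None, e=None):
    """Return a largest set of pairwise non-overlapping occurrences of pattern.

    If s and e are given, only occurrences lying entirely inside the
    inclusive 0-based range [s, e] are considered (an assumption of this
    sketch about what "within the range" means).
    """
    m = len(pattern)
    if m == 0:
        return []
    lo = 0 if s is None else s
    hi = len(text) - 1 if e is None else e

    chosen = []
    next_free = lo  # smallest start position that does not overlap the last pick
    for pos in occurrences(text, pattern):
        if pos >= next_free and pos + m - 1 <= hi:
            chosen.append(pos)
            next_free = pos + m
    return chosen
```

For instance, in the text `abababa` the pattern `aba` occurs at positions 0, 2 and 4, and the greedy scan keeps 0 and 4. Restricting the same query to the range \([1, 6]\) excludes the occurrence at 0, so only position 2 is reported.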