an:07152174
Zbl 1436.68083
Ganguly, Arnab; Shah, Rahul; Thankachan, Sharma V.
Succinct non-overlapping indexing
EN
Algorithmica 82, No. 1, 107-117 (2020).
00444083
2020
j
68P05 68P15 68W32
succinct data structures; range queries; suffix trees; string algorithms
Summary: Text indexing is a fundamental problem in computer science. The objective is to preprocess a text \(T\), so that, given a pattern \(P\), we can find all starting positions (or simply, occurrences) of \(P\) in \(T\) efficiently. In some cases, additional restrictions are imposed. We consider two variants, namely the \textit{non-overlapping indexing} problem, and the \textit{range non-overlapping indexing} problem. Given a text \(T\) having \(n\) characters, the non-overlapping indexing problem is defined as follows: preprocess \(T\) into a data structure, such that for any pattern \(P\), containing \(|P|\) characters, we can report a set containing the maximum number of non-overlapping occurrences of \(P\) in \(T\).
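To make the problem statement concrete, the following is a minimal sketch of the output a query should produce, computed by a plain left-to-right scan with no index at all. It relies on the standard fact that, because all occurrences have the same length, greedily taking the leftmost occurrence that does not overlap the previous choice yields a maximum-size set. The function name and the 0-based positions are illustrative assumptions, and this is not the indexed method of the paper.

```python
def max_nonoverlapping_occurrences(text: str, pattern: str) -> list[int]:
    """Return 0-based starts of a maximum set of non-overlapping occurrences.

    Illustrative baseline only: it scans the text directly instead of
    querying a preprocessed index.
    """
    m = len(pattern)
    if m == 0:
        return []  # an empty pattern has no meaningful occurrences here

    chosen = []
    # Find the leftmost occurrence, then keep searching from the first
    # position that no longer overlaps the occurrence just chosen.
    pos = text.find(pattern, 0)
    while pos != -1:
        chosen.append(pos)
        pos = text.find(pattern, pos + m)
    return chosen


if __name__ == "__main__":
    print(max_nonoverlapping_occurrences("abababa", "aba"))
```

In "abababa" the pattern "aba" occurs at positions 0, 2 and 4, but position 2 overlaps both neighbours, so at most two occurrences can be reported together.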
\textit{H. Cohen} and \textit{E. Porat} [Lect. Notes Comput. Sci. 5878, 1044--1053 (2009; Zbl 1273.68097)] showed that by maintaining a linear-space index in which the suffix tree of \(T\) is augmented with an \(O(n)\)-word data structure, a query \(P\) can be answered in optimal time \(O(|P|+\mathrm{nocc})\), where \(\mathrm{nocc}\) is the number of occurrences reported. We present the following new result. Let \(\mathsf{CSA}\) (not necessarily a compressed suffix array) be an index of \(T\) that can compute (i) the suffix range of \(P\) in \(\mathsf{search}(P)\) time, and (ii) a suffix array or an inverse suffix array value in \(\mathsf{t}_\mathsf{SA}\) time. By using \(\mathsf{CSA}\) alone, we can answer a query \(P\) in \(\mathsf{search}(P)+\mathsf{sort}(\mathrm{nocc})+O(\mathrm{nocc}\cdot \mathsf{t}_\mathsf{SA})\) time. The function \(\mathsf{sort}(k)\) denotes the time for sorting \(k\) numbers in \(\{1,2,\dots ,n\}\). In the range non-overlapping indexing problem, along with the pattern \(P\), two integers \(a\) and \(b\) with \(b \ge a\) are provided as input. The task is to report a set containing the maximum number of non-overlapping occurrences of \(P\) that lie within the range \([a, b]\). For any arbitrarily small positive constant \(\epsilon\), we present an \(O(n \log^\epsilon n)\)-word index with \(O(|P| + \mathrm{nocc}_{a,b})\) query time, where \(\mathrm{nocc}_{a,b}\) is the number of occurrences reported. Our index improves upon the result of Cohen and Porat [loc. cit.].
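The range variant can be illustrated with a naive baseline built on a plain suffix array: locate the suffix range of \(P\), collect every occurrence, keep those inside \([a, b]\), sort them, and select greedily. This sketch spends time proportional to all occurrences of \(P\), not only the reported ones, so it does not achieve the bounds stated in the summary and is not the construction of the paper. All function names are hypothetical, positions are 1-based, and "lie within \([a, b]\)" is assumed here to mean that the whole occurrence fits inside \(T[a..b]\).

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text: str) -> list[int]:
    """Naive suffix array (0-based starts), built by sorting suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_range(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Half-open range [lo, hi) of suffix array entries prefixed by pattern."""
    m = len(pattern)
    # Length-m prefixes of lexicographically sorted suffixes are themselves
    # in non-decreasing order, so binary search applies. Materialising them
    # costs O(n * m) and is done only to keep the illustration short.
    prefixes = [text[i:i + m] for i in sa]
    return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)


def range_nonoverlapping(text: str, sa: list[int], pattern: str,
                         a: int, b: int) -> list[int]:
    """1-based starts of a maximum non-overlapping set inside [a, b].

    Assumption: an occurrence starting at s counts only if s >= a and
    s + len(pattern) - 1 <= b.
    """
    m = len(pattern)
    if m == 0:
        return []
    lo, hi = suffix_range(text, sa, pattern)
    # Convert 0-based suffix starts p to 1-based starts p + 1 and filter.
    starts = sorted(p + 1 for p in sa[lo:hi] if a <= p + 1 and p + m <= b)

    chosen = []
    next_free = a  # first position not covered by a chosen occurrence
    for s in starts:
        if s >= next_free:
            chosen.append(s)
            next_free = s + m
    return chosen


if __name__ == "__main__":
    t = "abababab"
    print(range_nonoverlapping(t, build_suffix_array(t), "ab", 2, 7))
```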
Zbl 1273.68097