zbMATH — the first resource for mathematics

Space-efficient algorithms for document retrieval. (English) Zbl 1138.68401
Ma, Bin (ed.) et al., Combinatorial pattern matching. 18th annual symposium, CPM 2007, London, Canada, July 9–11, 2007. Proceedings. Berlin: Springer (ISBN 978-3-540-73436-9/pbk). Lecture Notes in Computer Science 4580, 205-215 (2007).
Summary: We study the Document Listing problem, where a collection \(D\) of documents \(d _{1},\dots ,d _{k }\) of total length \(\sum _{i } d _{i } = n\) is to be preprocessed, so that one can later efficiently list all the ndoc documents containing a given query pattern \(P\) of length \(m\) as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem; with \(O(n)\) time preprocessing, one can answer the queries in \(O(m+\text{ndoc})\) time. In this paper, we improve the space-requirement of Muthukrishnan’s solution from \(O(n \log n)\) bits to \(|\text{CSA}| + 2n + n \log k (1 + o(1))\) bits, where \(|\text{CSA}| \leq n \log |\Sigma |(1 + o(1))\) is the size of any suitable compressed suffix array (CSA), and \(\Sigma \) is the underlying alphabet of documents. The time requirement depends on the CSA used, but we can obtain e.g. the optimal \(O(m+\text{ndoc})\) time when \(|\Sigma|\), \(k= O(\text{polylog}(n))\). For general \(|\Sigma |\), \(k\) the time requirement becomes \(O(m \log |\Sigma|+\text{ndoc} \log k)\). Sadakane (ISAAC 2002) has developed a similar space-efficient variant of Muthukrishnan’s solution; we obtain a better time requirement in most cases, but a slightly worse space requirement.
For the entire collection see [Zbl 1136.68008].

68P20 Information storage and retrieval of data
Full Text: DOI