zbMATH — the first resource for mathematics

Space-efficient algorithms for document retrieval. (English) Zbl 1138.68401
Ma, Bin (ed.) et al., Combinatorial pattern matching. 18th annual symposium, CPM 2007, London, Canada, July 9–11, 2007. Proceedings. Berlin: Springer (ISBN 978-3-540-73436-9/pbk). Lecture Notes in Computer Science 4580, 205-215 (2007).
Summary: We study the Document Listing problem, where a collection $$D$$ of documents $$d _{1},\dots ,d _{k }$$ of total length $$\sum _{i } d _{i } = n$$ is to be preprocessed, so that one can later efficiently list all the ndoc documents containing a given query pattern $$P$$ of length $$m$$ as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem; with $$O(n)$$ time preprocessing, one can answer the queries in $$O(m+\text{ndoc})$$ time. In this paper, we improve the space-requirement of Muthukrishnan’s solution from $$O(n \log n)$$ bits to $$|\text{CSA}| + 2n + n \log k (1 + o(1))$$ bits, where $$|\text{CSA}| \leq n \log |\Sigma |(1 + o(1))$$ is the size of any suitable compressed suffix array (CSA), and $$\Sigma$$ is the underlying alphabet of documents. The time requirement depends on the CSA used, but we can obtain e.g. the optimal $$O(m+\text{ndoc})$$ time when $$|\Sigma|$$, $$k= O(\text{polylog}(n))$$. For general $$|\Sigma |$$, $$k$$ the time requirement becomes $$O(m \log |\Sigma|+\text{ndoc} \log k)$$. Sadakane (ISAAC 2002) has developed a similar space-efficient variant of Muthukrishnan’s solution; we obtain a better time requirement in most cases, but a slightly worse space requirement.
For the entire collection see [Zbl 1136.68008].

MSC:
 68P20 Information storage and retrieval of data
Full Text: