zbMATH — the first resource for mathematics

Document listing on repetitive collections. (English) Zbl 1381.68076
Fischer, Johannes (ed.) et al., Combinatorial pattern matching. 24th annual symposium, CPM 2013, Bad Herrenalb, Germany, June 17–19, 2013. Proceedings. Berlin: Springer (ISBN 978-3-642-38904-7/pbk). Lecture Notes in Computer Science 7922, 107-119 (2013).
Summary: Many document collections consist largely of repeated material, and several indexes have been designed to take advantage of this. There has been only preliminary work, however, on document retrieval for repetitive collections. In this paper we show how one of those indexes, the run-length compressed suffix array (RLCSA), can be extended to support document listing. In our experiments, our additional structures on top of the RLCSA can reduce the query time for document listing by an order of magnitude while still using total space that is only a fraction of the raw collection size. As a byproduct, we develop a new document listing technique for general collections that is of independent interest.
For the entire collection see [Zbl 1264.68004].

68P20 Information storage and retrieval of data
68P05 Data structures
68W32 Algorithms on strings
Full Text: DOI