Navarro, Gonzalo; Thankachan, Sharma V.
Bottom-\(k\) document retrieval
J. Discrete Algorithms 32, 69-74 (2015).
2015
68P20 68P05 68W32
compact data structures; document retrieval; string collections
Summary: We consider the problem of retrieving the \(k\) documents from a collection of strings where a given pattern \(P\) appears least often. This has potential applications in data mining, bioinformatics, security, and big data. We show that adapting the classical linear-space solutions for this problem is trivial, but the compressed-space solutions are not easy to extend. We design a new solution for this problem that matches the best-known result when using \(2 | \mathsf{CSA} | + o(n)\) bits, where \(\mathsf{CSA}\) is a compressed suffix array. Our structure answers queries in the time needed by the \(\mathsf{CSA}\) to find the suffix array interval of the pattern plus \(O(k \lg k \lg^\varepsilon n)\) accesses to suffix array cells, for any constant \(\varepsilon > 0\).
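To make the query concrete, the following is a minimal brute-force sketch of bottom-\(k\) retrieval over a small collection held in memory. It is not the compressed structure described in the summary: it scans every document and uses no suffix array. The names `count_occurrences`, `bottom_k`, and `require_occurrence` are hypothetical. Skipping documents in which \(P\) does not occur, and breaking ties by document index, are assumptions made here for illustration; the summary does not specify either.

```python
import heapq


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        pos = text.find(pattern, pos + 1)
    return count


def bottom_k(docs, pattern, k, require_occurrence=True):
    """Return up to k (frequency, doc_id) pairs with the smallest frequency.

    Assumption: when require_occurrence is True, documents in which the
    pattern never occurs are skipped. Ties are broken by document index.
    """
    freqs = []
    for doc_id, text in enumerate(docs):
        f = count_occurrences(text, pattern)
        if f > 0 or not require_occurrence:
            freqs.append((f, doc_id))
    # nsmallest orders the tuples by frequency first, then by doc_id.
    return heapq.nsmallest(k, freqs)


docs = ["abracadabra", "banana", "aaaa", "xyz"]
result = bottom_k(docs, "a", 2)
```

This baseline reads the whole collection on every query, so it serves only as a reference for what the query returns. The structure in the summary instead starts from the suffix array interval of \(P\) and spends \(O(k \lg k \lg^\varepsilon n)\) further accesses to suffix array cells.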