an:06740852
Zbl 1370.68075
Munro, J. Ian; Navarro, Gonzalo; Nielsen, Jesper Sindahl; Shah, Rahul; Thankachan, Sharma V.
Top-\(k\) term-proximity in succinct space
EN
Algorithmica 78, No. 2, 379-393 (2017).
00367019
2017
j
68P20 68P05
document indexing; top-\(k\) document retrieval; ranked document retrieval; succinct data structures; compressed data structures; compact data structures; proximity search
Summary: Let \(\mathcal {D} = \{\mathsf {T}_1,\mathsf {T}_2,\dotsc,\mathsf {T}_D\}\) be a collection of \(D\) string documents of \(n\) characters in total, that are drawn from an alphabet set \(\varSigma =[\sigma ]\). The \textit{top-k document retrieval problem} is to preprocess \(\mathcal{D}\) into a data structure that, given a query \((P[1\dotsc p],k)\), can return the \(k\) documents of \(\mathcal{D}\) most relevant to the pattern \(P\). The relevance is captured using a predefined ranking function, which depends on the set of occurrences of \(P\) in \(\mathsf {T}_d\). For example, it can be the term frequency (i.e., the number of occurrences of \(P\) in \(\mathsf {T}_d\)), or it can be the term proximity (i.e., the distance between the closest pair of occurrences of \(P\) in \(\mathsf {T}_d\)), or a pattern-independent importance score of \(\mathsf {T}_d\) such as PageRank. Linear space and optimal query time solutions already exist for the general top-\(k\) document retrieval problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. However, space efficient data structures for term proximity based retrieval have been evasive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only \(o(n)\) bits on top of any compressed suffix array of \(\mathcal{D}\) and solves queries in \(O((p+k)\, {{\mathrm{polylog}}}\;n)\) time. We also show that scores that consist of a weighted combination of term proximity, term frequency, and document importance, can be handled using twice the space required to represent the text collection.