zbMATH — the first resource for mathematics

Forbidden extension queries. (English) Zbl 1366.68029
Harsha, Prahladh (ed.) et al., 35th IARCS annual conference on foundations of software technology and theoretical computer science, FSTTCS 2015, Bangalore, India, December 16–18, 2015. Proceedings. Wadern: Schloss Dagstuhl – Leibniz Zentrum für Informatik (ISBN 978-3-939897-97-2). LIPIcs – Leibniz International Proceedings in Informatics 45, 320-335 (2015).
Summary: Document retrieval is one of the most fundamental problem in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem such as ranked document retrieval, document listing with two patterns and forbidden patterns have been studied. We introduce the problem of document retrieval with forbidden extensions.
Let \(\mathcal D=\{\mathsf T_1,\mathsf T_2,\dots ,\mathsf T_D\}\) be a collection of \(D\) string documents of \(n\) characters in total, and \(P^+\) and \(P^-\) be two query patterns, where \(P^+\) is a proper prefix of \(P^-\). We call \(P^-\) as the forbidden extension of the included pattern \(P^+\). A forbidden extension query \(\langle P^+,P^-\rangle\) asks to report all \(occ\) documents in \(\mathcal D\) that contains \(P^+\) as a substring, but does not contain \(P^-\) as one. A \(\mathsf{top}\)-\(k\) forbidden extension query \(\langle P^+,P^-,k \rangle\) asks to report those \(k\) documents among the occ documents that are most relevant to \(P^+\). We present a linear index (in words) with an \(O(|P^-| + \mathrm{occ})\) query time for the document listing problem. For the \(\mathsf {top}\)-\(k\) version of the problem, we achieve the following results, when the relevance of a document is based on PageRank:
– an \(O(n)\) space (in words) index with \(O(|P^-|\log \sigma+ k)\) query time, where \(\sigma\) is the size of the alphabet from which characters in \(\mathcal D\) are chosen. For constant alphabets, this yields an optimal query time of \(O(|P^-|+ k)\).
– for any constant \(\epsilon > 0\), a \(|\mathsf {CSA}| + |\mathsf {CSA}^*| + D\log \frac{n}{D} + O(n)\) bits index with \(O(\mathsf {search}(P)+ k \cdot \mathsf{t}_{\mathsf {SA}} \cdot \log ^{2+\epsilon} n)\) query time, where \(\mathsf {search}(P)\) is the time to find the suffix range of a pattern \(P\), \(\mathsf {t}_{\mathsf{SA}}\) is the time to find suffix (or inverse suffix) array value, and \(|\mathsf {CSA}^*|\) denotes the maximum of the space needed to store the compressed suffix array \(\mathsf {CSA}\) of the concatenated text of all documents, or the total space needed to store the individual \(\mathsf {CSA}\) of each document.
For the entire collection see [Zbl 1338.68006].

68P05 Data structures
68P20 Information storage and retrieval of data
Full Text: DOI