×

zbMATH — the first resource for mathematics

Colored range queries and document retrieval. (English) Zbl 1292.68045
Summary: Colored range queries are a well-studied topic in computational geometry and database research that, in the past decade, have found exciting applications in information retrieval. In this paper, we give improved time and space bounds for three important one-dimensional colored range queries – colored range listing, colored range top-\(k\) queries and colored range counting – and, as a consequence, new bounds for various document retrieval problems on general collections of sequences. Colored range listing is the problem of preprocessing a sequence \(S[1,n]\) of colors so that, later, given an interval \([i,i+\ell -1]\), we list the different colors in \(S[i,i+\ell-1]\). Colored range top-\(k\) queries ask instead for \(k\) most frequent colors in the interval. Colored range counting asks for the number of different colors in the interval.
We first describe a framework including almost all recent results on colored range listing and document listing, which suggests new combinations of data structures for these problems. For example, we give the first compressed data structure (using \(nH_k(S)+o(n\log\sigma)\) bits, for any \(k=o(\log_\sigma n)\), where \(H_k(S)\) is the \(k\)-th order empirical entropy of \(S\) and \(\sigma\) the number of different colors in \(S\)) that answers colored range listing queries in constant time per returned result. We also give an efficient data structure for document listing whose size is bounded in terms of the \(k\)-th order entropy of the library of documents. We then show how (approximate) colored top-\(k\) queries can be reduced to (approximate) range-mode queries on subsequences, yielding the first efficient data structure for this problem. Finally, we show how modified wavelet trees can support colored range counting using \(nH_0(S)+\mathcal O(n)+o(nH_0(S))\) bits, and answer queries in \(\mathcal O(\log\ell)\) time. As far as we know, this is the first data structure in which the query time depends only on \(\ell\) and not on \(n\). We also show how our data structure can be made dynamic.

MSC:
68P20 Information storage and retrieval of data
68P05 Data structures
Software:
PATRICIA
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Apostolico, A., The myriad virtues of subword trees, (Combinatorial Algorithms on Words, (1985), Springer-Verlag), 85-96 · Zbl 0572.68067
[2] Baeza-Yates, R., Applications of web query mining, (Proceedings of the 27th European Conference on IR Research, (2005), Springer), 7-22
[3] Baeza-Yates, R.; Ribeiro, B., Modern information retrieval, (1999), Addison-Wesley
[4] Barbay, J.; Gagie, T.; Navarro, G.; Nekrich, Y., Alphabet partitioning for compressed rank/select with applications, (Proceedings of the 21st International Symposium on Algorithms and Computations, (2010), Springer), 315-326 · Zbl 1310.68060
[5] Barbay, J.; He, M.; Munro, J. I.; Rao, S. S., Succinct indexes for strings, binary relations and multi-labelled trees, (Proceedings of the 18th Symposium on Discrete Algorithms, (2007), SIAM), 680-689 · Zbl 1302.68097
[6] Belazzougui, D.; Navarro, G., Improved compressed indexes for full-text document retrieval, (Proceedings of the 18th Symposium on String Processing and Information Retrieval, (2011), Springer), 386-397
[7] Belazzougui, D.; Navarro, G., New lower and upper bounds for representing sequences, (Proceedings of the 20th Annual European Symposium on Algorithms, LNCS, vol. 7501, (2012), Springer), 181-192 · Zbl 1365.68260
[8] Bille, P.; Landau, G. M.; Raman, R.; Sadakane, K.; Satti, S. R.; Weimann, O., Random access to grammar-compressed strings, (Proceedings of the 22nd Symposium on Discrete Algorithms, (2011), SIAM), 373-389 · Zbl 1375.68229
[9] Bozanis, P.; Kitsios, N.; Makris, C.; Tsakalidis, A. K., New upper bounds for generalized intersection searching problems, (Proceedings of the 22nd International Colloquium on Algorithms, Languages and Programming, (1995), Springer), 464-474 · Zbl 1412.68287
[10] Brodal, G. S.; Gfeller, B.; Jørgensen, A. G.; Sanders, P., Towards optimal range medians, Theoretical Computer Science, 412, 2588-2601, (2011) · Zbl 1220.68052
[11] Carlsson, S.; Munro, J. I.; Poblete, P. V., An implicit binomial queue with constant insertion time, (Proceedings of the 1st Scandinavian Workshop on Algorithm Theory, (1988), Springer), 1-13 · Zbl 0651.68037
[12] Chan, T.; Durocher, S.; Larsen, K. G.; Morrison, J.; Wilkinson, B. T., Linear-space data structures for range mode query in arrays, (Proceedings of the 29th Symposium on Theoretical Aspects of Computer Science, Leibnitz Zentrum für Informatik, (2012)), 290-301 · Zbl 1245.68071
[13] Culpepper, J. S.; Navarro, G.; Puglisi, S. J.; Turpin, A., Top-\(k\) ranked document search in general text databases, (Proceedings of the 18th European Symposium on Algorithms, (2010), Springer), 194-205 · Zbl 1287.68035
[14] Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G., Compressed representations of sequences and full-text indexes, ACM Transactions on Algorithms, 3, (2007), article 20 · Zbl 1321.68263
[15] Ferragina, P.; Venturini, R., A simple storage scheme for strings achieving entropy bounds, Theoretical Computer Science, 371, 115-121, (2007) · Zbl 1110.68029
[16] Fischer, J., Optimal succinctness for range minimum queries, (Proceedings of the 9th Latin American Symposium on Theoretical Informatics, (2010), Springer), 158-169 · Zbl 1283.68141
[17] Fischer, J.; Heun, V., A new succinct representation of RMQ-information and improvements in the enhanced suffix array, (Proceedings of the 1st Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, (2007), Springer), 459-470 · Zbl 1176.68058
[18] Gabow, H. N.; Bentley, J. L.; Tarjan, R. E., Scaling and related techniques for geometry problems, (Proceedings of the 16th Symposium on Theory of Computing, (1984), ACM), 135-143
[19] Gagie, T.; Kärkkäinen, J., Counting colours in compressed strings, (Proceedings of the 22nd Annual Symposium on Combinatorial Pattern Matching, (2011), Springer), 197-207 · Zbl 1339.68331
[20] Gagie, T.; Navarro, G.; Puglisi, S. J., Colored range queries and document retrieval, (Proceedings of the 17th Symposium on String Processing and Information Retrieval, (2010), Springer), 67-81
[21] Gagie, T.; Navarro, G.; Puglisi, S. J., New algorithms on wavelet trees and applications to information retrieval, Theoretical Computer Science, 426-427, 25-41, (2012) · Zbl 1243.68161
[22] Gagie, T.; Puglisi, S. J.; Turpin, A., Range quantile queries: another virtue of wavelet trees, (Proceedings of the 16th Symposium on String Processing and Information Retrieval, (2009), Springer), 1-6
[23] Golynski, A., Optimal lower bounds for rank and select indexes, Theoretical Computer Science, 387, 348-359, (2007) · Zbl 1144.68016
[24] Golynski, A.; Raman, R.; Rao, S., On the redundancy of succinct data structures, (Proceedings of the 11th Scandinavian Workshop on Algorithm Theory, (2008), Springer), 148-159 · Zbl 1155.68374
[25] González, R.; Navarro, G., Compressed text indexes with fast locate, (Proceedings of the 18th Symposium on Combinatorial Pattern Matching, (2007), Springer), 216-227 · Zbl 1138.68415
[26] Greve, M.; Jørgensen, A. G.; Larsen, K. D.; Truelsen, J., Cell probe lower bounds and approximations for range mode, (Proceedings of the 37th International Colloquium on Algorithms, Languages and Programming, (2010), Springer), 605-616 · Zbl 1288.68046
[27] Grossi, R.; Gupta, A.; Vitter, J. S., High-order entropy-compressed text indexes, (Proceedings of the 14th Symposium on Discrete Algorithms, (2003), SIAM), 636-645 · Zbl 1092.68584
[28] Grossi, R.; Orlandi, A.; Raman, R., Optimal trade-offs for succinct string indexes, (Proceedings of the 37th International Colloquium on Algorithms, Languages and Programming, (2010), Springer), 678-689 · Zbl 1288.68047
[29] He, M.; Munro, I., Succinct representations of dynamic strings, (Proceedings of the 17th International Symposium on String Processing and Information Retrieval, (2010), Springer), 334-346
[30] Hon, W.; Shah, R.; Wu, S., Efficient index for retrieving top-\(k\) most frequent documents, (Proceedings of the 16th Symposium on String Processing and Information Retrieval, (2009), Springer), 182-193
[31] Hon, W. K.; Shah, R.; Thankachan, S. V., Towards an optimal space-and-query-time index for top-\(k\) document retrieval, (Proceedings of the 23rd Symposium on Combinatorial Pattern Matching, (2012), Springer), 173-184 · Zbl 1358.68092
[32] Hon, W. K.; Shah, R.; Vitter, J., Space-efficient framework for top-\(k\) string retrieval problems, (Proceedings of the 50th Symposium on Foundations of Computer Science, (2009), IEEE), 713-722 · Zbl 1292.68182
[33] Ilyas, I. F.; Beskales, G.; Soliman, M. A., A survey of top-\(K\) query processing techniques in relational database systems, ACM Computing Surveys, 40, (2008)
[34] Janardan, R.; Lopez, M. A., Generalized intersection searching problems, International Journal of Computational Geometry and Applications, 3, 39-69, (1993) · Zbl 0777.68078
[35] Kaplan, H.; Rubin, N.; Sharir, M.; Verbin, E., Efficient colored orthogonal range counting, SIAM Journal on Computing, 38, 982-1011, (2008) · Zbl 1187.68172
[36] Karpinski, M.; Nekrich, Y., Top-\(K\) color queries for document retrieval, (Proceedings of the 22nd Symposium on Discrete Algorithms, (2011), SIAM), 401-411 · Zbl 1373.68197
[37] Lai, Y. K.; Poon, C. K.; Shi, B., Approximate colored range and point enclosure queries, Journal of Discrete Algorithms, 6, 420-432, (2008) · Zbl 1160.68352
[38] Mäkinen, V.; Navarro, G., Succinct suffix arrays based on run-length encoding, Nordic Journal of Computing, 12, 40-66, (2005) · Zbl 1085.68031
[39] Mäkinen, V.; Navarro, G., Implicit compression boosting with applications to self-indexing, (Proceedings of the 14th Symposium on String Processing and Information Retrieval, (2007), Springer), 229-241
[40] Mäkinen, V.; Navarro, G., Rank and select revisited and extended, Theoretical Computer Science, 387, 332-347, (2007) · Zbl 1144.68023
[41] Manber, U.; Myers, G., Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing, 22, 935-948, (1993) · Zbl 0784.68027
[42] Manzini, G., An analysis of the Burrows-Wheeler transform, Journal of the ACM, 48, 407-430, (2001) · Zbl 1323.68262
[43] Matias, Y.; Muthukrishnan, S.; Sahinalp, S. C.; Ziv, J., Augmenting suffix trees, with applications, (Proceedings of the 6th European Symposium on Algorithms, (1998), Springer), 67-78
[44] Milidiú, R. L.; Laber, E. S., Bounding the inefficiency of length-restricted prefix codes, Algorithmica, 31, 513-529, (2001) · Zbl 1012.94008
[45] Morrison, D. R., PATRICIA — practical algorithm to retrieve information coded in alphanumeric, Journal of the ACM, 15, (1968)
[46] Munro, I., Tables, (Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science, (1996), Springer), 37-42
[47] Muthukrishnan, S., Efficient algorithms for document retrieval problems, (Proceedings of the 13th Symposium on Discrete Algorithms, (2002), SIAM), 657-666 · Zbl 1093.68588
[48] Navarro, G.; Mäkinen, V., Compressed full-text indexes, ACM Computing Surveys, 39, (2007) · Zbl 1321.68263
[49] Navarro, G.; Nekrich, Y., Top-\(k\) document retrieval in optimal time and linear space, (Proceedings of the 22nd Symposium on Discrete Algorithms, (2012), SIAM), 1066-1077 · Zbl 1422.68063
[50] Navarro, G.; Puglisi, S. J.; Valenzuela, D., Practical compressed document retrieval, (Proceedings of the 10th International Symposium on Experimental Algorithms, (2011), Springer), 193-205
[51] Okanohara, D.; Sadakane, K., Practical entropy-compressed rank/select dictionary, (Proceedings of the Workshop on Algorithm Engineering and Experiments, (2007), SIAM)
[52] Petersen, H.; Grabowski, S., Range mode and range Median queries in constant time and sub-quadratic space, Information Processing Letters, 109, 225-228, (2009) · Zbl 1191.68343
[53] Raman, R.; Raman, V.; Rao, S., Succinct indexable dictionaries with applications to encoding \(k\)-ary trees and multisets, (Proceedings of the 13th Symposium on Discrete Algorithms, (2002), SIAM), 233-242 · Zbl 1093.68582
[54] Sadakane, K., New text indexing functionalities of the compressed suffix arrays, Journal of Algorithms, 48, 294-313, (2003) · Zbl 1100.68563
[55] Sadakane, K., Succinct data structures for flexible text retrieval systems, Journal of Discrete Algorithms, 5, 12-22, (2007) · Zbl 1137.68360
[56] Sadakane, K.; Navarro, G., Fully-functional succinct trees, (Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, (2010), SIAM), 134-149 · Zbl 1288.05046
[57] Välimäki, N.; Mäkinen, V., Space-efficient algorithms for document retrieval, (Proceedings of the 18th Symposium on Combinatorial Pattern Matching, (2007), Springer), 205-215 · Zbl 1138.68401
[58] Weiner, P., Linear pattern matching algorithm, (Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, (1973), IEEE), 1-11
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.