×

zbMATH — the first resource for mathematics

New algorithms on wavelet trees and applications to information retrieval. (English) Zbl 1243.68161
Summary: Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that this still falls short of exploiting all of the virtues of this versatile data structure. In particular we show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries. We explore several applications of these queries in information retrieval, in particular document retrieval in hierarchical and temporal documents, and in the representation of inverted lists.

MSC:
68P05 Data structures
68P20 Information storage and retrieval of data
68W05 Nonnumerical algorithms
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Gagie, T.; Puglisi, S.; Turpin, A., Range quantile queries: another virtue of wavelet trees, (), 1-6
[2] Navarro, G.; Puglisi, S. J., Dual-sorted inverted lists, (), 310-322
[3] R. Grossi, A. Gupta, J. S. Vitter, High-order entropy-compressed text indexes, in: Proc. 14th Symposium on Discrete Algorithms, SODA, 2003, pp. 841-850. · Zbl 1092.68584
[4] Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G., Compressed representations of sequences and full-text indexes, ACM transactions on algorithms, 3, 2, (2007), Article 20 · Zbl 1321.68263
[5] Ferragina, P.; Manzini, G., Indexing compressed texts, Journal of the ACM, 52, 4, 552-581, (2005) · Zbl 1323.68261
[6] Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G., An alphabet-friendly FM-index, (), 150-160 · Zbl 1111.68429
[7] Mäkinen, V.; Navarro, G., Succinct suffix arrays based on run-length encoding, Nordic journal of computing, 12, 1, 40-66, (2005) · Zbl 1085.68031
[8] Mäkinen, V.; Navarro, G., Implicit compression boosting with applications to self-indexing, (), 214-226
[9] Chazelle, B., A functional approach to data structures and its use in multidimensional searching, SIAM journal on computing, 17, 3, 427-462, (1988) · Zbl 0679.68074
[10] Mäkinen, V.; Navarro, G., Position-restricted substring searching, (), 703-714 · Zbl 1145.68392
[11] P. Bose, M. He, A. Maheshwari, P. Morin, Succinct orthogonal range search structures on a grid with applications to text indexing, in: Proc. 11th International Symposium on Algorithms and Data Structures, WADS, 2009, pp. 98-109. · Zbl 1253.68103
[12] Brisaboa, N.; Luaces, M.; Navarro, G.; Seco, D., A fun application of compact data structures to indexing geographic data, (), 77-88
[13] J. Barbay, G. Navarro, Compressed representations of permutations, and applications, in: Proc. 26th International Symposium on Theoretical Aspects of Computer Science, STACS, 2009, pp. 111-122. · Zbl 1236.68063
[14] Barbay, J.; Claude, F.; Navarro, G., Compact rich-functional binary relation representations, (), 172-185
[15] Navarro, G., Indexing text using the ziv – lempel trie, Journal of discrete algorithms, 2, 1, 87-114, (2004) · Zbl 1118.68443
[16] Y.-F. Chien, W.-K. Hon, R. Shah, J. S. Vitter, Geometric Burrows-Wheeler transform: Linking range searching and text indexing, in: Proc. Data Compression Conference, DCC, 2008, pp. 252-261.
[17] Claude, F.; Navarro, G., Self-indexed text compression using straight-line programs, (), 235-246 · Zbl 1233.68132
[18] Välimäki, N.; Mäkinen, V., Space-efficient algorithms for document retrieval, (), 205-215 · Zbl 1138.68401
[19] J. Barbay, C. Kenyon, Adaptive intersection and \(t\)-threshold problems, in: Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, 2002, pp. 390-399. · Zbl 1093.68580
[20] Har-Peled, S.; Muthukrishnan, S., Range medians, (), 503-514 · Zbl 1158.68369
[21] S. Stolinski, S. Grabowski, W. Bieniecki, On efficient implementations of median filters in theory and practice, unpublished manuscript (2010).
[22] Crochemore, M.; Iliopoulos, C. S.; Rahman, M., Finding patterns in given intervals, (), 645-656 · Zbl 1147.68470
[23] Keller, O.; Kopelowitz, T.; Lewenstein, M., Range non-overlapping indexing and successive List indexing, (), 625-636 · Zbl 1209.68160
[24] M. Crochemore, C. S. Iliopoulos, M. Kubica, M. Rahman, T. Walen, Improved algorithms for the range next value problem and applications, in: Proc. 25th Symposium on Theoretical Aspects of Computer Science, STACS, 2008, pp. 205-216. · Zbl 1259.68226
[25] Hon, W.-K.; Shah, R.; Thankachan, S.; Vitter, J. S., String retrieval for multi-pattern queries, (), 55-66
[26] Manber, U.; Myers, G., Suffix arrays: a new method for on-line string searches, SIAM journal on computing, 22, 5, 935-948, (1993) · Zbl 0784.68027
[27] S. Muthukrishnan, Efficient algorithms for document retrieval problems, in: Proc 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, 2002, pp. 657-666. · Zbl 1093.68588
[28] M. Paˇtraşcu, Succincter, in: Proc. 49th IEEE Annual Symposium on Foundations of Computer Science, FOCS, 2008, pp. 305-313.
[29] R. Raman, V. Raman, S. Rao, Succinct indexable dictionaries with applications to encoding \(k\)-ary trees and multisets, in: Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, 2002, pp. 233-242. · Zbl 1093.68582
[30] Blum, M.; Floyd, R.W.; Pratt, V.R.; Rivest, R.L.; Tarjan, R.E., Time bounds for selection, Journal of computer and system sciences, 7, 4, 448-461, (1973) · Zbl 0278.68033
[31] Krizanc, D.; Morin, P.; Smid, M.H.M., Range mode and range Median queries on lists and trees, Nordic journal of computing, 12, 1, 1-17, (2005) · Zbl 1083.68028
[32] P. Bose, E. Kranakis, P. Morin, Y. Tang, Approximate range mode and range median queries, in: Proc. 22nd Symposium on Theoretical Aspects of Computer Science, STACS, 2005, pp. 377-388. · Zbl 1118.68441
[33] Gfeller, B.; Sanders, P., Towards optimal range medians, (), 475-486 · Zbl 1248.68180
[34] Petersen, H., Improved bounds for range mode and range Median queries, (), 418-423 · Zbl 1132.68426
[35] Petersen, H.; Grabowski, S., Range mode and range Median queries in constant time and sub-quadratic space, Information processing letters, 109, 4, 225-228, (2009) · Zbl 1191.68343
[36] Brodal, G.S.; Jørgensen, A.G., Data structures for range Median queries, (), 822-831 · Zbl 1273.68096
[37] Brodal, G.S.; Gfeller, B.; Jørgensen, A.G.; Sanders, P., Towards optimal range medians, Theoretical computer science, 412, 24, 2588-2601, (2011) · Zbl 1220.68052
[38] A.G. Jørgensen, K.D. Larsen, Range selection and median: Tight cell probe lower bounds and adaptive data structures, in: Proc. 22nd Symposium on Discrete Algorithms, SODA, 2011, pp. 805-813. · Zbl 1373.68196
[39] Mäkinen, V.; Navarro, G.; Ukkonen, E., Transposition invariant string matching, Journal of algorithms, 56, 2, 124-153, (2005) · Zbl 1083.68030
[40] C.-C. Yu, W.-K. Hon, B.-F. Wang, Efficient data structures for the orthogonal range successor problem, in: Proc. 15th International Computing and Combinatorics Conference, COCOON, 2009, pp. 96-105. · Zbl 1248.68175
[41] H. Gabow, J. Bentley, R. Tarjan, Scaling and related techniques for geometry problems, in: Proc. 16 ACM Symposium on Theory of Computing, STOC, 1984, pp. 135-143.
[42] E. Demaine, I. Munro, Adaptive set intersections, unions, and differences, in: Proc. 11th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, 2000, pp. 743-752. · Zbl 0957.68124
[43] Barbay, J.; Kenyon, C., Alternation and redundancy analysis of the intersection problem, ACM transactions on algorithms, 4, 1, (2008) · Zbl 1445.68338
[44] Barbay, J.; López-Ortiz, A.; Lu, T.; Salinger, A., An experimental investigation of set intersection algorithms for text searching, ACM journal of experimental algorithmics, 14, 3, (2009), Article 7 · Zbl 1284.68222
[45] Navarro, G.; Mäkinen, V., Compressed full text indexes, ACM computing surveys, 39, 1, (2007), Article 2 · Zbl 1321.68263
[46] Fischer, J.; Heun, V., A new succinct representation of RMQ-information and improvements in the enhanced suffix array, (), 459-470 · Zbl 1176.68058
[47] Sadakane, K., Succinct data structures for flexible text retrieval systems, Journal of discrete algorithms, 5, 1, 12-22, (2007) · Zbl 1137.68360
[48] Gagie, T.; Navarro, G.; Puglisi, S. J., Colored range queries and document retrieval, (), 67-81
[49] Baeza-Yates, R.; Ribeiro, B., Modern information retrieval, (1999), Addison-Wesley
[50] Witten, I.; Moffat, A.; Bell, T., Managing gigabytes, (1999), Morgan Kaufmann Publishers
[51] Baeza-Yates, R.; Moffat, A.; Navarro, G., Searching large text collections, (), 195-244 · Zbl 1024.68022
[52] Zobel, J.; Moffat, A., Inverted files for text search engines, ACM computing surveys, 38, 2, (2006), art. 6
[53] Zobel, J.; Moffat, A., Exploring the similarity space, ACM SIGIR forum, 32, 1, 18-34, (1998)
[54] Persin, M.; Zobel, J.; Sacks-Davis, R., Filtered document retrieval with frequency-sorted indexes, Journal of the American society for information sicence, 47, 10, 749-764, (1996)
[55] V. Anh, A. Moffat, Pruned query evaluation using pre-computed impacts, in: Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2006, pp. 372-379.
[56] T. Strohman, B. Croft, Efficient document retrieval in main memory, in: Proc. 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2007, pp. 175-182.
[57] Zipf, G., Human behaviour and the principle of least effort, (1949), Addison-Wesley
[58] Baeza-Yates, R., A fast set intersection algorithm for sorted sequences, (), 400-408 · Zbl 1103.68485
[59] Baeza-Yates, R.; Salinger, A., Experimental analysis of a fast intersection algorithm for sorted sequences, (), 13-24
[60] Barbay, J.; López-Ortiz, A.; Lu, T., Faster adaptive set intersections for text searching, (), 146-157
[61] P. Sanders, F. Transier, Intersection in integer inverted indices, in: Proc. 9th Workshop on Algorithm Engineering and Experiments, ALENEX, 2007.
[62] Culpepper, J. S.; Moffat, A., Compact set representation for information retrieval, (), 137-148
[63] Navarro, G.; Moura, E.; Neubert, M.; Ziviani, N.; Baeza-Yates, R., Adding compression to block addressing inverted indexes, Information retrieval, 3, 1, 49-77, (2000)
[64] Hull, D.A., Stemming algorithms: A case study for detailed evaluation, Journal of the American society for information science, 47, 1, 70-84, (1996)
[65] Xu, J.; Croft, W.B., Corpus-based stemming using cooccurrence of word variants, ACM transactions on information systems, 16, 1, 61-81, (1998)
[66] D. Hiemstra, V. Mihajlović, The simplest evaluation measures for XML information retrieval that could possibly work, in: Proc. INEX Workshop on Element Retrieval Methodology, 2005.
[67] J. Pehcevski, Evaluation of effective XML information retrieval, Ph.D. thesis, RMIT University, Australia (2006).
[68] Lalmas, M., XML retrieval, vol. 1, (2009), Morgan & Claypool Publishers
[69] D. Arroyuelo, F. Claude, S. Maneth, V. Mäkinen, G. Navarro, K. Nguyen, J. Sirén, N. Välimäki, Fast in-memory XPath search over compressed text and tree indexes, in: Proc. 26th IEEE International Conference on Data Engineering, ICDE, 2010, pp. 417-428.
[70] G. Jacobson, Space-efficient static trees and graphs, in: Proc. 30th Symposium on Foundations of Computer Science, FOCS, 1989, pp. 549-554.
[71] K. Sadakane, G. Navarro, Fully-functional succinct trees, in: Proc. 21st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, 2010, pp. 134-149. · Zbl 1288.05046
[72] A. Golynski, I. Munro, S. Rao, Rank/select operations on large alphabets: a tool for text indexing, in: Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, 2006, pp. 368-373. · Zbl 1192.68800
[73] D. Okanohara, K. Sadakane, Practical entropy-compressed rank/select dictionary, in: Proc. 9th Workshop on Algorithm Engineering and Experiments, ALENEX, 2007.
[74] Munro, I., Tables, (), 37-42
[75] Culpepper, J.S.; Navarro, G.; Puglisi, S.J.; Turpin, A., Top-k ranked document search in general text databases, (), 194-205, (part II) · Zbl 1287.68035
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.