×

Efficient fully-compressed sequence representations. (English) Zbl 1307.68029

Summary: We present a data structure that stores a sequence \(s[1..n]\) over alphabet \([1..\sigma]\) in \(n\mathcal H_0(s)+o(n)(\mathcal H_0(s)+1)\) bits, where \(\mathcal H_0(s)\) is the zero-order entropy of \(s\). This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time \(\mathcal O(\lg\lg\sigma)\) and average time \(\mathcal O(\lg\mathcal H_0(s))\). The worst-case complexity matches the best previous results, yet these had been achieved with data structures using \(n\mathcal H_0(s)+o(n\lg\sigma)\) bits. On highly compressible sequences the \(o(n\lg\sigma)\) bits of the redundancy may be significant compared to the \(n\mathcal H_0(s)\) bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented.
Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy.
The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations \(\pi\) with times for \(\pi()\) and \(\pi^{-1}()\) improved to loglogarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors.
Our structure is practical on large alphabets. Our experiments show that, as predicted by theory, it dominates the space/time tradeoff map of all the sequence representations, both in synthetic and application scenarios.

MSC:

68P05 Data structures
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)

Software:

WebGraph
PDFBibTeX XMLCite
Full Text: DOI Link

References:

[1] Arroyuelo, D.; González, S.; Oyarzún, M., Compressed self-indices supporting conjunctive queries on document collections, 43-54 (2010) · doi:10.1007/978-3-642-16321-0_5
[2] Barbay, J.; Claude, F.; Navarro, G., Compact rich-functional binary relation representations, No. 6034, 170-183 (2010) · Zbl 1283.68131
[3] Barbay, J., Golynski, A., Munro, J.I., Rao, S.S.: Adaptive searching in succinctly encoded binary relations and tree-structured documents. Theor. Comput. Sci. 387(3), 284-297 (2007) · Zbl 1144.68014 · doi:10.1016/j.tcs.2007.07.015
[4] Barbay, J., He, M., Munro, J.I., Rao, S.S.: Succinct indexes for strings, binary relations and multilabeled trees. ACM Trans. Algorithms 7(4), 52 (2011) · Zbl 1295.68227 · doi:10.1145/2000807.2000820
[5] Barbay, J., López-Ortiz, A., Lu, T., Salinger, A.: An experimental investigation of set intersection algorithms for text searching. ACM J. Exp. Algorithmics 14(3), 7 (2009) · Zbl 1284.68222
[6] Barbay, J.; Navarro, G., Compressed representations of permutations, and applications, 111-122 (2009) · Zbl 1236.68063
[7] Barbay, J., Navarro, G.: On compressing permutations and adaptive sorting. CoRR (2011). 1108.4408v1 · Zbl 1358.68079
[8] Belazzougui, D.; Navarro, G., Alphabet-independent compressed text indexing, No. 6942, 748-759 (2011) · Zbl 1325.68307
[9] Belazzougui, D.; Navarro, G., New lower and upper bounds for representing sequences, No. 7501, 181-192 (2012) · Zbl 1365.68260
[10] Boldi, P.; Vigna, S., The WebGraph framework I: compression techniques, 595-602 (2004) · doi:10.1145/988672.988752
[11] Brisaboa, N.; Luaces, M.; Navarro, G.; Seco, D., A new point access method based on wavelet trees, No. 5833, 297-306 (2009)
[12] Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994) · Zbl 1323.68262
[13] Clark, D.: Compact Pat Trees. Ph.D. Thesis, University of Waterloo, Canada (1996) · Zbl 1461.68271
[14] Clarke, C.; Cormack, G.; Tudhope, E., Relevance ranking for one to three term queries, 388-401 (1997)
[15] Claude, F.; Navarro, G., Practical rank/select queries over arbitrary sequences, 176-187 (2008)
[16] Claude, F.; Navarro, G.; Elomaa, T. (ed.); Mannila, H. (ed.); Orponen, P. (ed.), Extended compact web graph representations, No. 6060, 77-91 (2010), Berlin · Zbl 1284.68058 · doi:10.1007/978-3-642-12476-1_5
[17] Claude, F., Navarro, G.: Fast and compact web graph representations. ACM Trans. Web 4(4), 16 (2010) · doi:10.1145/1841909.1841913
[18] Demaine, E.; López-Ortiz, A.; Munro, J. I., Adaptive set intersections, unions, and differences, 743-752 (2000) · Zbl 0957.68124
[19] Fariña, A., Brisaboa, N., Navarro, G., Claude, F., Places, A., Rodríguez, E.: Word-based self-indexes for natural language text. ACM Trans. Inf. Syst. 30(1), 1 (2012) · doi:10.1145/2094072.2094073
[20] Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1), 4 (2009) · Zbl 1326.68132 · doi:10.1145/1613676.1613680
[21] Ferragina, P., Manzini, G.: Indexing compressed texts. J. ACM 52(4), 552-581 (2005) · Zbl 1323.68261 · doi:10.1145/1082036.1082039
[22] Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), 20 (2007) · Zbl 1321.68263 · doi:10.1145/1240233.1240243
[23] Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 372(1), 115-121 (2007) · Zbl 1110.68029 · doi:10.1016/j.tcs.2006.12.012
[24] Gagie, T.; Nekrich, Y., Worst-case optimal adaptive prefix coding, No. 5664, 315-326 (2009) · Zbl 1253.94038 · doi:10.1007/978-3-642-03367-4_28
[25] Golynski, A.: Optimal lower bounds for rank and select indexes. Theor. Comput. Sci. 387(3), 348-359 (2007) · Zbl 1144.68016 · doi:10.1016/j.tcs.2007.07.041
[26] Golynski, A., Cell probe lower bounds for succinct data structures, 625-634 (2009) · Zbl 1421.68028 · doi:10.1137/1.9781611973068.69
[27] Golynski, A.; Grossi, R.; Gupta, A.; Raman, R.; Srinivasa Rao, S., On the size of succinct indices, No. 4698, 371-382 (2007) · Zbl 1151.68385
[28] Golynski, A.; Munro, J. I.; Rao, S. S., Rank/select operations on large alphabets: a tool for text indexing, 368-373 (2006) · Zbl 1192.68800
[29] Golynski, A.; Raman, R.; Rao, S., On the redundancy of succinct data structures, No. 5124, 148-159 (2008) · Zbl 1155.68374
[30] González, R.; Grabowski, Sz.; Mäkinen, V.; Navarro, G., Practical implementation of rank and select queries, 27-38 (2005)
[31] González, R.; Navarro, G., Statistical encoding of succinct data structures, 294-305 (2006) · Zbl 1196.68060 · doi:10.1007/11780441_27
[32] Grossi, R.; Gupta, A.; Vitter, J., High-order entropy-compressed text indexes, 841-850 (2003) · Zbl 1092.68584
[33] Grossi, R.; Orlandi, A.; Raman, R., Optimal trade-offs for succinct string indexes, 678-689 (2010) · Zbl 1288.68047 · doi:10.1007/978-3-642-14165-2_57
[34] Grossi, R.; Orlandi, A.; Raman, R.; Srinivasa Rao, S., More haste, less waste: lowering the redundancy in fully indexable dictionaries, 517-528 (2009) · Zbl 1236.68064
[35] Grossi, R., Vitter, J.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378-407 (2006) · Zbl 1092.68115 · doi:10.1137/S0097539702402354
[36] Haskel, B., Puri, A., Netravali, A.: Digital Video: an Introduction to MPEG-2. Chapman & Hall, London (1997)
[37] Hernández, C.; Navarro, G., Compression of web and social graphs supporting neighbor and community queries (2011), New York
[38] Hreinsson, J. B.; Krøyer, M.; Pagh, R., Storing a compressed function with constant time access, 730-741 (2009) · Zbl 1256.68057
[39] Huffman, D.: A method for the construction of minimum-redundancy codes. Proc. IRE 40(9), 1090-1101 (1952) · Zbl 0137.13605 · doi:10.1109/JRPROC.1952.273898
[40] Levcopoulos, C., Petersson, O.: Sorting shuffled monotone sequences. Inf. Comput. 112(1), 37-50 (1994) · Zbl 0806.68030 · doi:10.1006/inco.1994.1050
[41] Mäkinen, V., Navarro, G.: Rank and select revisited and extended. Theor. Comput. Sci. 387(3), 332-347 (2007) · Zbl 1144.68023 · doi:10.1016/j.tcs.2007.07.013
[42] Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Trans. Algorithms 4(3), 32 (2008) · Zbl 1446.68043 · doi:10.1145/1367064.1367072
[43] Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407-430 (2001) · Zbl 1323.68262 · doi:10.1145/382780.382782
[44] Mehlhorn, K., Sorting presorted files, No. 67, 199-212 (1979) · Zbl 0395.68054 · doi:10.1007/3-540-09118-1_22
[45] Moffat, A., Turpin, A.: On the implementation of minimum-redundancy prefix codes. IEEE Trans. Commun. 45(10), 1200-1207 (1997) · doi:10.1109/26.634683
[46] Munro, I., Tables, No. 1180, 37-42 (1996) · doi:10.1007/3-540-62034-6_35
[47] Munro, I., Raman, R., Raman, V., Rao, S.S.: Succinct representations of permutations and functions. Theor. Comput. Sci. 438, 74-88 (2012) · Zbl 1245.68075 · doi:10.1016/j.tcs.2012.03.005
[48] Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 2 (2007) · Zbl 1321.68263 · doi:10.1145/1216370.1216372
[49] Navarro, G., Nekrich, Y.: Optimal dynamic sequence representations. In: Proc. 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2013, to appear) · Zbl 1320.68060
[50] Okanohara, D.; Sadakane, K., Practical entropy-compressed rank/select dictionary, 60-70 (2007) · Zbl 1428.68134
[51] Pearlman, W., Islam, A., Nagaraj, N., Said, A.: Efficient, low-complexity image coding with a set-partitioning embedded block coder. IEEE Trans. Circuits Syst. Video Technol. 14(11), 1219-1235 (2004) · doi:10.1109/TCSVT.2004.835150
[52] Pennebaker, W., Mitchell, J.: JPEG: Still Image Data Compression Standard. Van Nostrand-Reinhold, New York (1992)
[53] Pătraşcu, M., Succincter, 305-313 (2008)
[54] Pătraşcu, M.: A lower bound for succinct rank queries. CoRR (2009). arXiv:0907.1103v1 [cs.DS]
[55] Raman, R., Raman, V., Rao, S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), 43 (2007) · Zbl 1446.68046 · doi:10.1145/1290672.1290680
[56] Russo, L., Navarro, G., Oliveira, A., Morales, P.: Approximate string matching with compressed indexes. Algorithms 2(3), 1105-1136 (2009) · Zbl 1461.68271 · doi:10.3390/a2031105
[57] Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. J. Algorithms 48(2), 294-313 (2003) · Zbl 1100.68563 · doi:10.1016/S0196-6774(03)00087-7
[58] Sadakane, K.; Grossi, R., Squeezing succinct data structures into entropy bounds, 1230-1239 (2006) · Zbl 1192.68188
[59] Said, A., Efficient alphabet partitioning algorithms for low-complexity entropy coding, 193-202 (2005)
[60] Tarjan, R.E., van Leeuwen, J.: Worst-case analysis of set union algorithms. J. ACM 31(2), 245-281 (1984) · Zbl 0632.68043 · doi:10.1145/62.2160
[61] Witten, I., Moffat, A., Bell, T.: Managing Gigabytes, 2nd edn. Morgan Kaufmann, San Mateo (1999) · Zbl 0821.68051
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.