×

Parallel lightweight wavelet tree, suffix array and FM-index construction. (English) Zbl 1407.68111

Summary: We present parallel lightweight algorithms to construct wavelet trees, rank and select structures, and suffix arrays in a shared-memory setting. The work and depth of our first parallel wavelet tree algorithm match those of the best existing parallel algorithm while requiring asymptotically less memory and our second algorithm achieves the same asymptotic bounds for small alphabet sizes. Our experiments show that they are both faster and more memory-efficient than existing parallel algorithms. We also present an experimental evaluation of the parallel construction of rank and select structures, which are used in wavelet trees. Next, we design the first parallel suffix array algorithm based on induced copying. Our induced copying requires linear work and polylogarithmic depth for constant alphabets sizes. When combined with a parallel prefix doubling algorithm, it is more efficient in practice both in terms of running time and memory usage compared to existing parallel implementations. As an application, we combine our algorithms to build the FM-index in parallel.

MSC:

68P05 Data structures
68P10 Searching and sorting
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)

Software:

CMU Benchmarks
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Babenko, M. A.; Gawrychowski, P.; Kociumaka, T.; Starikovskaya, T. A., Wavelet trees meet suffix trees, (Proc. of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms (2015), SIAM), 572-591 · Zbl 1372.68070
[2] Bingmann, T.; Eberle, A.; Sanders, P., Engineering parallel string sorting, Algorithmica, 1-52 (2014)
[3] Buro, M., On the maximum length of Huffman codes, Inf. Process. Lett., 45, 219-223 (1993) · Zbl 0769.94007
[4] Clark, D., Compact Pat Trees (1996), University of Waterloo, PhD thesis
[5] Deo, M.; Keely, S., Parallel suffix array and least common prefix for the GPU, (SIGPLAN Notices, vol. 48 (2013), ACM), 197-206
[6] Edwards, J. A.; Vishkin, U., Parallel algorithms for Burrows-Wheeler compression and decompression, Theor. Comput. Sci., 525, 10-22 (2014) · Zbl 1358.68095
[7] Ferragina, P.; Manzini, G., Opportunistic data structures with applications, (Proc. of the 41st Annual Symposium on Foundations of Computer Science (2000), IEEE), 390-398
[8] Ferres, L.; Fuentes-Sepúlveda, J.; He, M.; Zeh, N., Parallel construction of succinct trees, (Proc. of the International Symposium on Experimental Algorithms (2015), Springer), 3-14
[9] Flick, P.; Aluru, S., Parallel distributed memory construction of suffix and longest common prefix arrays, (Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis (2015), ACM), 16
[10] Fuentes-Sepúlveda, J.; Elejalde, E.; Ferres, L.; Seco, D., Efficient wavelet tree construction and querying for multicore architectures, (Proc. of the International Symposium on Experimental Algorithms (2014)), 150-161
[11] Fuentes-Sepúlveda, J.; Elejalde, E.; Ferres, L.; Seco, D., Parallel construction of wavelet trees on multicore architectures, Knowl. Inf. Syst., 1-24 (2016)
[12] Gog, S.; Petri, M., Optimized succinct data structures for massive data, Softw. Pract. Exp., 44, 11, 1287-1314 (2014)
[13] Gog, S.; Beller, T.; Moffat, A.; Petri, M., From theory to practice: plug and play with succinct data structures, (Proc. of the International Symposium on Experimental Algorithms (2014), Springer), 326-337
[14] Grossi, R.; Gupta, A.; Vitter, J. S., High-order entropy-compressed text indexes, (Proc. of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2003)), 841-850 · Zbl 1092.68584
[15] Hayashi, S.; Taura, K., Parallel and memory-efficient Burrows-Wheeler transform, (Proc. of the International Conference on Big Data (2013), IEEE), 43-50
[16] Homann, R.; Fleer, D.; Giegerich, R.; Rehmsmeier, M., mkESA: enhanced suffix array construction tool, Bioinformatics, 25, 8, 1084-1085 (2009)
[17] Itoh, H.; Tanaka, H., An efficient method for in memory construction of suffix arrays, (Proc. of the String Processing and Information Retrieval Symposium (1999), IEEE), 81-88
[18] JaJa, J., An Introduction to Parallel Algorithms (1992), Addison-Wesley · Zbl 0781.68009
[19] Kärkkäinen, J.; Sanders, P., Simple linear work suffix array construction, (International Colloquium on Automata, Languages, and Programming (2003), Springer), 943-955 · Zbl 1039.68042
[20] Kärkkäinen, J.; Kempa, D.; Puglisi, S. J., Parallel external memory suffix sorting, (Proc. of the Annual Symposium on Combinatorial Pattern Matching (2015), Springer), 329-342 · Zbl 1432.68110
[21] Kulla, F.; Sanders, P., Scalable parallel suffix array construction, Parallel Comput., 33, 9, 605-612 (2007) · Zbl 06872844
[23] Labeit, J.; Shun, J.; Blelloch, G. E., Parallel lightweight wavelet tree, suffix array and FM-index construction, (Data Compression Conference (2016), IEEE), 33-42
[24] Larsson, N. J.; Sadakane, K., Faster suffix sorting, Theor. Comput. Sci., 387, 3, 258-272 (2007) · Zbl 1144.68022
[25] Liu, X.; Pande, P.; Meyerhenke, H.; Bader, D. A., PASQUAL: parallel techniques for next generation genome sequence assembly, IEEE Trans. Parallel Distrib. Syst., 24, 5, 977-986 (2013)
[26] Liu, Y.; Hankeln, T.; Schmidt, B., Parallel and space-efficient construction of Burrows-Wheeler transform and suffix array for big genome data, IEEE/ACM Trans. Comput. Biol. Bioinform., 13, 3, 592-598 (2015)
[27] Mäkinen, V.; Navarro, G., Succinct suffix arrays based on run-length encoding, (Proc. of the Annual Symposium on Combinatorial Pattern Matching (2005), Springer), 45-56 · Zbl 1131.68431
[28] Manber, U.; Myers, G., Suffix arrays: a new method for on-line string searches, SIAM J. Comput., 22, 5, 935-948 (1993) · Zbl 0784.68027
[29] Mori, Y., Libdivsufsort: a lightweight suffix sorting library, version 2.0.2-1
[30] Munro, J. I.; Nekrich, Y.; Vitter, J. S., Fast construction of wavelet trees, Theor. Comput. Sci., 638, 91-97 (2016) · Zbl 1344.68060
[31] Navarro, G., Wavelet trees for all, J. Discret. Algorithms, 25, 2-20 (2014) · Zbl 1284.68217
[32] Navarro, G.; Mäkinen, V., Compressed full-text indexes, ACM Comput. Surv., 39, 1, 2 (2007)
[33] Nong, G.; Zhang, S.; Chan, W. H., Linear suffix array construction by almost pure induced-sorting, (Proc. of the Data Compression Conference (2009), IEEE), 193-202
[34] Ohlebusch, E.; Beller, T.; Abouelhoda, M. I., Computing the Burrows-Wheeler transform of a string and its reverse in parallel, J. Discret. Algorithms, 25, 21-33 (2014) · Zbl 1284.68705
[35] Osipov, V., Parallel suffix array construction for shared memory architectures, (Proc. of the International Symposium on String Processing and Information Retrieval (2012), Springer), 379-384
[36] Puglisi, S. J.; Smyth, W. F.; Turpin, A. H., A taxonomy of suffix array construction algorithms, ACM Comput. Surv., 39, 2, 4 (2007)
[37] Shun, J., Parallel wavelet tree construction, (Proc. of the Data Compression Conference (2015), IEEE), 63-72
[38] Shun, J., Improved parallel construction of wavelet trees and rank/select structures, (Proc. of the Data Compression Conference (2017), IEEE), in press
[39] Shun, J.; Blelloch, G. E.; Fineman, J. T.; Gibbons, P. B.; Kyrola, A.; Simhadri, H. V.; Tangwongsan, K., Brief announcement: the problem based benchmark suite, (Proc. of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (2012), ACM), 68-70
[40] Vigna, S., Broadword implementation of rank/select queries, (International Workshop on Experimental and Efficient Algorithms (2008), Springer), 154-168
[42] Wang, L.; Baxter, S.; Owens, J. D., Fast parallel suffix array on the GPU, (Proc. of the European Conference on Parallel Processing (2015), Springer), 573-587
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.