Compressed directed acyclic word graph with application in local alignment. (English) Zbl 1275.68063

Summary: The suffix tree, suffix array, and directed acyclic word graph (DAWG) are data structures for indexing a text. Although they enable efficient pattern matching, they require \(O(n\log n)\) bits, which makes them impractical for indexing long texts such as the human genome. Recently, the development of compressed data structures has made it possible to simulate the suffix tree and suffix array using \(O(n)\) bits. However, there is still no \(O(n)\)-bit data structure for the DAWG with full functionality.
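For orientation, the following is a minimal sketch of a plain, pointer-based DAWG built with the classical online construction. It illustrates the uncompressed \(O(n\log n)\)-bit structure that the paragraph above describes, not the compressed representation introduced in the paper. The names `State`, `build_dawg`, and `contains` are hypothetical and chosen for this illustration only.

```python
class State:
    """One DAWG node: outgoing edges, suffix link, longest string length."""
    __slots__ = ("next", "link", "len")

    def __init__(self, length=0, link=-1):
        self.next = {}      # character -> index of target state
        self.link = link    # suffix link (-1 for the initial state)
        self.len = length   # length of the longest string reaching this state


def build_dawg(text):
    """Build the DAWG (suffix automaton) of `text`, one character at a time."""
    states = [State()]      # state 0 is the initial state
    last = 0                # state that recognizes the whole prefix read so far
    for ch in text:
        cur = len(states)
        states.append(State(states[last].len + 1))
        p = last
        # Add the new edge along the suffix-link chain until one already exists.
        while p != -1 and ch not in states[p].next:
            states[p].next[ch] = cur
            p = states[p].link
        if p == -1:
            states[cur].link = 0
        else:
            q = states[p].next[ch]
            if states[p].len + 1 == states[q].len:
                states[cur].link = q
            else:
                # Split q: the clone takes over the shorter strings.
                clone = len(states)
                copy = State(states[p].len + 1, states[q].link)
                copy.next = dict(states[q].next)
                states.append(copy)
                while p != -1 and states[p].next.get(ch) == q:
                    states[p].next[ch] = clone
                    p = states[p].link
                states[q].link = clone
                states[cur].link = clone
        last = cur
    return states


def contains(states, pattern):
    """Return True if `pattern` is a substring of the indexed text."""
    s = 0
    for ch in pattern:
        s = states[s].next.get(ch)
        if s is None:
            return False
    return True
```

A pattern of length \(m\) is matched by following \(m\) edges from the initial state, for example `contains(build_dawg("abcbc"), "cbc")`. Each state stores explicit integer pointers, which is why this plain form needs \(O(n\log n)\) bits.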
This work introduces an \(n(H_{k}(\overline{S})+ 2 H_{0}^{*}(\mathcal{T}_{\overline{S}}))+o(n)\)-bit compressed data structure for simulating the DAWG, where \(H_{k}(\overline{S})\) and \(H_{0}^{*}(\mathcal{T}_{\overline{S}})\) are the empirical entropies of the reversed sequence and of the reversed suffix tree topology, respectively. We also propose an application of the DAWG that improves the time complexity of the local alignment problem.
In this application, the previously proposed solutions using the BWT (a version of the compressed suffix array) run in \(O(n^{2} m)\) worst-case time and \(O(n^{0.628} m)\) average-case time, where \(n\) and \(m\) are the lengths of the database and the query, respectively. Using the compressed DAWG proposed in this paper, the problem can be solved in \(O(nm)\) worst-case time with the same average-case time.
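To make the local alignment problem concrete, here is a minimal sketch of the classical Smith-Waterman dynamic program, which computes the best local alignment score in \(O(nm)\) time without any index. It is a baseline for reference only and is not the DAWG-based algorithm of the paper. The function name `local_alignment_score`, the linear gap penalty, and the default scoring values are assumptions made for this illustration.

```python
def local_alignment_score(db, query, match=1, mismatch=-3, gap=-2):
    """Best local alignment score between `db` and `query`.

    Assumed scoring scheme (illustrative only): `match` for equal characters,
    `mismatch` for unequal ones, and a linear penalty `gap` per inserted or
    deleted character.
    """
    prev = [0] * (len(query) + 1)   # scores for the previous row of the table
    best = 0
    for a in db:
        cur = [0]                   # column 0: empty prefix of the query
        for j, b in enumerate(query, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            up = prev[j] + gap      # character of db aligned to a gap
            left = cur[j - 1] + gap  # character of query aligned to a gap
            score = max(0, diag, up, left)  # 0 restarts the alignment
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best
```

Only two rows of the table are kept, so the sketch uses \(O(m)\) working space while still filling all \(nm\) cells. Index-based methods such as those summarized above aim to avoid visiting cells that cannot contribute to a positive score.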

MSC:

68P05 Data structures
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
68R10 Graph theory (including graph drawing) in computer science

Software:

BWA; REPuter

References:

[1] Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403-410 (1990)
[2] Appel, A., Jacobson, G.: The world’s fastest Scrabble program. Commun. ACM 31(5), 572-578 (1988)
[3] Baeza-Yates, R.; Gonnet, G., A fast algorithm on average for all-against-all sequence matching, 16-23 (1999)
[4] Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M., Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40, 31-55 (1985) · Zbl 0574.68070
[5] Chim, H.; Deng, X., A new suffix tree similarity measure for document clustering, 121-130 (2007)
[6] Crochemore, M.; Vérin, R., On compact directed acyclic word graphs, No. 1261, 192-211 (1997)
[7] Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552-581 (2005) · Zbl 1323.68261
[8] Golynski, A.; Munro, J. I.; Rao, S. S., Rank/select operations on large alphabets: a tool for text indexing, 368-373 (2006) · Zbl 1192.68800
[9] Grossi, R.; Gupta, A.; Vitter, J., High-order entropy-compressed text indexes, 841-850 (2003) · Zbl 1092.68584
[10] Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997) · Zbl 0934.68103
[11] Huang, J.; Powers, D., Suffix tree based approach for Chinese information retrieval, 393-397 (2008)
[12] Inenaga, S.; Takeda, M., Sparse compact directed acyclic word graphs, 197-211 (2006)
[13] Jansson, J., Sadakane, K., Sung, W.: Ultra-succinct representation of ordered trees with applications. J. Comput. Syst. Sci. 78(2), 619-631 (2012) · Zbl 1242.68083
[14] Kurtz, S., Choudhuri, J.V., Ohlebusch, E., Schleiermacher, C., Stoye, J., Giegerich, R.: REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 29, 4633-4642 (2001)
[15] Lam, T.W., Sung, W.K., Tam, S.L., Wong, C.K., Yiu, S.M.: Compressed indexing and local alignment of DNA. Bioinformatics 24(6), 791-797 (2008)
[16] Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)
[17] Larsson, N., Extended application of suffix trees to data compression, 190-199 (1996)
[18] Li, H., Durbin, R.: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26(5), 589-595 (2010)
[19] Maaß, M.: Average-case analysis of approximate trie search. Algorithmica 46(3), 469-491 (2006) · Zbl 1106.68030
[20] Mäkinen, V.; Navarro, G., Implicit compression boosting with applications to self-indexing, 229-241 (2007)
[21] Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 935-948 (1993) · Zbl 0784.68027
[22] Meek, C.; Patel, J.; Kasetty, S., Oasis: an online and accurate technique for local-alignment searches on biological sequences, 910-921 (2003)
[23] Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33, 31-88 (2001)
[24] Navarro, G., Baeza-Yates, R.: A hybrid indexing method for approximate string matching. J. Discrete Algorithms 1, 205-239 (2000)
[25] Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41, 589-607 (2007) · Zbl 1148.68015
[26] Senft, M., Suffix tree based data compression, 350-359 (2005) · Zbl 1117.68340
[27] Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197 (1981)
[28] Sung, W.-K., Indexed approximate string matching, 408-410 (2008)
[29] Weiner, P., Linear pattern matching algorithms, 1-11 (1973)
[30] Wong, S.; Sung, W.; Wong, L., CPS-tree: a compact partitioned suffix tree for disk-based indexing on large genome sequences, 1350-1354 (2007)