zbMATH — the first resource for mathematics

An external-memory algorithm for string graph construction. (English) Zbl 1369.68363
Summary: Some recent results [M. J. Bauer, Lect. Notes Comput. Sci. 7534, 326–337 (2012; Zbl 1370.68338); A. J. Cox, Lect. Notes Comput. Sci. 7534, 214–224 (2012; Zbl 1370.68340); G. Rosone and M. Sciortino, Lect. Notes Comput. Sci. 7921, 353–364 (2013; Zbl 1370.68088)] have introduced external-memory algorithms to compute self-indexes of a set of strings, mainly via computing the Burrows-Wheeler transform of the input strings. The motivations for those results stem from Bioinformatics, where a large number of short strings (called reads) are routinely produced and analyzed. In that field, a fundamental problem is to assemble a genome from a large set of much shorter samples extracted from the unknown genome. The approaches that are currently used to tackle this problem are memory-intensive. This fact does not bode well with the ongoing increase in the availability of genomic data. A data structure that is used in genome assembly is the string graph, where vertices correspond to samples and arcs represent two overlapping samples. In this paper we address an open problem of J. T. Simpson and R. Durbin [“Efficient construction of an assembly string graph using the FM-index”, Bioinform. 26, No. 12, i367–i373 (2010; doi:10.1093/bioinformatics/btq217)]: to design an external-memory algorithm to compute the string graph.

68W32 Algorithms on strings
Full Text: DOI
[1] Abouelhoda, M; Kurtz, S; Ohlebusch, E, Replacing suffix trees with enhanced suffix arrays, J. Discrete Algorithms, 2, 53-86, (2004) · Zbl 1115.92303
[2] Aggarwal, A; Vitter, J, The input/output complexity of sorting and related problems, Commun. ACM, 31, 1116-1127, (1988)
[3] Alizadeh, F; Karp, R; Newberg, L; Weisser, D, Physical mapping of chromosomes: a combinatorial problem in molecular biology, Algorithmica, 13, 52-76, (1995) · Zbl 0831.92012
[4] Alizadeh, F; Karp, R; Weisser, D; Zweig, G, Physical mapping of chromosomes using unique probes, J. Comput. Biol., 2, 159-184, (1995) · Zbl 0870.92005
[5] Bankevich, A; Nurk, S; Antipov, D; etal., Spades: a new genome assembly algorithm and its applications to single-cell sequencing, J. Comput. Biol., 19, 455-477, (2012)
[6] Bauer, M; Cox, A; Rosone, G, Lightweight algorithms for constructing and inverting the BWT of string collections, Theor. Comput. Sci., 483, 134-148, (2013) · Zbl 1292.68176
[7] Bauer, M., Cox, A., Rosone, G., Sciortino, M.: Lightweight LCP construction for next-generation sequencing datasets. In: Algorithms in Bioinformatics, LNCS, vol. 7534, pp. 326-337. Springer, Berlin, Germany (2012) · Zbl 1370.68338
[8] Beerenwinkel, N; Beretta, S; Bonizzoni, P; Dondi, R; Pirola, Y, Covering pairs in directed acyclic graphs, Comput. J., 58, 1673-1686, (2015) · Zbl 1407.68203
[9] Benson, D; Clark, K; Karsch-Mizrachi, I; etal., Genbank, Nucleic Acids Research, 42, d32-d37, (2014)
[10] Beretta, S; Bonizzoni, P; Della Vedova, G; Pirola, Y; Rizzi, R, Modeling alternative splicing variants from RNA-seq data with isoform graphs, J. Comput Biol, 16, 16-40, (2014)
[11] Blum, A; Jiang, T; Li, M; Tromp, J; Yannakakis, M, Linear approximation of shortest superstrings, J. ACM, 41, 630-647, (1994) · Zbl 0812.68075
[12] Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M., Rizzi, R.: Constructing string graphs in external memory. In: Algorithms in Bioinformatics, LNCS, vol. 8701, pp. 311-325. Springer, Berlin, Germany (2014)
[13] Bonizzoni, P; Della Vedova, G; Pirola, Y; Previtali, M; Rizzi, R, LSG: an external-memory tool to compute string graphs for NGS data assembly, J. Comput. Biol., 23, 137-149, (2016)
[14] Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report, Digital Systems Research Center (1994)
[15] Chen, Y., Dong, G., Han, J., Wah, B., Wang, J.: Multi-dimensional regression analysis of time-series data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 323-334. VLDB Endowment (2002) · Zbl 1292.68176
[16] Chikhi, R; Rizk, G, Space-efficient and exact de Bruijn graph representation based on a Bloom filter, Algorithms Mol. Biol., 8, 22, (2013)
[17] Cox, A; Bauer, M; Jakobi, T; Rosone, G, Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform, Bioinformatics, 28, 1415-1419, (2012)
[18] Cox, A., Jakobi, T., Rosone, G., Schulz-Trieglaff, O.: Comparing DNA sequence collections by direct comparison of compressed text indexes. In: Algorithms in Bioinformatics, LNCS, vol. 7534, pp. 214-224. Springer, Berlin, Germany (2012) · Zbl 1370.68340
[19] Demetrescu, C; Finocchi, I; Ribichini, A, Trading off space for passes in graph streaming problems, ACM Trans. Algorithms, 6, 6, (2009) · Zbl 1300.68072
[20] Diestel, R.: Graph Theory. Graduate Texts in Mathematics, 3rd edn. Springer, Heidelberg (2005) · Zbl 1074.05001
[21] Ferragina, P; Manzini, G, Indexing compressed text, J. ACM, 52, 552-581, (2005) · Zbl 1323.68261
[22] Henzinger, M., Raghavan, P., Rajagopalan, S.: Computing on data streams. In: External Memory Algorithms, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 50, pp. 107-118. AMS, Boston, MA, USA (1999) · Zbl 0947.68052
[23] Lacroix, V., Sammeth, M., Guigo, R., Bergeron, A.: Exact transcriptome reconstruction from short sequence reads. In: Algorithms in Bioinformatics, LNCS, vol. 5251, pp. 50-63. Springer, Berlin, Heidelberg (2008)
[24] Lam, T., Li, R., Tam, A., Wong, S., Wu, E., Yiu, S.: High throughput short read alignment via bi-directional BWT. In: Bioinformatics and Biomedicine (BIBM ’09), pp. 31-36. IEEE Computer Society, Washington, DC, USA (2009)
[25] McKenna, A; Hanna, M; Banks, E; etal., The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data, Genome Res., 20, 1297-1303, (2010)
[26] Myers, E, The fragment assembly string graph, Bioinformatics, 21, ii79-ii85, (2005)
[27] Pell, J; Hintze, A; Canino-Koning, R; Howe, A; Tiedje, J; Brown, C, Scaling metagenome sequence assembly with probabilistic de Bruijn graphs, PNAS, 109, 13272-13277, (2012) · Zbl 1256.68052
[28] Peng, Y; Leung, HC; Yiu, S-M; Chin, F, IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth, Bioinformatics, 28, 1420-1428, (2012)
[29] Rosone, G., Sciortino, M.: The Burrows-Wheeler transform between data compression and combinatorics on words. In: The Nature of Computation. Logic, Algorithms, Applications, LNCS, vol. 7921, pp. 353-364. Springer, Berlin, Heidelberg (2013) · Zbl 1370.68088
[30] Sedgewick, R.: Algorithms in Java. Addison-Wesley Professional, Reading (2002)
[31] Shi, F.: Suffix arrays for multiple strings: a method for on-line multiple string searches. In: Concurrency and Parallelism, Programming, Networking, and Security, LNCS, vol. 1179, pp. 11-22. Springer Berlin, Heidelberg (1996) · Zbl 1323.68261
[32] Simpson, J; Durbin, R, Efficient construction of an assembly string graph using the FM-index, Bioinformatics, 26, i367-i373, (2010)
[33] Simpson, J; Durbin, R, Efficient de novo assembly of large genomes using compressed data structures, Genome Res., 22, 549-556, (2012)
[34] Simpson, J; Wong, K; Jackman, S; etal., Abyss: a parallel assembler for short Read sequence data, Genome Res., 19, 1117-1123, (2009)
[35] Valiant, L.: General purpose parallel architectures. In: Handbook of Theoretical Computer Science, vol. A, pp. 943-973. MIT Press, Cambridge, MA, USA (1990) · Zbl 0900.68257
[36] Vitter, J, External memory algorithms and data structures: dealing with massive data, ACM Comput. Surv., 33, 209-271, (2001)
[37] Vitter, J; Shriver, E, Algorithms for parallel memory, I: two-level memories, Algorithmica, 12, 110-147, (1994) · Zbl 0917.68085
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.