×

A tale of optimizing the space taken by de Bruijn graphs. (English) Zbl 07495161

De Mol, Liesbeth (ed.) et al., Connecting with computability. 17th conference on computability in Europe, CiE 2021, virtual event, Ghent, Belgium, July 5–9, 2021. Proceedings. Cham: Springer. Lect. Notes Comput. Sci. 12813, 120-134 (2021).
Summary: In the last decade in bioinformatics, many computational works have studied a graph data structure used to represent genomic data, the de Bruijn graph. It is closely tied to the problem of genome assembly, i.e. the reconstruction of an organism’s chromosomes using a large collection of overlapping short fragments. We start by highlighting this connection, noting that assembling genomes is a computationnally intensive task, and then focus our attention on the reduction of the space taken by de Bruijn graph data structures. This extended abstract is a retrospective centered around my own previous work in this area. It complements a recent review [10] by providing a less technical and more introductory exposition of a selection of concepts.
For the entire collection see [Zbl 1482.68017].

MSC:

68Qxx Theory of computing
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Alipanahi, B., Kuhnle, A., Puglisi, S.J., Salmela, L., Boucher, C.: Succinct Dynamic de Bruijn Graphs. Bioinformatics (2020). doi:10.1093/bioinformatics/btaa546/5848003
[2] Almodaresi, F.; Sarkar, H.; Srivastava, A.; Patro, R., A space and time-efficient index for the compacted colored de Bruijn graph, Bioinformatics, 34, 13, i169-i177 (2018) · doi:10.1093/bioinformatics/bty292
[3] Bankevich, A., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455-477 (2012)
[4] Boucher, C., Bowe, A., Gagie, T., Puglisi, S.J., Sadakane, K.: Variable-Order de Bruijn graphs. In: 2015 Data Compression Conference, pp. 383-392 (2015)
[5] Bowe, A.; Onodera, T.; Sadakane, K.; Shibuya, T.; Raphael, B.; Tang, J., Succinct de Bruijn graphs, Algorithms in Bioinformatics, 225-235 (2012), Heidelberg: Springer, Heidelberg · Zbl 1414.68020 · doi:10.1007/978-3-642-33122-0_18
[6] Břinda, K., Baym, M., Kucherov, G.: Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome. Biol. 22, 96 (2021). doi:10.1186/s13059-021-02297-z
[7] Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Report 124, Digital Systems Research Center, Palo Alto, CA, USA (May 1994)
[8] Cazaux, B., Rivals, E.: Hierarchical overlap graph. Inf. Proc. Lett. 155, 105862 (2020) · Zbl 1478.68222
[9] Chaisson, M.J., Pevzner, P.A.: Short read fragment assembly of bacterial genomes. Genome Res. 18(2), 324-330 (2008)
[10] Chikhi, R., Holub, J., Medvedev, P.: Data structures to represent sets of k-long DNA sequences. arXiv preprint arXiv:1903.12312 (2019)
[11] Chikhi, R.; Limasset, A.; Jackman, S.; Simpson, JT; Medvedev, P.; Sharan, R., On the representation of de Bruijn graphs, Research in Computational Molecular Biology, 35-55 (2014), Cham: Springer, Cham · doi:10.1007/978-3-319-05269-4_4
[12] Chikhi, R.; Limasset, A.; Medvedev, P., Compacting de Bruijn graphs from sequencing data quickly and in low memory, Bioinformatics, 32, 12, i201-i208 (2016) · doi:10.1093/bioinformatics/btw279
[13] Chikhi, R.; Rizk, G., Space-efficient and exact de Bruijn graph representation based on a bloom filter, Algorithms Mol. Biol., 8, 1, 1-9 (2013) · doi:10.1186/1748-7188-8-22
[14] Conway, T.C., Bromage, A.J.: Succinct data structures for assembling large genomes. Bioinformatics 27(4), 479-486 (2011)
[15] Deorowicz, S.; Debudaj-Grabysz, A.; Grabowski, S., Disk-based k-mer counting on a PC, BMC Bioinf., 14, 1, 1-12 (2013) · doi:10.1186/1471-2105-14-160
[16] Eizenga, JM, Pangenome graphs, Ann. Rev. Genomics Hum. Genet., 21, 1, 139-162 (2020) · doi:10.1146/annurev-genom-120219-080406
[17] Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings 41st Annual Symposium on Foundations of Computer Science, pp. 390-398. IEEE (2000)
[18] Holley, G.; Melsted, P., Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs, Genome Biol., 21, 1, 1-20 (2020) · doi:10.1186/s13059-020-02135-8
[19] Holley, G.; Wittler, R.; Stoye, J., Bloom filter trie: an alignment-free and reference-free data structure for pan-genome storage, Algorithms Mol. Biol., 11, 1, 1-9 (2016) · doi:10.1186/s13015-016-0066-8
[20] Holley, G.; Wittler, R.; Stoye, J.; Hach, F.; Sahinalp, SC, Dynamic alignment-free and reference-free read compression, Research in Computational Molecular Biology, 50-65 (2017), Cham: Springer, Cham · Zbl 1382.68059 · doi:10.1007/978-3-319-56970-3_4
[21] Iqbal, Z.; Caccamo, M.; Turner, I.; Flicek, P.; McVean, G., De novo assembly and genotyping of variants using colored de Bruijn graphs, Nature Genet., 44, 2, 226-232 (2012) · doi:10.1038/ng.1028
[22] Karasikov, M.: Indexing and analysing nucleotide archives at petabase-scale. bioRxiv (2020)
[23] Li, D.; Liu, C-M; Luo, R.; Sadakane, K.; Lam, T-W, MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph, Bioinformatics, 31, 10, 1674-1676 (2015) · doi:10.1093/bioinformatics/btv033
[24] Li, R., De novo assembly of human genomes with massively parallel short read sequencing, Genome Res., 20, 2, 265-272 (2010) · doi:10.1101/gr.097261.109
[25] Limasset, A., Rizk, G., Chikhi, R., Peterlongo, P.: Fast and scalable minimal perfect hashing for massive key sets. arXiv preprint arXiv:1702.03154 (2017) · Zbl 1433.68104
[26] Lin, Y., Yuan, J., Kolmogorov, M., Shen, M.W., Chaisson, M., Pevzner, P.A.: Assembly of long error-prone reads using de bruijn graphs. Proc. Natl. Acad. Sci. 113(52), E8396-E8405 (2016)
[27] Manekar, S.C., Sathe, S.R.: A benchmark study of k-mer counting methods for high-throughput sequencing. GigaScience 7(12), 125 (2018)
[28] Marçais, G.; Kingsford, C., A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 6, 764-770 (2011) · doi:10.1093/bioinformatics/btr011
[29] Marchet, C., Iqbal, Z., Gautheret, D., Salson, M., Chikhi, R.: REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics 36(Supplement_1):i177-i185 (2020)
[30] Marchet, C., Kerbiriou, M., Limasset, A.: Blight: Efficient exact associative structure for k-mers. bioRxiv (2020)
[31] Medvedev, P.: The theoretical analysis of sequencing bioinformatic algorithms. in preparation (2020)
[32] Miller, JR; Koren, S.; Sutton, G., Assembly algorithms for next-generation sequencing data, Genomics, 95, 6, 315-327 (2010) · doi:10.1016/j.ygeno.2010.03.001
[33] Muggli, M.D.: Succinct colored de Bruijn graphs. Bioinformatics 33(20), 3181-3187 (2017)
[34] Pandey, P., Bender, M.A., Johnson, R., Patro, R.: A general-purpose counting filter: Making every bit count. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 775-787 (2017)
[35] Patro, R., Duggal, G., Love, M.I., Irizarry, R.A., Kingsford, C.: Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14(4), 417-419 (2017)
[36] Pell, J.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl. Acad. Sci. 109(33), 13272-13277 (2012) · Zbl 1256.68052
[37] Peng, Yu; Leung, HCM; Yiu, SM; Chin, FYL; Berger, B., IDBA - a practical iterative de Bruijn graph de novo assembler, Research in Computational Molecular Biology, 426-440 (2010), Heidelberg: Springer, Heidelberg · doi:10.1007/978-3-642-12683-3_28
[38] Rahman, A., Chikhi, R., Medvedev, P.: Disk Compression of k-mer Sets. In: 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2020) · Zbl 1518.92114
[39] Rizk, G.; Lavenier, D.; Chikhi, R., DSK: k-mer counting with very low memory usage, Bioinformatics, 29, 5, 652-653 (2013) · doi:10.1093/bioinformatics/btt020
[40] Salikhov, K.; Sacomoto, G.; Kucherov, G., Using cascading Bloom filters to improve the memory usage for de Brujin graphs, Algorithms Mol. Biol., 9, 1, 1-10 (2014) · doi:10.1186/1748-7188-9-2
[41] Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117-1123 (2009)
[42] Ye, C., Ma, Z.S., Cannon, C.H., Pop, M., Douglas, W.Y.: Exploiting sparseness in de novo genome assembly. In: BMC bioinformatics, vol. 13, pp. 1-8 (2012) BioMed Central
[43] Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821-829 (2008)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.