zbMATH — the first resource for mathematics

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. (English) Zbl 1256.68052
Summary: Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the \(k\)-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per \(k\)-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
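
To make the idea in the summary concrete, the following is a minimal sketch of a Bloom filter used as an implicit de Bruijn graph: the k-mers are the only thing stored, and edges are recovered by querying the filter for the eight possible one-base extensions of a k-mer. This is an illustration under stated assumptions, not the authors' implementation. The class and function names (`KmerBloomFilter`, `neighbors`, `expected_false_positive_rate`), the double-hashing scheme built on BLAKE2b, and the choice to ignore reverse-complement strands are all assumptions made for the example. The last function evaluates the standard Bloom filter false-positive approximation \((1 - e^{-h n / m})^h\) for \(h\) hash functions and \(m/n\) bits per \(k\)-mer.

```python
import hashlib
import math


class KmerBloomFilter:
    """Hypothetical Bloom filter over k-mers (illustration only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Double hashing: derive num_hashes bit positions from one digest.
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, kmer):
        # May report false positives, never false negatives.
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(kmer)
        )


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(bloom, kmer):
    """K-mers that overlap kmer by k-1 bases and test positive in the filter.

    Assumption: a single strand only; reverse complements are ignored.
    """
    found = []
    for base in "ACGT":
        for cand in (kmer[1:] + base, base + kmer[:-1]):
            if cand != kmer and cand in bloom and cand not in found:
                found.append(cand)
    return found


def expected_false_positive_rate(bits_per_kmer, num_hashes):
    """Standard approximation (1 - exp(-h / (m/n)))^h."""
    return (1.0 - math.exp(-num_hashes / bits_per_kmer)) ** num_hashes


if __name__ == "__main__":
    bloom = KmerBloomFilter(num_bits=1024, num_hashes=2)
    for km in kmers("ACGTTGCA", 4):
        bloom.add(km)
    print(neighbors(bloom, "CGTT"))
    print(expected_false_positive_rate(4, 2))
```

Because the filter can return false positives, `neighbors` may report k-mers that were never inserted; this is the sense in which the graph is stored "inexactly". Under the approximation above, 4 bits per k-mer with 2 hash functions gives a false-positive rate of roughly 0.15.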

MSC:
68P05 Data structures
05C80 Random graphs (graph-theoretic aspects)
05C90 Applications of graph theory
92-08 Computational methods for problems pertaining to biology
92D10 Genetics and epigenetics
Software:
ALLPATHS; GAGE