Coverage theories for metagenomic DNA sequencing based on a generalization of Stevens’ theorem. (English) Zbl 1275.05007

Summary: Metagenomic project design has relied variously upon speculation, semi-empirical and ad hoc heuristic models, and elementary extensions of single-sample Lander-Waterman expectation theory, all of which are demonstrably inadequate. Here, we propose an approach based upon a generalization of Stevens’ Theorem for randomly covering a domain.
We extend this result to account for the presence of multiple species, from which are derived useful probabilities for fully recovering a particular target microbe of interest and for average contig length. These show improved specificities compared to older measures and recommend deeper data generation than the levels chosen by some early studies, supporting the view that poor assemblies were due at least somewhat to insufficient data. We assess predictions empirically by generating roughly 4.5 Gb of sequence from a twelve-member bacterial community, comparing coverage for two particular members, Selenomonas artemidis and Enterococcus faecium, which are the least (\(\sim 3\%\)) and most (\(\sim 12\%\)) abundant species, respectively. Agreement is reasonable, with differences likely attributable to coverage biases.
We show that, in some cases, bias is simple in the sense that a small reduction in read length to simulate less efficient covering brings data and theory into essentially complete accord. Finally, we describe two applications of the theory. One plots coverage probability over the relevant parameter space, constructing essentially a “metagenomic design map” to enable straightforward analysis and design of future projects. The other gives an overview of the data requirements for various types of sequencing milestones, including a desired number of contact reads and contig length, for detection of a rare viral species.
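
For orientation, the classical single-domain result that the summary builds on can be sketched numerically. Stevens' theorem gives the probability that \(n\) arcs, each a fraction \(a\) of the circumference and placed uniformly at random, cover a circle completely: \(\sum_{k=0}^{\lfloor 1/a \rfloor} (-1)^k \binom{n}{k} (1-ka)^{n-1}\). The sketch below implements only this classical formula; the paper's multi-species generalization is not stated in the summary and is not reproduced here. The function `target_cover_probability` is a hypothetical illustration that assumes each read lands on the target genome independently with probability equal to its abundance fraction, and that the genome is circular; the function names and parameters are illustrative, not taken from the paper.

```python
from fractions import Fraction
from math import comb


def stevens_cover_probability(n, a):
    """Classical Stevens' theorem: probability that n random arcs,
    each covering a fraction a of a circle, cover it completely."""
    a = Fraction(a)
    if n <= 0:
        return Fraction(0)
    if a >= 1:
        return Fraction(1)
    total = Fraction(0)
    k = 0
    # Alternating sum over k while the factor (1 - k*a) stays positive.
    while k <= n and k * a < 1:
        total += (-1) ** k * comb(n, k) * (1 - k * a) ** (n - 1)
        k += 1
    return total


def target_cover_probability(total_reads, abundance, a):
    """Hypothetical mixture (an assumption, not the paper's formula):
    the number of reads hitting the target is Binomial(total_reads,
    abundance), and each such count is passed to Stevens' theorem."""
    p = Fraction(abundance)
    result = Fraction(0)
    for n in range(total_reads + 1):
        weight = comb(total_reads, n) * p ** n * (1 - p) ** (total_reads - n)
        result += weight * stevens_cover_probability(n, a)
    return result


if __name__ == "__main__":
    # Three half-circle arcs cover the circle with probability 1/4.
    print(stevens_cover_probability(3, Fraction(1, 2)))
    # Coverage probability grows with the number of arcs.
    print(float(stevens_cover_probability(40, Fraction(1, 5))))
```

Exact rational arithmetic is used because the alternating sum loses precision in floating point once \(n\) is large; at realistic read counts an approximation or a log-domain evaluation would be needed instead.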

MSC:

05A10 Factorials, binomial coefficients, combinatorial functions
60D05 Geometric probability and stochastic geometry
62K05 Optimal statistical designs
92B99 Mathematical biology in general

Software:

BWA; Velvet

References:

[1] Ajay SS, Parker SCJ, Abaan HO, Fuentes-Fajardo KV, Margulies EH (2011) Accurate and comprehensive sequencing of personal genomes. Genome Res 21(9):1498-1505 · doi:10.1101/gr.123638.111
[2] Allen EE, Banfield JF (2005) Community genomics in microbial ecology and evolution. Nat Rev Microbiol 3(6):489-498 · doi:10.1038/nrmicro1157
[3] Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, Chan AM, Haynes M, Kelley S, Liu H, Mahaffy JM, Mueller JE, Nulton J, Olson R, Parsons R, Rayhawk S, Suttle CA, Rohwer F (2006) The marine viromes of four oceanic regions. PLoS Biol 4(11), article no. e368
[4] Béjà O, Aravind L, Koonin EV, Suzuki MT, Hadd A, Nguyen LP, Jovanovich SB, Gates CM, Feldman RA, Spudich JL, Spudich EN, DeLong EF (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289(5486):1902-1906 · doi:10.1126/science.289.5486.1902
[5] Beyer WH (1984) CRC standard mathematical tables. CRC Press, Boca Raton
[6] Bouck J, Miller W, Gorrell JH, Muzny D, Gibbs RA (1998) Analysis of the quality and utility of random shotgun sequencing at low redundancies. Genome Res 8(10):1074-1084
[7] Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, Mead D, Azam F, Rohwer F (2002) Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci 99(22):14250-14255 · doi:10.1073/pnas.202488399
[8] Breitbart M, Hewson I, Felts B, Mahaffy JM, Nulton J, Salamon P, Rohwer F (2003) Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol 185(20):6220-6223 · doi:10.1128/JB.185.20.6220-6223.2003
[9] Chen K, Pachter L (2005) Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput Biol 1(2):106-112 · doi:10.1371/journal.pcbi.0010024
[10] Clarke L, Carbon J (1976) A colony bank containing synthetic Col E1 hybrid plasmids representative of the entire E. coli genome. Cell 9(1):91-99 · doi:10.1016/0092-8674(76)90055-6
[11] Culley AI, Lang AS, Suttle CA (2006) Metagenomic analysis of coastal RNA virus communities. Science 312(5781):1795-1798 · doi:10.1126/science.1127404
[12] DeLong EF (2005) Microbial community genomics in the ocean. Nat Rev Microbiol 3(6):459-469 · doi:10.1038/nrmicro1158
[13] Dutilh BE, Huynen MA, Strous M (2009) Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly. Bioinformatics 25(21):2878-2881 · doi:10.1093/bioinformatics/btp377
[14] Eisen JA (2007) Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biol 5(3), article no. e82
[15] Feller W (1968) An introduction to probability theory and its applications. Wiley, New York · Zbl 0155.23101
[16] Fisher RA (1940) On the similarity of the distributions found for the test of significance in harmonic analysis and in Stevens’ problem in geometrical probability. Ann Eugen 10:14-17 · doi:10.1111/j.1469-1809.1940.tb02233.x
[17] Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, McKenney K, Sutton G, Fitzhugh W, Fields C, Gocayne JD, Scott J, Shirley R, Liu LI, Glodek A, Kelley JM, Weidman JF, Phillips CA, Spriggs T, Hedblom E, Cotton MD, Utterback TR, Hanna MC, Nguyen DT, Saudek DM, Brandon RC, Fine LD, Fritchman JL, Fuhrmann JL, Geoghagen NSM, Gnehm CL, McDonald LA, Small KV, Fraser CM, Smith HO, Venter JC (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269(5223):496-512 · doi:10.1126/science.7542800
[18] Gill SR, Pop M, DeBoy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE (2006) Metagenomic analysis of the human distal gut microbiome. Science 312(5778):1355-1359 · doi:10.1126/science.1124234
[19] Green ED (2001) Strategies for the systematic sequencing of complex genomes. Nat Rev Genet 2(8):573-583 · doi:10.1038/35084503
[20] Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM (1998) Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol 5(10):R245-R249 · doi:10.1016/S1074-5521(98)90108-9
[21] Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S, Frazer KA (2009) Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 10, article no. R32
[22] Hess M, Sczyrba A, Egan R, Kim TW, Chokhawala H, Schroth G, Luo S, Clark DS, Chen F, Zhang T, Mackie RI, Pennacchio LA, Tringe SG, Visel A, Woyke T, Wang Z, Rubin EM (2011) Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331(6016):463-467 · doi:10.1126/science.1200387
[23] Hooper SD, Dalevi D, Pati A, Mavromatis K, Ivanova NN, Kyrpides NC (2009) Estimating DNA coverage and abundance in metagenomes using a gamma approximation. Bioinformatics 26(3):295-301 · doi:10.1093/bioinformatics/btp687
[24] Kowalchuk GA, Speksnijder AGCL, Zhang K, Goodman RM, van Veen JA (2007) Finding the needles in the metagenome haystack. Microb Ecol 53(3):475-485 · doi:10.1007/s00248-006-9201-2
[25] Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P (2008) A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev 72(4):557-578 · doi:10.1128/MMBR.00009-08
[26] Lander ES, Waterman MS (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3):231-239 · doi:10.1016/0888-7543(88)90007-9
[27] Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754-1760 · doi:10.1093/bioinformatics/btp324
[28] Liles MR, Manske BF, Bintrim SB, Handelsman J, Goodman RM (2003) A census of rRNA genes and linked genomic sequences within a soil metagenomic library. Appl Environ Microbiol 69(5):2684-2691 · doi:10.1128/AEM.69.5.2684-2691.2003
[29] Martín HG, Ivanova N, Kunin V, Warnecke F, Barry KW, McHardy AC, Yeates C, He S, Salamov AA, Szeto E, Dalin E, Putnam NH, Shapiro HJ, Pangilinan JL, Rigoutsos I, Kyrpides NC, Blackall LL, McMahon KD, Hugenholtz P (2006) Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol 24(10):1263-1269 · doi:10.1038/nbt1247
[30] Nicholls H (2007) Sorcerer II: the search for microbial diversity roils the waters. PLoS Biol 5(3), article no. e74
[31] Port E, Sun F, Martin D, Waterman MS (1995) Genomic mapping by end-characterized random clones: a mathematical analysis. Genomics 26(1):84-100 · doi:10.1016/0888-7543(95)80086-2
[32] Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, Mende DR, Li J, Xu J, Li S, Li D, Cao J, Wang B, Liang H, Zheng H, Xie Y, Tap J, Lepage P, Bertalan M, Batto JM, Hansen T, Paslier DL, Linneberg A, Nielsen HB, Pelletier E, Renault P, Sicheritz-Ponten T, Turner K, Zhu H, Yu C, Li S, Jian M, Zhou Y, Li Y, Zhang X, Li S, Qin N, Yang H, Wang J, Brunak S, Doré J, Guarner F, Kristiansen K, Pedersen O, Parkhill J, Weissenbach J, Bork P, Ehrlich SD, Wang J (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464(7285):59-65 · doi:10.1038/nature08821
[33] Riesenfeld CS, Schloss PD, Handelsman J (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38:525-552 · doi:10.1146/annurev.genet.38.072902.091216
[34] Roach JC (1995) Random subcloning. Genome Res 5(5):464-473 · doi:10.1101/gr.5.5.464
[35] Roach JC, Boysen C, Wang K, Hood L (1995) Pairwise end sequencing: a unified approach to genomic mapping and sequencing. Genomics 26(2):345-353 · doi:10.1016/0888-7543(95)80219-C
[36] Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers YH, Falcón LI, Souza V, Bonilla-Rosso G, Eguiarte LE, Karl DM, Sathyendranath S, Platt T, Bermingham E, Gallardo V, Tamayo-Castillo G, Ferrari MR, Strausberg RL, Nealson K, Friedman R, Frazier M, Venter JC (2007) The Sorcerer II global ocean sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5(3), article no. e77
[37] Schbath S (1997) Coverage processes in physical mapping by anchoring random clones. J Comput Biol 4(1):61-82 · doi:10.1089/cmb.1997.4.61
[38] Schlüter A, Bekel T, Diaz NN, Dondrup M, Eichenlaub R, Gartemann KH, Krahn I, Krause L, Krömeke H, Kruse O, Mussgnug JH, Neuweger H, Niehaus K, Pühler A, Runte KJ, Szczepanowski R, Tauch A, Tilker A, Viehöver P, Goesmann A (2008) The metagenome of a biogas-producing microbial community of a production-scale biogas plant fermenter analysed by the 454-pyrosequencing technology. J Biotechnol 136(1-2):77-90
[39] Solomon H (1978) Geometric probability. Society for Industrial and Applied Mathematics, Philadelphia · Zbl 0382.60016 · doi:10.1137/1.9781611970418
[40] Stanhope SA (2010) Occupancy modeling, maximum contig size probabilities and designing metagenomic experiments. PLoS ONE 5(7), article no. e11652
[41] Stevens WL (1939) Solution to a geometrical problem in probability. Ann Eugen 9:315-320 · Zbl 0023.05603 · doi:10.1111/j.1469-1809.1939.tb02216.x
[42] Thousand Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467(7319):1061-1073 · doi:10.1038/nature09534
[43] Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM (2005) Comparative metagenomics of microbial communities. Science 308(5721):554-557 · doi:10.1126/science.1107851
[44] Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428(6978):37-43 · doi:10.1038/nature02340
[45] Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO (2004) Environmental genome shotgun sequencing of the Sargasso sea. Science 304(5667):66-74 · doi:10.1126/science.1093857
[46] von Mering C, Hugenholtz P, Raes J, Tringe SG, Doerks T, Jensen LJ, Ward N, Bork P (2007) Quantitative phylogenetic assessment of microbial communities in diverse environments. Science 315(5815):1126-1130 · doi:10.1126/science.1133420
[47] Vos M, Quince C, Pijl AS, DeHollander M, Kowalchuk GA (2011) A comparison of rpoB and 16S rRNA as markers in pyrosequencing studies of bacterial diversity. PLoS ONE 7(2), article no. e30600
[48] Wendl MC (2006a) A general coverage theory for shotgun DNA sequencing. J Comput Biol 13(6):1177-1196 · doi:10.1089/cmb.2006.13.1177
[49] Wendl MC (2006b) Occupancy modeling of coverage distribution for whole genome shotgun DNA sequencing. Bull Math Biol 68(1):179-196 · Zbl 1334.92319 · doi:10.1007/s11538-005-9021-4
[50] Wendl MC (2008) Random covering of multiple one-dimensional domains with an application to DNA sequencing. SIAM J Appl Math 68(3):890-905 · Zbl 1149.05301 · doi:10.1137/06065979X
[51] Wendl MC, Barbazuk WB (2005) Extension of Lander-Waterman theory for sequencing filtered DNA libraries. BMC Bioinform 6, article no. 245
[52] Wendl MC, Waterston RH (2002) Generalized gap model for bacterial artificial chromosome clone fingerprint mapping and shotgun sequencing. Genome Res 12(12):1943-1949 · doi:10.1101/gr.655102
[53] Wendl MC, Wilson RK (2008) Aspects of coverage in medical DNA sequencing. BMC Bioinform 9, article no. 239
[54] Wendl MC, Wilson RK (2009a) Statistical aspects of discerning indel-type structural variation via DNA sequence alignment. BMC Genom 10, article no. 359
[55] Wendl MC, Wilson RK (2009b) The theory of discovering rare variants via DNA sequencing. BMC Genom 10, article no. 485
[56] Wendl MC, Marra MA, Hillier LW, Chinwalla AT, Wilson RK, Waterston RH (2001) Theories and applications for sequencing randomly selected clones. Genome Res 11(2):274-280 · doi:10.1101/gr.GR-1339R
[57] Wooley JC, Godzik A, Friedberg I (2010) A primer on metagenomics. PLoS Comput Biol 6(2), article no. e1000667
[58] Xia LC, Cram JA, Chen T, Fuhrman JA, Sun F (2011) Accurate genome relative abundance estimation based on shotgun metagenomic reads. PLoS ONE 6(12), article no. e27992
[59] Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18(5):821-829
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.