
BayesCAT: Bayesian co-estimation of alignment and tree. (English) Zbl 1415.62135

Summary: Traditionally, phylogeny and sequence alignment are estimated separately: a multiple sequence alignment is estimated first, and a phylogeny is then inferred from that alignment. However, this approach ignores uncertainty in the alignment, possibly resulting in overstated certainty in phylogeny estimates. We develop a joint model for co-estimating phylogeny and sequence alignment which improves estimates from the traditional approach by accounting for uncertainty in the alignment in phylogenetic inferences. Our insertion and deletion (indel) model allows arbitrary-length overlapping indel events and a general distribution for indel fragment size. We employ a Bayesian approach using MCMC to estimate the joint posterior distribution of a phylogenetic tree and a multiple sequence alignment. The state space of our Markov chain consists of a tree and a complete history of indel events mapped onto the tree, whereas previous alternative approaches use a tree and an alignment. A large state space containing a complete history of indel events makes our MCMC approach more challenging, but it enables us to infer more information about the indel process. The performances of this joint method and traditional sequential methods are compared using simulated data as well as real data. Software named BayesCAT (Bayesian Co-estimation of Alignment and Tree) is available at https://github.com/heejungshim/BayesCAT.

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62F15 Bayesian inference
92D10 Genetics and epigenetics
