Consistency of a phylogenetic tree maximum likelihood estimator. (English) Zbl 1311.62185

Summary: Phylogenetic trees represent the order and extent of genetic divergence of a fixed collection of organisms. Order of divergence is represented via the tree structure, and extent of divergence by the branch lengths. Both the tree’s structure and branch lengths are unknown parameters and the tree is estimated using sequence information sampled at a number of genetic sites. Under the model of genetic Brownian motion, we prove that as the number of genetic sites that are sampled becomes large, the maximum likelihood estimator of the tree is consistent. (Our maximum likelihood estimator treats each site as an independent data point, which is different from concatenating the sites.) Existing arguments for consistency rely on the assumption of a finite parameter space or only apply to transition probability matrix-based models, and do not hold here due to the continuous model for branch lengths. The metric space of [L. J. Billera et al., Adv. Appl. Math. 27, No. 4, 733–767 (2001; Zbl 0995.92035)] is central to the proof. We conclude with some comments on the role of parametric methods in tree estimation.
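The per-site likelihood described in the summary can be illustrated with a toy sketch. It assumes a simplified Brownian motion model in which the values observed at the tips for one site are jointly Gaussian with mean zero, and the covariance between two tips equals the branch length they share from the root. The function names, the three-taxon trees, the branch lengths, and the simulated data are all hypothetical and are not taken from the paper. The sketch only compares two fixed candidate trees. It does not maximise over the Billera-Holmes-Vogtmann tree space.

```python
import numpy as np


def site_loglik(x, cov):
    """Gaussian log-likelihood of one site's tip values (zero mean assumed)."""
    k = len(x)
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + quad)


def total_loglik(data, cov):
    """Sites are treated as independent data points, so log-likelihoods add."""
    return sum(site_loglik(row, cov) for row in data)


# Hypothetical trees on taxa (A, B, C); entry (i, j) is the branch length
# shared by tips i and j, measured from the root.
cov_true = np.array([[1.0, 0.5, 0.0],   # ((A,B),C): A and B share 0.5
                     [0.5, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
cov_alt = np.array([[1.0, 0.0, 0.5],    # ((A,C),B): A and C share 0.5
                    [0.0, 1.0, 0.0],
                    [0.5, 0.0, 1.0]])

# Simulate 2000 independent sites from the first tree (assumed data).
rng = np.random.default_rng(0)
data = rng.multivariate_normal(np.zeros(3), cov_true, size=2000)

ll_true = total_loglik(data, cov_true)
ll_alt = total_loglik(data, cov_alt)
print("generating tree preferred:", ll_true > ll_alt)
```

With many sites, the summed log-likelihood favours the generating tree over the alternative, which is the behaviour the consistency result formalises for the full estimator.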

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62F12 Asymptotic properties of parametric estimators
92D15 Problems related to evolution

Citations:

Zbl 0995.92035

References:

[1] Alfaro, M. E.; Zoller, S.; Lutzoni, F., Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence, Mol. Biol. Evol., 20, 2, 255-266 (2003)
[2] Billera, L. J.; Holmes, S. P.; Vogtmann, K., Geometry of the space of phylogenetic trees, Adv. Appl. Math., 27, 4, 733-767 (2001) · Zbl 0995.92035
[3] Blaschko, M. B.; Zaremba, W.; Gretton, A., Taxonomic prediction with tree-structured covariances, (Machine Learning and Knowledge Discovery in Databases (2013), Springer), 304-319
[4] Bouckaert, R.; Heled, J.; Kühnert, D.; Vaughan, T.; Wu, C.-H.; Xie, D.; Suchard, M. A.; Rambaut, A.; Drummond, A. J., BEAST 2: A software platform for Bayesian evolutionary analysis, PLoS Comput. Biol., 10, 4 (2014)
[5] Bravo, H. C.; Wright, S.; Eng, K. H.; Keles, S.; Wahba, G., Estimating tree-structured covariance matrices via mixed-integer programming, J. Mach. Learn. Res., 5, 41 (2009)
[6] Brumfield, R. T.; Liu, L.; Lum, D. E.; Edwards, S. V., Comparison of species tree methods for reconstructing the phylogeny of bearded manakins (Aves: Pipridae, Manacus) from multilocus sequence data, Syst. Biol., 57, 719-731 (2008)
[7] Bryant, D.; Bouckaert, R.; Felsenstein, J.; Rosenberg, N. A.; RoyChoudhury, A., Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis, Mol. Biol. Evol., 29, 1917-1932 (2012)
[8] Chang, J. T., Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency, Math. Biosci., 137, 51-73 (1996) · Zbl 1059.92504
[9] Ronquist, F.; Teslenko, M.; van der Mark, P.; Ayres, D. L.; Darling, A.; Höhna, S.; Larget, B.; Liu, L.; Suchard, M. A.; Huelsenbeck, J. P., MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space, Syst. Biol., 61, 3, 539-542 (2012)
[10] Felsenstein, J., Maximum-likelihood estimation of evolutionary trees from continuous characters, Am. J. Hum. Genet., 25, 5, 471 (1973)
[11] Felsenstein, J., Inferring Phylogenies, Vol. 2 (2004), Sinauer Associates, Sunderland
[12] Ferguson, T. S., An inconsistent maximum likelihood estimate, J. Amer. Statist. Assoc., 77, 380, 831-834 (1982) · Zbl 0507.62022
[13] Kubatko, L. S.; Degnan, J. H., Inconsistency of phylogenetic estimates from concatenated data under coalescence, Syst. Biol., 56, 17-24 (2007)
[14] Lehmann, E., Elements of Large-Sample Theory (1999), Springer, New York · Zbl 0914.62001
[15] Luo, R.; Larget, B., Modeling substitution and indel processes for AFLP marker evolution and phylogenetic inference, Ann. Appl. Stat., 222-248 (2009) · Zbl 1160.62091
[16] Lakner, C.; Van Der Mark, P.; Huelsenbeck, J. P.; Larget, B.; Ronquist, F., Efficiency of Markov Chain Monte Carlo tree proposals in Bayesian phylogenetics, Syst. Biol., 57, 1, 86-103 (2008)
[17] Nye, T. M., Principal components analysis in the space of phylogenetic trees, Ann. Statist., 39, 5, 2716-2739 (2011) · Zbl 1231.62110
[18] Owen, M., Computing geodesic distances in tree space, SIAM J. Discrete Math., 25, 4, 1506-1529 (2011) · Zbl 1237.05045
[19] Owen, M.; Provan, J. S., A fast algorithm for computing geodesic distances in tree space, IEEE/ACM Trans. Comput. Biol. Bioinform., 8, 1, 2-13 (2011)
[20] Redner, R., Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions, Ann. Statist., 9, 1, 225-228 (1981) · Zbl 0453.62021
[21] Rogers, J. S., On the consistency of maximum likelihood estimation of phylogenetic trees from nucleotide sequences, Syst. Biol., 354-357 (1997)
[22] Rosenberg, N. A.; Tao, R., Discordance of species trees with their most likely gene trees: The case of five taxa, Syst. Biol., 57, 131-140 (2008)
[23] RoyChoudhury, A., Identifiability of a coalescent-based population tree model, J. Appl. Prob., 51, 921-929 (2014) · Zbl 1333.92053
[24] RoyChoudhury, A., Composite likelihood-based inferences on genetic data from dependent loci, J. Math. Biol., 62, 65-80 (2011) · Zbl 1232.62153
[25] RoyChoudhury, A., Approximate likelihood estimation of divergence time range using a coalescent-based model, Evol. Bioinform., 9, 499-509 (2013)
[26] RoyChoudhury, A.; Felsenstein, J.; Thompson, E. A., A two-stage pruning algorithm for likelihood computation for a population tree, Genetics, 180, 1095-1105 (2008)
[27] RoyChoudhury, A.; Thompson, E. A., Ascertainment correction for a population tree via a pruning algorithm for likelihood computation, Theoret. Popul. Biol., 82, 59-65 (2012) · Zbl 1404.92130
[28] Schröder, E., Vier combinatorische Probleme, Z. Math. Phys., 15, 361-370 (1870)
[29] Shirali, S.; Vasudeva, H. L., Metric Spaces (2006), Springer · Zbl 1095.54001
[30] Stromberg, K. R., An Introduction to Classical Real Analysis (1981), Wadsworth International · Zbl 0454.26001
[31] Tamura, K.; Peterson, D.; Peterson, N.; Stecher, G.; Nei, M.; Kumar, S., MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods, Mol. Biol. Evol., 28, 10, 2731-2739 (2011)
[32] Wald, A., Note on the consistency of the maximum likelihood estimate, Ann. Math. Statist., 595-601 (1949) · Zbl 0034.22902
[33] Wang, H.-C.; Susko, E.; Roger, A. J., An amino acid substitution-selection model adjusts residue fitness to improve phylogenetic estimation, Mol. Biol. Evol. (2014)
[34] Weyenberg, G.; Huggins, P. M.; Schardl, C. L.; Howe, D. K.; Yoshida, R., kdetrees: Nonparametric estimation of phylogenetic tree distributions, Bioinformatics, btu258 (2014)
[36] Yang, Z., Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods, Syst. Biol., 43, 3, 329-342 (1994)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases the data have been complemented or enhanced with data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.