Bayesian additive regression trees using Bayesian model averaging. (English) Zbl 1386.68131

Summary: Bayesian Additive Regression Trees (BART) is a statistical sum-of-trees model. It can be considered a Bayesian version of machine learning tree ensemble methods in which the individual trees are the base learners. However, for datasets where the number of variables \(p\) is large, the algorithm can become inefficient and computationally expensive. Another method popular for high-dimensional data is random forests, a machine learning algorithm which grows trees using a greedy search for the best split points. However, its default implementation does not produce probabilistic estimates or predictions. We propose an alternative fitting algorithm for BART called BART-BMA, which uses Bayesian model averaging and a greedy search algorithm to obtain a posterior distribution more efficiently than BART for datasets with large \(p\). BART-BMA incorporates elements of both BART and random forests to offer a model-based algorithm which can deal with high-dimensional data. We have found that BART-BMA can be run in a reasonable time on a standard laptop for the “small \(n\), large \(p\)” scenario common in many areas of bioinformatics. We showcase this method using simulated data and data from two real proteomic experiments: one to distinguish between patients with cardiovascular disease and controls, and another to classify aggressive from non-aggressive prostate cancer. We compare our results with those of the main competing methods. Open source code written in R and Rcpp to run BART-BMA can be found at: https://github.com/BelindaHernandez/BART-BMA.git.
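For readers unfamiliar with the model class: a sum-of-trees model represents the response as \(Y = \sum_{j=1}^{m} g(X; T_j, M_j) + \varepsilon\), \(\varepsilon \sim N(0, \sigma^2)\), where \(g(X; T_j, M_j)\) denotes the prediction of tree \(T_j\) with leaf parameters \(M_j\).

The sketch below illustrates only the Bayesian model averaging step that the summary describes, using ordinary greedy CART trees (via the recommended R package rpart) as stand-in candidate models: each candidate is scored by a Gaussian BIC, the scores are converted to approximate posterior weights, models outside Occam’s window are discarded, and the surviving predictions are averaged. The variable names, the depth grid, and the 1/20 window ratio are illustrative assumptions; this is not the authors’ BART-BMA implementation (see the GitHub link above for that).

```r
## Minimal sketch: BIC-weighted Bayesian model averaging over greedy trees.
## Illustrative only -- not the BART-BMA algorithm from the paper.
library(rpart)  # CART-style greedy trees; a recommended package bundled with R

set.seed(1)
n <- 200; p <- 10
x <- as.data.frame(matrix(rnorm(n * p), n, p))
y <- 2 * x[[1]] + sin(3 * x[[2]]) + rnorm(n, sd = 0.5)
dat <- cbind(y = y, x)

## Gaussian BIC for a regression tree, up to a constant shared by all models;
## k counts the leaf means plus the residual variance.
tree_bic <- function(fit, y) {
  rss <- sum(resid(fit)^2)
  n <- length(y)
  k <- sum(fit$frame$var == "<leaf>") + 1
  n * log(rss / n) + k * log(n)
}

## Candidate models: greedily grown trees of increasing depth (a crude
## stand-in for a richer greedy search over sums of trees).
fits <- lapply(2:5, function(d) rpart(y ~ ., data = dat, maxdepth = d, cp = 0))
bic  <- sapply(fits, tree_bic, y = y)

## Approximate posterior model weights: w_m proportional to exp(-BIC_m / 2).
w <- exp(-(bic - min(bic)) / 2)
w <- w / sum(w)

## Occam's window: drop models much less probable than the best (ratio 1/20).
keep <- w / max(w) > 1 / 20
fits <- fits[keep]
w    <- w[keep] / sum(w[keep])

## Model-averaged prediction: weighted sum of the surviving trees' outputs.
pred <- Reduce(`+`, Map(function(f, wt) wt * predict(f, dat), fits, w))
```

The same weighting logic applies when the candidates are whole sums of trees rather than single trees; using BIC as an approximation to the marginal likelihood is what keeps the averaging cheap relative to full posterior sampling.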

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62F15 Bayesian inference
62P10 Applications of statistics to biology and medical sciences; meta analysis
