
Using stacking to average Bayesian predictive distributions (with discussion). (English) Zbl 1407.62090

Summary: Bayesian model averaging is flawed in the \(\mathcal{M}\)-open setting, in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize it to the combination of predictive distributions. We extend the utility function to any proper scoring rule and use Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), Pseudo-BMA, and a variant of Pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with bootstrapped Pseudo-BMA as an approximate alternative when computational cost is an issue.
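The recommended method chooses model weights \(w\) on the simplex \(\mathcal{S}_K\) to maximize the leave-one-out log score of the combined predictive distribution, \(\max_{w \in \mathcal{S}_K} \frac{1}{n} \sum_{i=1}^{n} \log \sum_{k=1}^{K} w_k\, p(y_i \mid y_{-i}, M_k)\). The following is a minimal R sketch of this optimization and of the bootstrapped Pseudo-BMA alternative. It assumes lpd is an n-by-K matrix of pointwise leave-one-out log predictive densities (in practice computed with Pareto smoothed importance sampling); the function names and the softmax parameterization of the simplex are illustrative choices, not the authors' implementation.

# Stacking of predictive distributions: maximize the mean leave-one-out
# log score of the weighted mixture over the simplex. lpd is an n-by-K
# matrix with lpd[i, k] = log p(y_i | y_{-i}, M_k).
stacking_weights <- function(lpd) {
  K <- ncol(lpd)
  m <- apply(lpd, 1, max)  # row maxima, for a numerically stable log-sum-exp
  objective <- function(theta) {
    w <- exp(c(theta, 0)); w <- w / sum(w)  # softmax map to the simplex
    -mean(m + log(exp(lpd - m) %*% w))      # negative mean log score
  }
  theta <- optim(rep(0, K - 1), objective, method = "BFGS")$par
  w <- exp(c(theta, 0))
  w / sum(w)
}

# Bootstrapped Pseudo-BMA: log-score weights stabilized by the Bayesian
# bootstrap. Each replicate draws Dirichlet(1, ..., 1) observation weights,
# forms the weighted total elpd per model, and the resulting softmax
# weights are averaged over replicates.
pseudobma_bb_weights <- function(lpd, n_bb = 1000) {
  n <- nrow(lpd); K <- ncol(lpd)
  w <- matrix(0, n_bb, K)
  for (b in 1:n_bb) {
    alpha <- rgamma(n, 1); alpha <- alpha / sum(alpha)
    z <- n * colSums(alpha * lpd)           # bootstrap-weighted elpd estimates
    w[b, ] <- exp(z - max(z)) / sum(exp(z - max(z)))
  }
  colMeans(w)
}

Both functions return a length-K weight vector summing to one. The loo package in R provides a production implementation of both methods through loo_model_weights() with method = "stacking" or method = "pseudobma".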

MSC:

62F15 Bayesian inference

Software:

Stan; ADVI; R

References:

[1] Adams, J., Bishin, B. G., and Dow, J. K. (2004). “Representation in Congressional Campaigns: Evidence for Discounting/Directional Voting in U.S. Senate Elections.” Journal of Politics, 66(2): 348–373.
[2] Akaike, H. (1978). “On the likelihood of a time series model.” The Statistician, 217–235.
[3] Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. John Wiley & Sons. · Zbl 0796.62002
[4] Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). “Variational inference: A review for statisticians.” Journal of the American Statistical Association, 112(518): 859–877. · doi:10.1080/01621459.2017.1285773
[5] Breiman, L. (1996). “Stacked regressions.” Machine Learning, 24(1): 49–64. · Zbl 0849.68104
[6] Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach. Springer, 2nd edition. · Zbl 1005.62007
[7] Clarke, B. (2003). “Comparing Bayes model averaging and stacking when model approximation error cannot be ignored.” Journal of Machine Learning Research, 4: 683–712. · Zbl 1102.68488
[8] Clyde, M. and Iversen, E. S. (2013). “Bayesian model averaging in the M-open framework.” In Damien, P., Dellaportas, P., Polson, N. G., and Stephens, D. A. (eds.), Bayesian Theory and Applications, 483–498. Oxford University Press.
[9] Fokoue, E. and Clarke, B. (2011). “Bias-variance trade-off for prequential model list selection.” Statistical Papers, 52(4): 813–833. · Zbl 1229.62072 · doi:10.1007/s00362-009-0289-6
[10] Geisser, S. and Eddy, W. F. (1979). “A Predictive Approach to Model Selection.” Journal of the American Statistical Association, 74(365): 153–160. · Zbl 0401.62036 · doi:10.1080/01621459.1979.10481632
[11] Gelfand, A. E. (1996). “Model determination using sampling-based methods.” In Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.), Markov Chain Monte Carlo in Practice, 145–162. Chapman & Hall. · Zbl 0840.62003
[12] Gelman, A. (2004). “Parameterization and Bayesian modeling.” Journal of the American Statistical Association, 99(466): 537–545. · Zbl 1117.62343 · doi:10.1198/016214504000000458
[13] Gelman, A. and Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
[14] Gelman, A., Hwang, J., and Vehtari, A. (2014). “Understanding predictive information criteria for Bayesian models.” Statistics and Computing, 24(6): 997–1016. · Zbl 1332.62090 · doi:10.1007/s11222-013-9416-2
[15] George, E. I. (2010). “Dilution priors: Compensating for model space redundancy.” In Borrowing Strength: Theory Powering Applications – A Festschrift for Lawrence D. Brown, 158–165. Institute of Mathematical Statistics.
[16] Geweke, J. and Amisano, G. (2011). “Optimal prediction pools.” Journal of Econometrics, 164(1): 130–141. · Zbl 1441.62700 · doi:10.1016/j.jeconom.2011.02.017
[17] Geweke, J. and Amisano, G. (2012). “Prediction with misspecified models.” American Economic Review, 102(3): 482–486.
[18] Gneiting, T. and Raftery, A. E. (2007). “Strictly proper scoring rules, prediction, and estimation.” Journal of the American Statistical Association, 102(477): 359–378. · Zbl 1284.62093 · doi:10.1198/016214506000001437
[19] Gutiérrez-Peña, E. and Walker, S. G. (2005). “Statistical decision problems and Bayesian nonparametric methods.” International Statistical Review, 73(3): 309–330.
[20] Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). “Bayesian model averaging: A tutorial.” Statistical Science, 14(4): 382–401. · Zbl 1059.62525 · doi:10.1214/ss/1009212519
[21] Key, J. T., Pericchi, L. R., and Smith, A. F. M. (1999). “Bayesian model choice: What and why.” Bayesian Statistics, 6: 343–370. · Zbl 0956.62007
[22] Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). “Automatic differentiation variational inference.” Journal of Machine Learning Research, 18(1): 430–474. · Zbl 1437.62109
[23] Le, T. and Clarke, B. (2017). “A Bayes interpretation of stacking for M-complete and M-open settings.” Bayesian Analysis, 12(3): 807–829. · Zbl 1384.62298 · doi:10.1214/16-BA1023
[24] LeBlanc, M. and Tibshirani, R. (1996). “Combining estimates in regression and classification.” Journal of the American Statistical Association, 91(436): 1641–1650. · Zbl 0881.62046
[25] Li, M. and Dunson, D. B. (2016). “A framework for probabilistic inferences from imperfect models.” ArXiv e-prints:1611.01241.
[26] Liang, F., Paulo, R., Molina, G., Clyde, M. A., and Berger, J. O. (2008). “Mixtures of g priors for Bayesian variable selection.” Journal of the American Statistical Association, 103(481): 410–423. · Zbl 1335.62026 · doi:10.1198/016214507000001337
[27] Madigan, D., Raftery, A. E., Volinsky, C., and Hoeting, J. (1996). “Bayesian model averaging.” In Proceedings of the AAAI Workshop on Integrating Multiple Learned Models, 77–83.
[28] Merz, C. J. and Pazzani, M. J. (1999). “A principal components approach to combining regression estimates.” Machine Learning, 36(1–2): 9–32.
[29] Montgomery, J. M. and Nyhan, B. (2010). “Bayesian model averaging: Theoretical developments and practical applications.” Political Analysis, 18(2): 245–270.
[30] Piironen, J. and Vehtari, A. (2017). “Comparison of Bayesian predictive methods for model selection.” Statistics and Computing, 27(3): 711–735. · Zbl 1505.62321 · doi:10.1007/s11222-016-9649-y
[31] Rubin, D. B. (1981). “The Bayesian bootstrap.” Annals of Statistics, 9(1): 130–134.
[32] Smyth, P. and Wolpert, D. (1998). “Stacked density estimation.” In Advances in Neural Information Processing Systems, 668–674.
[33] Stan Development Team (2017). Stan modeling language: User’s guide and reference manual. Version 2.16.0, http://mc-stan.org/.
[34] Stone, M. (1977). “An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion.” Journal of the Royal Statistical Society. Series B (Methodological), 39(1): 44–47. · Zbl 0355.62002
[35] Ting, K. M. and Witten, I. H. (1999). “Issues in stacked generalization.” Journal of Artificial Intelligence Research, 10: 271–289. · Zbl 0915.68075 · doi:10.1613/jair.594
[36] Vehtari, A., Gelman, A., and Gabry, J. (2017a). “Pareto smoothed importance sampling.” ArXiv e-prints:1507.02646.
[37] Vehtari, A., Gelman, A., and Gabry, J. (2017b). “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.” Statistics and Computing, 27(5): 1413–1432. · Zbl 1505.62408 · doi:10.1007/s11222-016-9696-4
[38] Vehtari, A. and Lampinen, J. (2002). “Bayesian model assessment and comparison using cross-validation predictive densities.” Neural Computation, 14(10): 2439–2468. · Zbl 1002.62029 · doi:10.1162/08997660260293292
[39] Vehtari, A. and Ojanen, J. (2012). “A survey of Bayesian predictive methods for model assessment, selection and comparison.” Statistics Surveys, 6: 142–228. · Zbl 1302.62011 · doi:10.1214/12-SS102
[40] Wagenmakers, E.-J. and Farrell, S. (2004). “AIC model selection using Akaike weights.” Psychonomic Bulletin & Review, 11(1): 192–196.
[41] Watanabe, S. (2010). “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research, 11: 3571–3594. · Zbl 1242.62024
[42] Wolpert, D. H. (1992). “Stacked generalization.” Neural Networks, 5(2): 241–259.
[43] Wong, H. and Clarke, B. (2004). “Improvement over Bayes prediction in small samples in the presence of model uncertainty.” Canadian Journal of Statistics, 32(3): 269–283. · Zbl 1061.62041 · doi:10.2307/3315929
[44] Yang, Y. and Dunson, D. B. (2014). “Minimax Optimal Bayesian Aggregation.” ArXiv e-prints:1403.1345.
[45] Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). “Supplementary Material to “Using stacking to average Bayesian predictive distributions”.” Bayesian Analysis. · doi:10.1214/17-BA1091