
Catching up faster by switching sooner: a predictive approach to adaptive estimation with an application to the AIC-BIC dilemma. With discussion and authors’ reply. (English) Zbl 1411.62073

Summary: Prediction and estimation based on Bayesian model selection and model averaging, and derived methods such as the Bayesian information criterion (BIC), do not always converge at the fastest possible rate. We identify the catch-up phenomenon as a novel explanation for the slow convergence of Bayesian methods, which inspires a modification of the Bayesian predictive distribution, called the switch distribution. When used as an adaptive estimator, the switch distribution does achieve optimal cumulative risk convergence rates in non-parametric density estimation and Gaussian regression problems. We show that the minimax cumulative risk is obtained under very weak conditions and without knowledge of the underlying degree of smoothness. Unlike other adaptive model selection procedures such as the Akaike information criterion (AIC) and leave-one-out cross-validation, BIC and Bayes factor model selection are typically statistically consistent. We show that this property is retained by the switch distribution, which thus solves the AIC-BIC dilemma for cumulative risk. The switch distribution has an efficient implementation. We compare its performance with AIC, BIC and Bayesian model selection and averaging on a regression problem with simulated data.
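To make the AIC-BIC contrast in the summary concrete, the following is a minimal illustrative sketch (not from the paper under review): it fits polynomials of increasing degree to simulated Gaussian regression data and scores each fit with AIC and BIC. BIC's per-parameter penalty log(n) exceeds AIC's penalty of 2 once n > 7, so BIC tends to select smaller models; the simulated curve and all variable names here are assumptions made for the example.

```python
# Illustrative sketch: AIC vs. BIC for polynomial degree selection
# in Gaussian regression on simulated data (not the paper's experiment).
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0.0, 1.0, n)
# True regression function is smooth but not a polynomial.
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

def aic_bic(degree):
    """Fit a degree-`degree` polynomial by least squares and return (AIC, BIC)."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid ** 2)  # maximum-likelihood noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 2  # polynomial coefficients plus the noise variance
    aic = 2 * k - 2 * loglik          # per-parameter penalty: 2
    bic = k * np.log(n) - 2 * loglik  # per-parameter penalty: log(n)
    return aic, bic

for d in range(1, 8):
    aic, bic = aic_bic(d)
    print(f"degree {d}: AIC = {aic:7.2f}   BIC = {bic:7.2f}")
```

Because the two criteria share the same likelihood term and differ only in the penalty, they can disagree about the best degree; the switch distribution is designed to retain BIC-style consistency while matching AIC-style cumulative risk rates.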

MSC:

62F15 Bayesian inference
62B10 Statistical aspects of information-theoretic topics
62F35 Robustness and adaptive procedures (parametric inference)
62L12 Sequential estimation
62-02 Research exposition (monographs, survey articles) pertaining to statistics

References:

[1] Akaike, H. ( 1974) A new look at the statistical model identification. IEEE Trans. Autom. Control, 19, 716– 723. · Zbl 0314.62039
[2] Akaike, H. ( 1979) A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika, 66, 237– 242. · Zbl 0407.62064
[3] Barron, A. R. ( 1998) Information‐theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In Bayesian Statistics 6 (eds J. M. Bernardo, J. O. Berger and A. F. M. Smith), pp. 27– 52. Oxford: Clarendon. · Zbl 0974.62020
[4] Barron, A. and Cover, T. ( 1991) Minimum complexity density estimation. IEEE Trans. Inform. Theor., 37, 1034– 1054. · Zbl 0743.62003
[5] Barron, A., Rissanen, J. and Yu, B. ( 1998) The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theor., 44, 2743– 2760. · Zbl 0933.94013
[6] Barron, A. and Sheu, C. ( 1991) Approximation of density functions by sequences of exponential families. Ann. Statist., 19, 1347– 1369. · Zbl 0739.62027
[7] Barron, A., Yang, Y. and Yu, B. ( 1994) Asymptotically optimal function estimation by minimum complexity criteria. In Proc. Int. Symp. Information Theory, Trondheim, p. 38. New York: Institute of Electrical and Electronics Engineers.
[8] Bernardo, J. and Smith, A. ( 1994) Bayesian Theory. Chichester: Wiley.
[9] Box, G. E. P. and Tiao, G. C. ( 1973) Bayesian Inference in Statistical Analysis. Reading: Addison‐Wesley. · Zbl 0271.62044
[10] Burnham, K. P. and Anderson, D. R. ( 2002) Model Selection and Multimodel Inference, 2nd edn. New York: Springer. · Zbl 1005.62007
[11] Cesa‐Bianchi, N. and Lugosi, G. ( 2006) Prediction, Learning and Games. Cambridge: Cambridge University Press. · Zbl 1114.91001
[12] Clarke, B. ( 1997) Online forecasting proposal. Technical Report. University of Dortmund, Dortmund.
[13] Clarke, B. S. and Barron, A. R. ( 1990) Information‐theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theor., 36, 453– 471. · Zbl 0709.62008
[14] Clarke, B. and Barron, A. ( 1994) Jeffreys’ prior is asymptotically least favorable under entropy risk. J. Statist. Planng Inf., 41, 37– 60. · Zbl 0820.62006
[15] Dawid, A. P. ( 1984) Statistical theory: the prequential approach. J. R. Statist. Soc. A, 147, 278– 292. · Zbl 0557.62080
[16] Dawid, A. P. ( 1992a) Prequential analysis, stochastic complexity and Bayesian inference. In Bayesian Statistics (eds J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith), pp. 109– 125. Oxford: Clarendon.
[17] Dawid, A. ( 1992b) Prequential data analysis. In Current Issues in Statistical Inference: Essays in Honor of D. Basu (eds M. Ghosh and P. Pathak), pp. 113– 125. Hayward: Institute of Mathematical Statistics. · Zbl 0850.62091
[18] De Luna, X. and Skouras, K. ( 2003) Choosing a model selection strategy. Scand. J. Statist., 30, 113– 128. · Zbl 1034.62032
[19] Diaconis, P. and Freedman, D. ( 1986) On the consistency of Bayes estimates. Ann. Statist., 14, 1– 26. · Zbl 0595.62022
[20] Donoho, D. and Johnstone, I. ( 1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425– 455. · Zbl 0815.62019
[21] van Erven, T. ( 2010) When data compression and statistics disagree: two frequentist challenges for the minimum description length principle. PhD Thesis. Leiden University, Leiden.
[22] van Erven, T., Grünwald, P. D. and de Rooij, S. ( 2008) Catching up faster by switching sooner: a prequential solution to the AIC‐BIC dilemma. Preprint arXiv:0807.1005. Centrum voor Wiskunde en Informatica, Amsterdam.
[23] Forster, M. ( 2001) The new science of simplicity. In Simplicity, Inference and Modelling (eds A. Zellner, H. Keuzenkamp and M. McAleer), pp. 83– 117. Cambridge: Cambridge University Press.
[24] Foster, D. and George, E. ( 1994) The risk inflation criterion for multiple regression. Ann. Statist., 22, 1947– 1975. · Zbl 0829.62066
[25] Ghosal, S., Lember, J. and van der Vaart, A. ( 2008) Nonparametric Bayesian model selection and averaging. Electron. J. Statist., 2, 63– 89. · Zbl 1135.62028
[26] Grünwald, P. D. ( 2007) The Minimum Description Length Principle. Cambridge: MIT Press.
[27] Hansen, M. and Yu, B. ( 2001) Model selection and the principle of minimum description length. J. Am. Statist. Ass., 96, 746– 774. · Zbl 1017.62004
[28] Hansen, M. and Yu, B. ( 2002) Minimum description length model selection criteria for generalized linear models. In Science and Statistics: Festschrift for Terry Speed. Hayward: Institute for Mathematical Statistics.
[29] Haussler, D. and Opper, M. ( 1997) Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451– 2492. · Zbl 0920.62007
[30] Herbster, M. and Warmuth, M. K. ( 1998) Tracking the best expert. Mach. Learn., 32, 151– 178. · Zbl 0912.68165
[31] Kass, R. E. and Raftery, A. E. ( 1995) Bayes factors. J. Am. Statist. Ass., 90, 773– 795. · Zbl 0846.62028
[32] Kontkanen, P., Myllymäki, P., Silander, T., Tirri, H. and Grünwald, P. D. ( 2000) On predictive distributions and Bayesian networks. J. Statist. Comput., 10, 39– 54.
[33] Koolen, W. and de Rooij, S. ( 2008a) Combining expert advice efficiently. In Proc. 21st A. Conf. Computational Learning Theory.
[34] Koolen, W. and de Rooij, S. ( 2008b) Combining expert advice efficiently. Preprint arXiv abs/0802.2015.
[35] Li, K. ( 1987) Asymptotic optimality for C_P, C_L, cross‐validation and generalized cross‐validation: discrete index set. Ann. Statist., 15, 958– 975. · Zbl 0653.62037
[36] Poland, J. and Hutter, M. ( 2005) Asymptotics of discrete MDL for online prediction. IEEE Trans. Inform. Theor., 51, 3780– 3795. · Zbl 1318.68101
[37] Rissanen, J. ( 1984) Universal coding, information, prediction, and estimation. IEEE Trans. Inform. Theor., 30, 629– 636. · Zbl 0574.62003
[38] Rissanen, J., Speed, T. P. and Yu, B. ( 1992) Density estimation by stochastic complexity. IEEE Trans. Inform. Theor., 38, 315– 323. · Zbl 0743.62004
[39] Schwarz, G. ( 1978) Estimating the dimension of a model. Ann. Statist., 6, 461– 464. · Zbl 0379.62005
[40] Shibata, R. ( 1983) Asymptotic mean efficiency of a selection of regression variables. Ann. Inst. Statist. Math., 35, 415– 423. · Zbl 0563.62043
[41] Shiryaev, A. N. ( 1996) Probability. Berlin: Springer. · Zbl 0909.01009
[42] Sober, E. ( 2004) The contest between parsimony and likelihood. Syst. Biol., 53, 644– 653.
[43] Speed, T. and Yu, B. ( 1993) Model selection and prediction: normal regression. Ann. Inst. Statist. Math., 45, 35– 54. · Zbl 0774.62093
[44] Stone, M. ( 1977) An asymptotic equivalence of choice of model by cross‐validation and Akaike’s criterion. J. R. Statist. Soc. B, 39, 44– 47. · Zbl 0355.62002
[45] Tibshirani, R. ( 1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58, 267– 288. · Zbl 0850.62538
[46] Volf, P. and Willems, F. ( 1998) Switching between two universal source coding algorithms. In Proc. Data Compression Conf., Snowbird, pp. 491– 500.
[47] Vovk, V. ( 1999) Derandomizing stochastic prediction strategies. Mach. Learn., 35, 247– 282. · Zbl 0941.68128
[48] Wong, H. and Clarke, B. ( 2004) Improvement over Bayes prediction in small samples in the presence of model uncertainty. Can. J. Statist., 32, 269– 283. · Zbl 1061.62041
[49] Yang, Y. ( 1999) Model selection for nonparametric regression. Statist. Sin., 9, 475– 499. · Zbl 0921.62051
[50] Yang, Y. ( 2000) Mixing strategies for density estimation. Ann. Statist., 28, 75– 87. · Zbl 1106.62322
[51] Yang, Y. ( 2005) Can the strengths of AIC and BIC be shared? Biometrika, 92, 937– 950. · Zbl 1151.62301
[52] Yang, Y. ( 2007a) Consistency of cross‐validation for comparing regression procedures. Ann. Statist., 35, 2450– 2473. · Zbl 1129.62039
[53] Yang, Y. ( 2007b) Prediction/estimation with simple linear models: is it really that simple? Econometr. Theor., 23, 1– 36. · Zbl 1441.62907
[54] Yang, Y. and Barron, A. ( 1999) Information‐theoretic determination of minimax rates of convergence. Ann. Statist., 27, 1564– 1599. · Zbl 0978.62008