
Suboptimal behavior of Bayes and MDL in classification under misspecification. (English) Zbl 1470.62084

Summary: We show that forms of Bayesian and MDL inference that are often applied to classification problems can be inconsistent: there exists a learning problem such that, for all amounts of data, the generalization errors of the MDL classifier and of the Bayes classifier based on the Bayesian posterior both remain bounded away from the smallest achievable generalization error. From a Bayesian point of view, the result can be reinterpreted as saying that Bayesian inference can be inconsistent under misspecification, even for countably infinite models. We discuss the result extensively from both a Bayesian and an MDL perspective.
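
To make the phenomenon concrete, here is a minimal numerical sketch in Python. It is an illustrative toy example, not the construction from the paper: the two-model class, the parameter values, and the input-free setup are all assumptions made for exposition. It shows how a Bayesian posterior formed under log-loss can concentrate on the model in a misspecified class whose induced classifier has a strictly larger 0/1 generalization error than another model in the same class.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup (assumption for illustration): one input value,
# true P(Y = 1) = 0.4, and a two-model class in which neither conditional
# distribution equals the truth.
eta = 0.4                        # true P(Y = 1)
models = {"A": 0.2, "B": 0.6}    # each model's predicted P(Y = 1)

def expected_log_loss(p):
    # Expected log-loss (nats) of predicting P(Y = 1) = p under the truth.
    return -(eta * np.log(p) + (1 - eta) * np.log(1 - p))

def zero_one_risk(p):
    # 0/1 generalization error of the plug-in classifier 1{p > 1/2}.
    return 1 - eta if p > 0.5 else eta

# Sample labels and form the Bayesian posterior over the two models under a
# uniform prior: log-posterior = -(cumulative log-loss) + constant.
n = 20000
ys = rng.binomial(1, eta, size=n)
cum_loss = {m: -(ys * np.log(p) + (1 - ys) * np.log(1 - p)).sum()
            for m, p in models.items()}
shift = min(cum_loss.values())
weights = {m: np.exp(-(loss - shift)) for m, loss in cum_loss.items()}
total = sum(weights.values())

for m, p in models.items():
    print(f"model {m}: expected log-loss = {expected_log_loss(p):.4f}, "
          f"0/1 risk = {zero_one_risk(p):.2f}, "
          f"posterior weight = {weights[m] / total:.4f}")

With n = 20000 observations the posterior weight on model B is essentially 1: B has the smaller expected log-loss (about 0.754 nats versus 0.778 for A), yet its plug-in classifier errs with probability 0.6, while model A's classifier achieves the smallest 0/1 error attainable in the class, 0.4. The paper's result shows that a gap of this kind can persist for all sample sizes, even with countably infinite models.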

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)

References:

[1] Barron, A. R. (1985). Logically smooth density estimation. PhD thesis, Department of EE, Stanford University, Stanford, CA.
[2] Barron, A. R. (1998). Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In Bayesian statistics, vol. 6 (pp. 27-52). Oxford University Press. · Zbl 0974.62020
[3] Barron, A. R., Rissanen, J., & Yu, B. (1998). The MDL principle in coding and modeling. IEEE Trans. Inform. Theory, 44 (6), 2743-2760. · Zbl 0933.94013 · doi:10.1109/18.720554
[4] Barron, A. R. (1990). Complexity regularization with application to artificial neural networks. In Nonparametric functional estimation and related topics (pp. 561-576). Kluwer Academic Publishers. · Zbl 0739.62001
[5] Barron, A. R., & Cover, T. M. (1991). Minimum complexity density estimation. IEEE Trans. Inform. Theory, 37 (4), 1034-1054. · Zbl 0743.62003 · doi:10.1109/18.86996
[6] Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. John Wiley. · Zbl 0796.62002
[7] Blackwell, D., & Dubins, L. (1962). Merging of opinions with increasing information. The Annals of Mathematical Statistics, 33, 882-886. · Zbl 0109.35704
[8] Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1987). Occam’s razor. Information Processing Letters, 24, 377-380. · Zbl 0653.68084 · doi:10.1016/0020-0190(87)90114-1
[9] Bunke, O., & Milhaud, X. (1998). Asymptotic behaviour of Bayes estimates under possibly incorrect models. The Annals of Statistics, 26, 617-644. · Zbl 0929.62022 · doi:10.1214/aos/1028144851
[10] Clarke, B. (2004). Comparing Bayes and non-Bayes model averaging when model approximation error cannot be ignored. Journal of Machine Learning Research, 4 (4), 683-712. · Zbl 1102.68488 · doi:10.1162/153244304773936090
[11] Comley, J. W., & Dowe, D. L. (2005). Minimum message length and generalised Bayesian nets with asymmetric languages. In P. D. Grünwald, I. J. Myung, & M. A. Pitt (Eds.), Advances in minimum description length: theory and applications. MIT Press.
[12] Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. Wiley. · Zbl 0762.94001
[13] Diaconis, P., & Freedman, D. (1986). On the consistency of Bayes estimates. The Annals of Statistics, 14 (1), 1-26. · Zbl 0595.62022
[14] Doob, J. L. (1949). Application of the theory of martingales. In Le Calcul de Probabilités et ses Applications. Colloques Internationaux du Centre National de la Recherche Scientifique (pp. 23-27), Paris.
[15] Grünwald, P. D. (2007). The minimum description length principle. Cambridge, MA: MIT Press.
[16] Grünwald, P. D. (1998). The minimum description length principle and reasoning under uncertainty. PhD thesis, University of Amsterdam, The Netherlands.
[17] Grünwald, P. D. (2005). MDL tutorial. In P. D. Grünwald, I. J. Myung, & M. A. Pitt (Eds.), Advances in minimum description length: theory and applications. MIT Press.
[18] Heckerman, D., Chickering, D.M., Meek, C., Rounthwaite, R., & Kadie, C. (2000). Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1, 49-75. · Zbl 1008.68132 · doi:10.1162/153244301753344614
[19] Hutter, M. (2005). Fast non-parametric Bayesian inference on infinite trees. In Proceedings of the 15th international workshop on artificial intelligence and statistics (AISTATS ’05).
[20] Jordan, M. I. (1995). Why the logistic function? A tutorial discussion on probabilities and neural networks. Computational Cognitive Science Tech. Rep. 9503, MIT.
[21] Kearns, M., Mansour, Y., Ng, A.Y., & Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Machine Learning, 27, 7-50. · doi:10.1023/A:1007344726582
[22] Kleijn, B., & van der Vaart, A. (2004). Misspecification in infinite-dimensional Bayesian statistics. Submitted. · Zbl 1095.62031
[23] Li, J. K. (1997). Estimation of mixture models. PhD thesis, Yale University, Department of Statistics.
[24] McAllester, D. (1999). PAC-Bayesian model averaging. In Proceedings COLT ’99. · Zbl 0945.68157
[25] Meir, R., & Merhav, N. (1995). On the stochastic complexity of learning realizable and unrealizable rules. Machine Learning, 19, 241-261. · Zbl 0830.68109
[26] Platt, J. C. (1999). Probabilities for SV machines. In A. Smola, P. Bartlett, B. Schöelkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 61-74). MIT Press.
[27] Quinlan, J., & Rivest, R. (1989). Inferring decision trees using the minimum description length principle. Information and Computation, 80, 227-248. · Zbl 0664.94015 · doi:10.1016/0890-5401(89)90010-2
[28] Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11, 416-431. · Zbl 0513.62005
[29] Rissanen, J. (1989). Stochastic complexity in statistical inquiry. World Scientific. · Zbl 0800.68508
[30] Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211-244. · Zbl 0997.68109 · doi:10.1162/15324430152748236
[31] Viswanathan, M., Wallace, C. S., Dowe, D. L., & Korb, K. B. (1999). Finding cutpoints in noisy binary sequences - a revised empirical evaluation. In Proc. 12th Australian joint conf. on artif. intelligence, vol. 1747 of Lecture notes in artificial intelligence (LNAI) (pp. 405-416), Sydney, Australia.
[32] Wallace, C. S. (2005). Statistical and Inductive Inference by Minimum Message Length. New York: Springer. · Zbl 1085.62002
[33] Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Computer Journal, 11, 185-195. · Zbl 0164.46208
[34] Wallace, C. S., & Dowe, D. L. (1999a). Minimum message length and Kolmogorov complexity. Computer Journal, 42 (4), 270-283. Special issue on Kolmogorov complexity. · Zbl 0946.68062
[35] Wallace, C. S., & Dowe, D. L. (1999b). Refinements of MDL and MML coding. Computer Journal, 42 (4), 330-337. Special issue on Kolmogorov complexity. · Zbl 0937.68065
[36] Wallace, C. S., & Patrick, J. D. (1993). Coding decision trees. Machine Learning, 11, 7-22. · Zbl 0850.94027 · doi:10.1023/A:1022646101185
[37] Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Trans. Inform. Theory, 44(4), 1424-1439. · Zbl 0935.94005 · doi:10.1109/18.681319