
Minimum description length revisited. (English) Zbl 1476.62020

Summary: This is an up-to-date introduction to and overview of the Minimum Description Length (MDL) Principle, a theory of inductive inference that can be applied to general problems in statistics, machine learning and pattern recognition. While MDL was originally based on data compression ideas [J. Rissanen, Ann. Stat. 11, 416–431 (1983; Zbl 0513.62005)], this introduction can be read without any knowledge thereof. It takes into account all major developments since 2007, the last time an extensive overview was written. These include new methods for model selection, model averaging and hypothesis testing, as well as the first completely general definition of MDL estimators. Incorporating these developments, MDL can be seen as a powerful extension of both penalized likelihood and Bayesian approaches, in which penalization functions and prior distributions are replaced by more general luckiness functions, average-case methodology is replaced by a more robust worst-case approach, and methods classically viewed as highly distinct, such as AIC versus BIC and cross-validation versus Bayes, can to a large extent be viewed from a unified perspective.
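As a brief illustration for readers new to the topic (a minimal sketch of standard definitions from the MDL literature; the notation and the luckiness function \(a(\theta)\) are chosen here for exposition and are not taken from the paper under review): for a model \(\{p_\theta : \theta \in \Theta\}\) with maximum likelihood estimator \(\hat\theta(x^n)\), the normalized maximum likelihood (NML) distribution and its luckiness generalization are
\[
\bar p_{\mathrm{nml}}(x^n) = \frac{p_{\hat\theta(x^n)}(x^n)}{\int p_{\hat\theta(y^n)}(y^n)\, dy^n},
\qquad
\bar p_{\mathrm{lnml}}(x^n) = \frac{\sup_{\theta \in \Theta} p_\theta(x^n)\, a(\theta)}{\int \sup_{\theta \in \Theta} p_\theta(y^n)\, a(\theta)\, dy^n},
\]
with sums replacing integrals for discrete sample spaces. Taking \(a \equiv 1\) recovers standard NML and its worst-case (minimax regret) optimality, while \(-\log a(\theta)\) plays the role of a penalization term or log-prior, which is the sense in which luckiness functions generalize penalties and prior distributions.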

MSC:

62A01 Foundations and philosophical topics in statistics
62B10 Statistical aspects of information-theoretic topics
62F07 Statistical ranking and selection procedures
62M20 Inference from stochastic processes and prediction

Citations:

Zbl 0513.62005

Software:

ElemStatLearn; Krimp

References:

[1] Rissanen, J., Modeling by the shortest data description, Automatica14 (1978) 465-471. · Zbl 0418.93079
[2] Rissanen, J., Stochastic Complexity in Statistical Inquiry (World Scientific, Hackensack, 1989). · Zbl 0800.68508
[3] Barron, A., Rissanen, J. and Yu, B., The minimum description length principle in coding and modeling, IEEE Trans. Inform. Theory44(6) (1998) 2743-2760. · Zbl 0933.94013
[4] Grünwald, P., The Minimum Description Length Principle (The MIT Press, Cambridge, 2007).
[5] Grünwald, P. D., Myung, I. J. and Pitt, M. A., Advances in Minimum Description Length: Theory and Applications (The MIT Press, 2005).
[6] Wallace, C. S. and Boulton, D. M., An information measure for classification, Comput. J.11 (1968) 185-195. · Zbl 0164.46208
[7] Vitányi, P. M. B. and Li, M., Minimum description length induction, Bayesianism, and Kolmogorov complexity, IEEE Trans. Inform. TheoryIT-46(2) (2000) 446-464. · Zbl 1013.68096
[8] T. F. Sterkenburg, Universal prediction: A philosophical investigation, Ph.D. thesis, University of Groningen (2018).
[9] Kass, R. E. and Raftery, A. E., Bayes factors, J. Am. Stat. Assoc.90(430) (1995) 773-795. · Zbl 0846.62028
[10] Shtarkov, Yu. M., Universal sequential coding of single messages, Probl. Inf. Transm.23(3) (1987) 3-17. · Zbl 0668.94005
[11] A. Suzuki and K. Yamanishi, Exact calculation of normalized maximum likelihood code length using Fourier analysis, arXiv:1801.03705 [math.ST].
[12] Rissanen, J., Fisher information and stochastic complexity, IEEE Trans. Inform. Theory42(1) (1996) 40-47. · Zbl 0856.94006
[13] Grünwald, P. and Mehta, N., A tight excess risk bound via a unified PAC-Bayesian-Rademacher-Shtarkov-MDL complexity, in Proc. Thirtieth Conf. Algorithmic Learning Theory (ALT 2019) (2019). arXiv:1710.07732 [stat.ME]
[14] Rissanen, J., Universal coding, information, prediction and estimation, IEEE Trans. Inform. Theory30 (1984) 629-636. · Zbl 0574.62003
[15] Dawid, A. P., Present position and potential developments: Some personal views, statistical theory, the prequential approach, J. R. Stat. Soc. A147(2) (1984) 278-292. · Zbl 0557.62080
[16] van Erven, T., Grünwald, P. D. and de Rooij, S., Catching up faster in Bayesian model selection and model averaging, in Advances in Neural Information Processing Systems, Vol. 20 (Curran Associates, Inc.2008), pp. 417-424.
[17] P. Grünwald, R. de Heide and W. Koolen, Safe testing, arXiv:1906.07801 [math.ST].
[18] Rasmussen, C. E. and Ghahramani, Z., Occam’s razor, in Advances in Neural Information Processing Systems, Vol. 13 (The MIT Press, 2000), pp. 294-300.
[19] Myung, I. J., Balasubramanian, V. and Pitt, M. A., Counting probability distributions: Differential geometry and model selection, Proc. Natl. Acad. Sci. USA97 (2000) 11170-11175. · Zbl 0997.62099
[20] Cover, T. M. and Thomas, J. A., Elements of Information Theory (Wiley-Interscience, New York, 1991). · Zbl 0762.94001
[21] Bartlett, P., Grünwald, P., Harremoës, P., Hedayati, F. and Kotlowski, W., Horizon-independent optimal prediction with log-loss in exponential families, in Proc. 26th Conf. Learning Theory (COLT 2013) (2013).
[22] Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference and Prediction (Springer Verlag, 2001). · Zbl 0973.62007
[23] Miyaguchi, K. and Yamanishi, K., High-dimensional penalty selection via minimum description length principle, Mach. Learn.107(8-10) (2018) 1283-1302. · Zbl 1477.62137
[24] Cesa-Bianchi, N. and Lugosi, G., Prediction, Learning and Games (Cambridge University Press, Cambridge, 2006). · Zbl 1114.91001
[25] Dawid, A. P., The geometry of proper scoring rules, Ann. Inst. Stat. Math.59(1) (2007) 77-93. · Zbl 1108.62009
[26] Shalev-Shwartz, S. and Ben-David, S., Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, 2014). · Zbl 1305.68005
[27] Grünwald, P., Safe probability, J. Stat. Plan. Inference195 (2018) 47-63. · Zbl 1383.62014
[28] Yang, Y., Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation, Biometrika 92(4) (2005) 937-950. · Zbl 1151.62301
[29] van Erven, T., Grünwald, P. D. and de Rooij, S., Catching up faster by switching sooner: A predictive approach to adaptive estimation with an application to the AIC-BIC dilemma, J. R. Stat. Soc. B (Stat. Methodol.)74(3) (2012) 361-417. · Zbl 1411.62073
[30] van der Pas, S. and Grünwald, P. D., Almost the best of three worlds: Risk, consistency and optional stopping for the switch criterion in nested model selection, Stat. Sin.28 (2018) 229-253. · Zbl 1382.62006
[31] Barron, A., Roos, T. and Watanabe, K., Bayesian properties of normalized maximum likelihood and its fast computation, in Proc. IEEE Int. Symp. Information Theory (IEEE Press, 2014), pp. 1667-1671.
[32] Takeuchi, J. and Barron, A. R., Robustly minimax codes for universal data compression, in Proc. Twenty-First Symp. Information Theory and Its Applications (SITA ’98) (1998).
[33] Kotłowski, W., Grünwald, P. and de Rooij, S., Following the flattened leader, in Proc. 23rd Conf. Learning Theory (COLT) (2010), pp. 106-118.
[34] Grünwald, P. and Kotłowski, W., Prequential plug-in codes that achieve optimal redundancy rates even if the model is wrong, in Proc. 2010 Int. Symp. Information Theory (ISIT) (2010).
[35] Roos, T. and Rissanen, J., On sequentially normalized maximum likelihood models, in Proc. First Workshop Information Theoretic Methods in Science and Engineering (WITMSE-2008) (Tampere International Center for Signal Processing, 2008).
[36] Rissanen, J., Roos, T. and Myllymäki, P., Model selection by sequentially normalized least squares, J. Multivariate Anal.101(4) (2010) 839-849. · Zbl 1181.62144
[37] Takimoto, E. and Warmuth, M., The last-step minimax algorithm, in Proc. Eleventh Int. Conf. Algorithmic Learning Theory (ALT-2000) (2000). · Zbl 1046.68560
[38] Määttä, J., Schmidt, D. F. and Roos, T., Subset selection in linear regression using sequentially normalized least squares: Asymptotic theory, Scand. J. Stat.43(2) (2016) 382-395. · Zbl 1384.62239
[39] Määttä, J. and Roos, T., Robust sequential prediction in linear regression with Student’s \(t\)-distribution, in Proc. Fourteenth Int. Symp. Artificial Intelligence and Mathematics (ISAIM-2016) (2016).
[40] Watanabe, K. and Roos, T., Achievability of asymptotic minimax regret by horizon-dependent and horizon-independent strategies, J. Mach. Learn. Res.16(11) (2015) 2357-2375. · Zbl 1351.62017
[41] J. Q. Li, Estimation of mixture models, Ph.D. thesis, Yale University, New Haven, CT (1999).
[42] Li, J. Q. and Barron, A. R., Mixture density estimation, in Advances in Neural Information Processing Systems, eds. Solla, S. A., Leen, T. K. and Müller, K.-R., Vol. 12 (The MIT Press, Cambridge, 2000), pp. 279-285.
[43] Berger, J. O., Pericchi, L. R. and Varshavsky, J. A., Bayes factors and marginal distributions in invariant situations, Sankhyā, Indian J. Stat. A (1961-2002)60(3) (1998) 307-321. · Zbl 0973.62017
[44] Dass, S. C. and Berger, J. O., Unified conditional frequentist and Bayesian testing of composite hypotheses, Scand. J. Stat.30(1) (2003) 193-210. · Zbl 1034.62009
[45] Ly, A., Verhagen, J. and Wagenmakers, E. J., Harold Jeffreys’ default Bayes factor hypothesis tests: Explanation, extension, and application in psychology, J. Math. Psychol.72 (2016) 19-32. · Zbl 1357.62117
[46] Koller, D. and Friedman, N., Probabilistic Graphical Models: Principles and Techniques (The MIT Press, 2009). · Zbl 1183.68483
[47] Geiger, D. and Heckerman, D., A characterization of the Dirichlet distribution with application to learning Bayesian networks, in Proc. 11th Conf. Uncertainty in Artificial Intelligence (UAI-1995) (1995), pp. 196-207.
[48] Heckerman, D., Geiger, D. and Chickering, D., Learning Bayesian networks: The combination of knowledge and statistical data, Mach. Learn.20(3) (1995) 197-243. · Zbl 0831.68096
[49] Lam, W. and Bacchus, F., Learning Bayesian belief networks: An approach based on the MDL principle, Comput. Intell.10(3) (1994) 269-293.
[50] Bouckaert, R. R., Probabilistic network construction using the minimum description length principle, in Symbolic and Quantitative Approaches to Reasoning and Uncertainty, eds. Clarke, M., Kruse, R. and Moral, S., Vol. 747 (Springer, 1993), pp. 41-48.
[51] Miyaguchi, K., Matsushima, S. and Yamanishi, K., Sparse graphical modeling via stochastic complexity, in Proc. 2017 SIAM Int. Conf. Data Mining (SDM2017) (2017), pp. 723-731.
[52] Silander, T., Roos, T. and Myllymäki, P., Learning locally minimax optimal Bayesian networks, Int. J. Approx. Reason.51(5) (2010) 544-557.
[53] Kontkanen, P. and Myllymäki, P., A linear-time algorithm for computing the multinomial stochastic complexity, Inf. Process. Lett.103(6) (2007) 227-233. · Zbl 1184.68266
[54] Silander, T., Leppä-aho, J., Jääsaari, E. and Roos, T., Quotient normalized maximum likelihood criterion for learning Bayesian network structures, in Proc. Int. Conf. Artificial Intelligence and Statistics (2018), pp. 948-957. · Zbl 1407.62042
[55] Eggeling, R., Roos, T., Myllymäki, P. and Grosse, I., Robust learning of inhomogeneous PMMs, in Proc. Seventeenth Int. Conf. Artificial Intelligence and Statistics (2014), pp. 229-237.
[56] Eggeling, R., Roos, T., Myllymäki, P. and Grosse, I., Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data, BMC Bioinformatics16 (2015) 375:1-375:15.
[57] Suzuki, J., Jeffreys’ and BDeu priors for model selection, in Proc. 9th Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-2016) (2016).
[58] Szpankowski, W. and Weinberger, M. J., Minimax pointwise redundancy for memoryless models over large alphabets, IEEE Trans. Inform. Theory58(7) (2012) 4094-4104. · Zbl 1365.94231
[59] Roos, T., Monte Carlo estimation of minimax regret with an application to MDL model selection, in Proc. IEEE Information Theory Workshop 2008 (ITW-2008) (IEEE Press, 2008), pp. 284-288.
[60] Zou, Y. and Roos, T., On model selection, Bayesian networks, and the Fisher information integral, New Generat. Comput.35(1) (2017) 5-27. · Zbl 1450.62007
[61] Wu, T., Sugawara, S. and Yamanishi, K., Decomposed normalized maximum likelihood codelength criterion for selecting hierarchical latent variable models, in Proc. ACM Int. Conf. Knowledge Discovery and Data Mining (2017).
[62] S. Hirai and K. Yamanishi, An upper bound on normalized maximum likelihood codes for Gaussian mixture models, arXiv:1709.00925 [cs.IT]. · Zbl 1364.94754
[63] Hirai, S. and Yamanishi, K., Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering, IEEE Trans. Inform. Theory59(11) (2013) 7718-7727. · Zbl 1364.94754
[64] Suzuki, A., Miyaguchi, K. and Yamanishi, K., Structure selection for convolutive non-negative matrix factorization using normalized maximum likelihood coding, in Proc. 2016 IEEE 16th Int. Conf. Data Mining (ICDM) (IEEE, 2016), pp. 1221-1226.
[65] Miyamoto, K., Barron, A. R. and Takeuchi, J., Improved MDL estimators using local exponential family bundles applied to mixture families, in Proc. Int. Symp. Information Theory 2019 (2019).
[66] Watanabe, S., Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory, J. Mach. Learn. Res.11 (2010) 3571-3594. · Zbl 1242.62024
[67] Watanabe, S., A widely applicable Bayesian information criterion, J. Mach. Learn. Res.14 (2013) 867-897. · Zbl 1320.62058
[68] Barron, A. R. and Cover, T. M., Minimum complexity density estimation, IEEE Trans. Inform. Theory37(4) (1991) 1034-1054. · Zbl 0743.62003
[69] Zhang, T., From \(\epsilon \)-entropy to KL entropy: Analysis of minimum information complexity density estimation, Ann. Stat.34(5) (2006) 2180-2210. · Zbl 1106.62005
[70] Zhang, T., Information theoretical upper and lower bounds for statistical estimation, IEEE Trans. Inform. Theory52(4) (2006) 1307-1321. · Zbl 1320.94033
[71] Barron, A. and Luo, X., MDL procedures with \(L1\) penalty and their statistical risk, in Proc. 2008 Workshop on Information Theoretic Methods in Science and Engineering (2008).
[72] Barron, A., Huang, C., Li, J. and Luo, X., The MDL principle, penalized likelihoods, and statistical risk, in Festschrift in Honor of Jorma Rissanen on the Occasion of his 75th Birthday, eds. Grünwald, P., et al., (Tampere International Center for Signal Processing, 2008), pp. 33-63.
[73] Chatterjee, S. and Barron, A., Information theoretic validity of penalized likelihood, in Proc. 2014 IEEE Int. Symp. Information Theory (ISIT) (IEEE, 2014), pp. 3027-3031.
[74] Kawakita, M. and Takeuchi, J.-I., Barron and Cover’s theory in supervised learning and its application to Lasso, in Proc. 33rd Int. Conf. Machine Learning (2016), pp. 1958-1966.
[75] Brinda, W. and Klusowski, J., Finite-sample risk bounds for maximum likelihood estimation with arbitrary penalties, IEEE Trans. Inform. Theory64(4) (2018) 2727-2741. · Zbl 1391.94887
[76] Rissanen, J., Complexity of models, in Complexity, Entropy and the Physics of Information, ed. Zurek, W. H. (Addison-Wesley, 1991), pp. 117-125.
[77] Grünwald, P. and Langford, J., Suboptimal behavior of Bayes and MDL in classification under misspecification, Mach. Learn.66(2-3) (2007) 119-149. · Zbl 1078.68049
[78] Grünwald, P. and van Ommen, T., Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it, Bayesian Anal.12(4) (2017) 1069-1103. · Zbl 1384.62088
[79] R. de Heide, The Safe-Bayesian Lasso, Master’s thesis, Leiden University (2016).
[80] Grünwald, P. D., Viewing all models as “probabilistic”, in Proc. Twelfth ACM Conf. Computational Learning Theory (COLT ’99) (1999), pp. 171-182.
[81] Grünwald, P., The safe Bayesian: Learning the learning rate via the mixability gap, in Proc. 23rd Int. Conf. Algorithmic Learning Theory (ALT ’12) (Springer, 2012), pp. 169-183. · Zbl 1255.68086
[82] W. Zhou, V. Veitch, M. Austern, R. Adams and P. Orbanz, Compressibility and generalization in large-scale deep learning, arXiv:1804.05862 [stat.ML].
[83] Hochreiter, S. and Schmidhuber, J., Flat minima, Neural Comput.9(1) (1997) 1-42. · Zbl 0872.68150
[84] Hinton, G. E. and van Camp, D., Keeping the neural networks simple by minimizing the description length of the weights, in Proc. Sixth Annu. Conf. Computational Learning Theory (ACM, 1993), pp. 5-13.
[85] McAllester, D., PAC-Bayesian stochastic model selection, Mach. Learn.51(1) (2003) 5-21. · Zbl 1056.68122
[86] Blum, A. and Langford, J., PAC-MDL bounds, in Proc. Sixteenth Conf. Learning Theory (COLT ’03) (2003), pp. 344-357. · Zbl 1274.68293
[87] Dziugaite, G. K. and Roy, D. M., Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data, in Proc. Thirty-Third Conf. Uncertainty in Artificial Intelligence (UAI ’17) (2017).
[88] Langford, J. and Caruana, R., (Not) bounding the true error, in Advances in Neural Information Processing Systems 14, eds. Dietterich, T. G., Becker, S., and Ghahramani, Z. (The MIT Press, 2002), pp. 809-816.
[89] Rissanen, J., Information and Complexity in Statistical Modeling (Springer-Verlag, New York, 2007). · Zbl 1156.62005
[90] Vreeken, J., van Leeuwen, M. and Siebes, A., Krimp: Mining itemsets that compress, Data Min. Knowl. Disc.23(1) (2011) 169-214. · Zbl 1235.68071
[91] Koutra, D., Kang, U., Vreeken, J. and Faloutsos, C., Summarizing and understanding large graphs, Stat. Anal. Data Min., ASA Data Sci. J.8(3) (2015) 183-202. · Zbl 07260433
[92] Budhathoki, K. and Vreeken, J., Origo: Causal inference by compression, Knowl. Inf. Syst.56(2) (2018) 285-307.
[93] Tatti, N. and Vreeken, J., Finding good itemsets by packing data, in Eighth IEEE Int. Conf. Data Mining (IEEE, 2008), pp. 588-597.
[94] Grünwald, P. and Harremoës, P., Finiteness of redundancy, regret, Shtarkov sums, and Jeffreys integrals in exponential families, in Proc. IEEE Int. Symp. Information Theory (IEEE, 2009), pp. 714-718.
[95] Bar-Lev, S. K., Bshouty, D., Grünwald, P. and Harremoës, P., Jeffreys versus Shtarkov distributions associated with some natural exponential families, Stat. Methodol.7(6) (2010) 638-643. · Zbl 1463.62024
[96] Kojima, M. and Komaki, F., Relations between the conditional normalized maximum likelihood distributions and the latent information priors, IEEE Trans. Inform. Theory62(1) (2016) 539-553. · Zbl 1359.94264
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or perfect matching.