
Infinite-dimensional gradient-based descent for alpha-divergence minimisation. (English) Zbl 1486.62065

Summary: This paper introduces the \((\alpha, \Gamma)\)-descent, an iterative algorithm which operates on measures and performs \(\alpha\)-divergence minimisation in a Bayesian framework. This gradient-based procedure extends the commonly used variational approximation by adding a prior on the variational parameters in the form of a measure. We prove that for a rich family of functions \(\Gamma\), this algorithm leads at each step to a systematic decrease in the \(\alpha\)-divergence and derive convergence results. Our framework recovers the Entropic Mirror Descent algorithm and provides an alternative algorithm that we call the Power Descent. Moreover, in its stochastic formulation, the \((\alpha, \Gamma)\)-descent allows one to optimise the mixture weights of any given mixture model without any information on the underlying distribution of the variational parameters. This renders our method compatible with many choices of parameter updates and applicable to a wide range of Machine Learning tasks. We demonstrate empirically, on both toy and real-world examples, the benefit of using the Power Descent and going beyond the Entropic Mirror Descent framework, which fails as the dimension grows.
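The exact \((\alpha, \Gamma)\)-descent update is defined in the paper itself; as a rough illustration of the kind of mixture-weight optimisation described above, the sketch below implements a stochastic Entropic Mirror Descent step on the weights of a fixed one-dimensional Gaussian mixture, using a Monte Carlo estimate of the \(\alpha\)-divergence gradient. This is a minimal sketch under stated assumptions, not the authors' implementation; all names (emd_step, log_p, the toy target and kernels, the step size eta) are illustrative placeholders.

```python
# Illustrative sketch only (not the authors' (alpha, Gamma)-descent): one stochastic
# Entropic Mirror Descent step on the weights of a fixed 1-D Gaussian mixture,
# using Monte Carlo estimates of the alpha-divergence gradient with
# f_alpha(u) = (u^alpha - 1) / (alpha * (alpha - 1)).
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

def emd_step(weights, thetas, log_p, alpha=0.5, eta=0.1, n_samples=1000):
    """One Entropic Mirror Descent step on mixture weights (1-D Gaussian kernels)."""
    means, stds = thetas[:, 0], thetas[:, 1]
    # Sample from the current mixture q(x) = sum_j w_j N(x; mean_j, std_j^2).
    comp = rng.choice(len(weights), size=n_samples, p=weights)
    x = rng.normal(means[comp], stds[comp])
    # Component log-densities and mixture log-density at the samples.
    log_k = log_gauss(x[:, None], means[None, :], stds[None, :])   # shape (n, J)
    log_q = np.log(np.exp(log_k) @ weights)                         # shape (n,)
    # f_alpha'(q/p) at the samples, with f_alpha'(u) = u^(alpha-1) / (alpha - 1).
    ratio = np.exp(log_q - log_p(x))
    fprime = ratio ** (alpha - 1.0) / (alpha - 1.0)
    # Per-component gradient estimate: E_q[ f_alpha'(q/p) * k_j(x) / q(x) ].
    grad = np.mean(fprime[:, None] * np.exp(log_k - log_q[:, None]), axis=0)
    # Multiplicative (mirror-descent) update, then renormalise on the simplex.
    new_w = weights * np.exp(-eta * grad)
    return new_w / new_w.sum()

# Toy usage: fit the weights of three fixed kernels to a standard normal target.
thetas = np.array([[-3.0, 1.0], [0.0, 1.0], [3.0, 1.0]])   # (mean, std) per kernel
w = np.ones(3) / 3
for _ in range(50):
    w = emd_step(w, thetas, log_p=lambda x: log_gauss(x, 0.0, 1.0))
print(np.round(w, 3))   # weight concentrates on the kernel matching the target
```

The Power Descent studied in the paper replaces the exponential update with a different choice of \(\Gamma\) within the same iterative scheme; the sketch only mirrors the Entropic Mirror Descent special case on a toy problem.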

MSC:

62F15 Bayesian inference
62F30 Parametric inference under constraints
62F35 Robustness and adaptive procedures (parametric inference)
62G07 Density estimation
62L99 Sequential statistical methods
