Stochastic gradient descent and fast relaxation to thermodynamic equilibrium: a stochastic control approach. (English) Zbl 1491.82014

Summary: We study the convergence to equilibrium of an underdamped Langevin equation that is controlled by a linear feedback force. Specifically, we are interested in sampling the possibly multimodal invariant probability distribution of a Langevin system at small noise (or low temperature), for which the dynamics can easily get trapped inside metastable subsets of the phase space. We follow [Y. Chen et al., J. Math. Phys. 56, No. 11, 113302, 17 p. (2015; Zbl 1327.82063)] and consider a Langevin equation that is simulated at a high temperature, with the control playing the role of a friction that balances the additional noise so as to restore the original invariant measure at a lower temperature. We discuss different limits as the temperature ratio goes to infinity and prove convergence to a limit dynamics. It turns out that, depending on whether the lower (“target”) or the higher (“simulation”) temperature is fixed, the controlled dynamics converges either to the overdamped Langevin equation or to a deterministic gradient flow. This implies that (a) the ergodic limit and the large temperature separation limit do not commute in general and that (b) it is not possible to accelerate the speed of convergence to the ergodic limit by making the temperature separation larger and larger. We discuss the implications of these observations from the perspective of stochastic optimization algorithms and enhanced sampling schemes in molecular dynamics.
©2021 American Institute of Physics
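
The temperature-balancing idea in the summary can be illustrated on a one-dimensional harmonic model problem. The sketch below is not taken from the paper: it assumes unit mass, k_B = 1, a quadratic potential V(q) = k q^2 / 2, and the dynamics dq = p dt, dp = (-k q - gamma p + u) dt + sqrt(2 gamma T_sim) dW with linear feedback u = -kappa p. The names `feedback_gain` and `lyapunov_residual` are hypothetical. Under these assumptions the fluctuation-dissipation relation suggests kappa = gamma (T_sim / T_target - 1), and the code checks that the Gaussian with covariance diag(T_target / k, T_target) then satisfies the stationary Lyapunov equation.

```python
def feedback_gain(gamma, t_sim, t_target):
    """Extra linear friction kappa in u = -kappa * p (assumed parametrization).

    Chosen so that the total friction gamma + kappa balances noise of
    strength 2 * gamma * t_sim at the lower temperature t_target.
    """
    return gamma * (t_sim / t_target - 1.0)


def lyapunov_residual(k, gamma, t_sim, t_target):
    """Residual of A S + S A^T + Q = 0 for the candidate covariance S.

    A is the drift matrix of the controlled linear system, Q the noise
    covariance rate, and S = diag(t_target / k, t_target) the covariance
    of the Boltzmann-Gibbs measure at the target temperature.
    """
    g_eff = gamma + feedback_gain(gamma, t_sim, t_target)
    a = [[0.0, 1.0], [-k, -g_eff]]
    s = [[t_target / k, 0.0], [0.0, t_target]]
    q = [[0.0, 0.0], [0.0, 2.0 * gamma * t_sim]]
    # A S computed entrywise; S is symmetric, so S A^T = (A S)^T.
    a_s = [[sum(a[i][m] * s[m][j] for m in range(2)) for j in range(2)]
           for i in range(2)]
    return [[a_s[i][j] + a_s[j][i] + q[i][j] for j in range(2)]
            for i in range(2)]


if __name__ == "__main__":
    # Illustrative values only: simulate at T = 4, target T = 0.5.
    print("kappa =", feedback_gain(0.5, 4.0, 0.5))
    print("residual =", lyapunov_residual(2.0, 0.5, 4.0, 0.5))
```

A vanishing residual only confirms that the target Gaussian is stationary for this linear toy model. It says nothing about the speed of convergence, which is the question the paper addresses.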

MSC:

82C31 Stochastic methods (Fokker-Planck, Langevin, etc.) applied to problems in time-dependent statistical mechanics
82M31 Monte Carlo methods applied to problems in statistical mechanics
82M37 Computational molecular dynamics in statistical mechanics
93B52 Feedback control
82C35 Irreversible thermodynamics, including Onsager-Machlup theory
65K10 Numerical optimization and variational techniques

Citations:

Zbl 1327.82063

References:

[1] An, J.; Lu, J.; Ying, L., Stochastic modified equations for the asynchronous stochastic gradient descent, Inf. Inference, 9, 851 (2019) · Zbl 1528.65005 · doi:10.1093/imaiai/iaz030
[2] Arnold, A.; Erb, J., Sharp entropy decay for hypocoercive and non-symmetric Fokker-Planck equations with linear drift (2014)
[3] Bernardi, R. C.; Melo, M. C. R.; Schulten, K., Enhanced sampling techniques in molecular dynamics simulations of biological systems, Biochim. Biophys. Acta, Gen. Subj., 1850, 5, 872-877 (2015) · doi:10.1016/j.bbagen.2014.10.019
[4] Betancourt, M., The convergence of Markov chain Monte Carlo methods: From the Metropolis method to Hamiltonian Monte Carlo, Ann. Phys., 531, 3, 1700214 (2019) · Zbl 07759707 · doi:10.1002/andp.201700214
[5] Bogachev, V. I.; Röckner, M.; Shaposhnikov, S. V., Distances between transition probabilities of diffusions and applications to nonlinear Fokker-Planck-Kolmogorov equations, J. Funct. Anal., 271, 5, 1262-1300 (2016) · Zbl 1345.35051 · doi:10.1016/j.jfa.2016.05.016
[6] Borysenko, O.; Byshkin, M., CoolMomentum: A method for stochastic optimization by Langevin dynamics with simulated annealing, Sci. Rep., 11, 1, 10705 (2021) · doi:10.1038/s41598-021-90144-3
[7] Chen, Y.; Georgiou, T. T.; Pavon, M., Fast cooling for a system of stochastic oscillators, J. Math. Phys., 56, 113302 (2015) · Zbl 1327.82063 · doi:10.1063/1.4935435
[8] Chen, Y.; Georgiou, T. T.; Pavon, M., Optimal steering of a linear stochastic system to a final probability distribution, Part I, IEEE Trans. Autom. Control, 61, 1158-1169 (2016) · Zbl 1359.93532 · doi:10.1109/tac.2015.2457784
[9] Chen, Y.; Georgiou, T. T.; Pavon, M., Optimal steering of a linear stochastic system to a final probability distribution, Part II, IEEE Trans. Autom. Control, 61, 1170-1180 (2016) · Zbl 1359.93533 · doi:10.1109/tac.2015.2457791
[10] Cheng, X.; Chatterji, N. S.; Bartlett, P. L.; Jordan, M. I., Underdamped Langevin MCMC: A non-asymptotic analysis, in: Bubeck, S.; Perchet, V.; Rigollet, P. (eds.), Proceedings of the 31st Conference on Learning Theory, 300-323 (2018), PMLR
[11] Dai Pra, P., A stochastic control approach to reciprocal diffusion processes, Appl. Math. Optim., 23, 313-329 (1991) · Zbl 0728.93079 · doi:10.1007/bf01442404
[12] Dembo, A.; Zeitouni, O., Large Deviations Techniques and Applications (1998), Springer · Zbl 0896.60013
[13] Duncan, A. B.; Nüsken, N.; Pavliotis, G. A., Using perturbed underdamped Langevin dynamics to efficiently sample from probability distributions, J. Stat. Phys., 169, 1098-1131 (2017) · Zbl 1387.35580 · doi:10.1007/s10955-017-1906-8
[14] Duong, M. H.; Lamacz, A.; Peletier, M. A.; Schlichting, A.; Sharma, U., Quantification of coarse-graining error in Langevin and overdamped Langevin dynamics, Nonlinearity, 31, 10, 4517-4566 (2018) · Zbl 1394.35210 · doi:10.1088/1361-6544/aaced5
[15] Duong, M. H.; Peletier, M. A.; Zimmer, J., GENERIC formalism of a Vlasov-Fokker-Planck equation and connection to large-deviation principles, Nonlinearity, 26, 11, 2951-2971 (2013) · Zbl 1288.60029 · doi:10.1088/0951-7715/26/11/2951
[16] Eberle, A.; Guillin, A.; Zimmer, R., Couplings and quantitative contraction rates for Langevin dynamics, Ann. Probab., 47, 4, 1982-2010 (2019) · Zbl 1466.60160 · doi:10.1214/18-aop1299
[17] Ferré, G.; Touchette, H., Adaptive sampling of large deviations, J. Stat. Phys., 172, 6, 1525-1544 (2018) · Zbl 1416.65041 · doi:10.1007/s10955-018-2108-8
[18] Girolami, M.; Calderhead, B., Riemann manifold Langevin and Hamiltonian Monte Carlo methods, J. R. Stat. Soc., Ser. B, 73, 2, 123-214 (2011) · Zbl 1411.62071 · doi:10.1111/j.1467-9868.2010.00765.x
[19] Goodfellow, I. J.; Vinyals, O.; Bengio, Y.; LeCun, Y., Qualitatively characterizing neural network optimization problems
[20] Hartmann, C.; Schütte, C.; Zhang, W., Jarzynski’s equality, fluctuation theorems, and variance reduction: Mathematical analysis and numerical algorithms, J. Stat. Phys., 175, 6, 1214-1261 (2019) · Zbl 1416.60078 · doi:10.1007/s10955-019-02286-4
[21] Holley, R. A.; Kusuoka, S.; Stroock, D. W., Asymptotics of the spectral gap with applications to the theory of simulated annealing, J. Funct. Anal., 83, 2, 333-347 (1989) · Zbl 0706.58075 · doi:10.1016/0022-1236(89)90023-2
[22] Hu, K.; Kazeykina, A.; Ren, Z., Mean-field Langevin system, optimal control and deep neural networks (2019)
[23] Hu, K.; Ren, Z.; Šiška, D.; Szpruch, L., Mean-field Langevin dynamics and energy landscape of neural networks, Ann. Inst. Henri Poincaré Probab. Stat., 57, 4, 2043-2065 (2021) · Zbl 1492.65023 · doi:10.1214/20-aihp1140
[24] Hwang, C.-R.; Hwang-Ma, S.-Y.; Sheu, S.-J., Accelerating diffusions, Ann. Appl. Probab., 15, 2, 1433-1444 (2005) · Zbl 1069.60065 · doi:10.1214/105051605000000025
[25] Kontis, V.; Ottobre, M.; Zegarlinski, B., Markov semigroups with hypocoercive-type generator in infinite dimensions: Ergodicity and smoothing, J. Funct. Anal., 270, 9, 3173-3223 (2016) · Zbl 1341.47055 · doi:10.1016/j.jfa.2016.02.005
[26] Leimkuhler, B.; Matthews, C., Rational construction of stochastic numerical methods for molecular sampling, Appl. Math. Res. eXpress, 2013, 1, 34-56 (2013) · Zbl 1264.82102 · doi:10.1093/amrx/abs010
[27] Leimkuhler, B.; Matthews, C.; Vlaar, T., Partitioned integrators for thermodynamic parameterization of neural networks, Found. Data Sci., 1, 4, 457-489 (2019) · doi:10.3934/fods.2019019
[28] Lelièvre, T.; Nier, F.; Pavliotis, G. A., Optimal non-reversible linear drift for the convergence to equilibrium of a diffusion, J. Stat. Phys., 152, 2, 237-274 (2013) · Zbl 1276.82042 · doi:10.1007/s10955-013-0769-x
[29] Li, Q.; Tai, C.; E, W., Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations, J. Mach. Learn. Res., 20, 40, 1-47 (2019) · Zbl 1484.62106
[30] Löwe, M., Simulated annealing with time-dependent energy function via Sobolev inequalities, Stochastic Process. Appl., 63, 2, 221-233 (1996) · Zbl 0910.60060 · doi:10.1016/0304-4149(96)00070-1
[31] Loyola R, D. G.; Pedergnana, M.; Gimeno García, S., Smart sampling and incremental function learning for very large high dimensional data, Neural Networks, 78, 75-87 (2016) · Zbl 1414.68066 · doi:10.1016/j.neunet.2015.09.001
[32] Ma, Y.-A.; Chen, Y.; Jin, C.; Flammarion, N.; Jordan, M. I., Sampling can be faster than optimization, Proc. Natl. Acad. Sci. U. S. A., 116, 42, 20881-20885 (2019) · Zbl 1433.68397 · doi:10.1073/pnas.1820003116
[33] Mengersen, K. L.; Tweedie, R. L., Rates of convergence of the Hastings and Metropolis algorithms, Ann. Stat., 24, 1, 101-121 (1996) · Zbl 0854.60065 · doi:10.1214/aos/1033066201
[34] Metafune, G., L^p-spectrum of Ornstein-Uhlenbeck operators, Ann. Sc. Norm. Super. Pisa, Classe Sci., 30, 1, 97-124 (2001) · Zbl 1065.35216
[35] Mitter, S. K.; Newton, N. J., A variational approach to nonlinear estimation, SIAM J. Control Optim., 42, 5, 1813-1833 (2003) · Zbl 1049.93082 · doi:10.1137/s0363012901393894
[36] Monmarché, P., Hypocoercivity in metastable settings and kinetic simulated annealing, Probab. Theory Relat. Fields, 172, 3, 1215-1248 (2018) · Zbl 1404.60120 · doi:10.1007/s00440-018-0828-y
[37] Monmarché, P.; Fournier, N.; Tardif, C., Simulated annealing in \(\mathbb{R}^d\) with slowly growing potentials, Stochastic Process. Appl., 131, 276-291 (2021) · doi:10.1016/j.spa.2020.09.014
[38] Neal, R. M., Bayesian Learning for Neural Networks (2012), Springer, New York
[39] Nelson, E., Dynamical Theories of Brownian Motion (1967), Princeton University Press · Zbl 0165.58502
[40] Neureither, L.; Hartmann, C., Time scales and exponential trends to equilibrium: Gaussian model problems, in: Giacomin, G.; Olla, S.; Saada, E.; Spohn, H.; Stoltz, G. (eds.), Stochastic Dynamics Out of Equilibrium, 391-410 (2019), Springer · Zbl 1442.82024
[41] Specifically, we assume that \(\lim_{|z| \to \infty} v_t(z) \eta_t(z) = 0\), \(\lim_{|z| \to \infty} \overline{v}_t(z) \eta_t(z) = 0\), and \(\lim_{|z| \to \infty} \overline{v}_t(z) \eta_t(z) \log \left(\frac{\eta_t(z)}{\rho_t(z)}\right) = 0\).
[42] Petersen, K. B.; Pedersen, M. S., The Matrix Cookbook, Version 20121115 (2012)
[43] Pinnau, R.; Totzeck, C.; Tse, O.; Martin, S., A consensus-based model for global optimization and its mean-field limit, Math. Models Methods Appl. Sci., 27, 1, 183-204 (2017) · Zbl 1388.90098 · doi:10.1142/S0218202517400061
[44] Reich, S., Data assimilation: The Schrödinger perspective, Acta Numer., 28, 635-711 (2019) · Zbl 1437.62350 · doi:10.1017/S0962492919000011
[45] Rey-Bellet, L.; Spiliopoulos, K., Irreversible Langevin samplers and variance reduction: A large deviations approach, Nonlinearity, 28, 7, 2081 (2015) · Zbl 1338.60086 · doi:10.1088/0951-7715/28/7/2081
[46] Robert, C. P.; Elvira, V.; Tawn, N.; Wu, C., Accelerating MCMC algorithms, WIREs Comput. Stat., 10, 5, e1435 (2018) · doi:10.1002/wics.1435
[47] Rousset, M.; Stoltz, G.; Lelièvre, T., Free Energy Computations: A Mathematical Perspective (2010), World Scientific · Zbl 1227.82002
[48] Sanz Serna, J. M.; Zygalakis, K. C., The connections between Lyapunov functions for some optimization algorithms and differential equations, SIAM J. Numer. Anal., 59, 3, 1542-1565 (2021) · Zbl 1467.65070
[49] Sharma, U., Coarse-graining of Fokker-Planck equations (2017), Department of Mathematics and Computer Science, Technische Universiteit Eindhoven
[50] Tordoff, B.; Murray, D. W., Guided sampling and consensus for motion estimation, in: Heyden, A.; Sparr, G.; Nielsen, M.; Johansen, P. (eds.), Computer Vision—ECCV 2002, 82-96 (2002), Springer, Berlin, Heidelberg · Zbl 1034.68684
[51] Vanbiervliet, J.; Vandereycken, B.; Michiels, W.; Vandewalle, S.; Diehl, M., The smoothed spectral abscissa for robust stability optimization, SIAM J. Optim., 20, 1, 156-171 (2009) · Zbl 1185.93110 · doi:10.1137/070704034
[52] Villani, C., Hypocoercivity (2009), AMS, Providence, RI · Zbl 1197.35004
[53] Wimmer, H. K., Roth’s theorems for matrix equations with symmetry constraints, Linear Algebra Appl., 199, 357-362 (1994) · Zbl 0796.15014
[54] Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O., Understanding deep learning requires rethinking generalization
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.