
Robust accelerated gradient methods for smooth strongly convex functions. (English) Zbl 1461.62145

Summary: We study the trade-offs between convergence rate and robustness to gradient errors in designing a first-order algorithm. We focus on gradient descent (GD) and accelerated gradient (AG) methods for minimizing strongly convex functions when the gradient has random errors in the form of additive white noise. With gradient errors, the function values of the iterates need not converge to the optimal value; hence, we define the robustness of an algorithm to noise as the ratio of the asymptotic expected suboptimality of the iterate sequence to the input noise power. For this robustness measure, we provide exact expressions for the quadratic case using tools from robust control theory and tight upper bounds for the smooth strongly convex case using Lyapunov functions certified through matrix inequalities. We use these characterizations within an optimization problem that selects parameters of each algorithm to achieve a particular trade-off between rate and robustness. Our results show that AG can achieve acceleration while being more robust to random gradient errors. This behavior is quite different from the behavior previously reported in the deterministic gradient noise setting. We also establish some connections between the robustness of an algorithm and how quickly it can converge back to the optimal solution if it is perturbed from the optimal point with deterministic noise. Our framework also leads to practical algorithms that can perform better than other state-of-the-art methods in the presence of random gradient noise.
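The following is a minimal numerical sketch of the setting the summary describes, not the paper's exact algorithms or optimized parameters: GD and a constant-momentum Nesterov-style AG method are run on a strongly convex quadratic f(x) = ½xᵀQx with additive white Gaussian gradient noise of power σ², and the long-run average suboptimality E[f(x_k) − f*] normalized by σ² is used as an empirical robustness proxy. The dimension, eigenvalue spread, step size α = 1/L, and momentum β = (√κ − 1)/(√κ + 1) are textbook illustration choices. For GD on a quadratic, the stationary covariance solves a discrete Lyapunov equation, which is consistent with the robust-control viewpoint mentioned above and yields a closed form used here as a sanity check.

```python
import numpy as np

# Illustrative sketch (assumed setup, not the paper's): noisy GD and AG on
# f(x) = 0.5 * x^T Q x, so f* = 0 at x* = 0. Robustness proxy: long-run
# average of f(x_k) - f* divided by the noise power sigma^2.

rng = np.random.default_rng(0)
d, mu, L = 20, 1.0, 100.0
eigs = np.linspace(mu, L, d)   # eigenvalues in [mu, L]: mu-strongly convex, L-smooth
Q = np.diag(eigs)

def run(beta, alpha, sigma, iters=50_000, burn_in=25_000):
    """One noisy first-order run; beta = 0 gives plain GD."""
    x = x_prev = rng.standard_normal(d)
    acc = 0.0
    for k in range(iters):
        y = x + beta * (x - x_prev)                 # momentum extrapolation
        g = Q @ y + sigma * rng.standard_normal(d)  # gradient + white noise
        x_prev, x = x, y - alpha * g
        if k >= burn_in:
            acc += 0.5 * x @ Q @ x                  # suboptimality f(x) - f*
    return acc / (iters - burn_in)

sigma, kappa = 0.1, L / mu
beta_ag = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)

gd = run(beta=0.0, alpha=1.0 / L, sigma=sigma)
ag = run(beta=beta_ag, alpha=1.0 / L, sigma=sigma)

# Exact value for GD: the stationary covariance S solves the discrete
# Lyapunov equation S = (I - a*Q) S (I - a*Q)^T + a^2 * sigma^2 * I,
# which gives E[f - f*] = 0.5 * sum_i a * sigma^2 / (2 - a * lambda_i).
a = 1.0 / L
gd_exact = 0.5 * np.sum(a * sigma**2 / (2.0 - a * eigs))

print(f"GD robustness (simulated): {gd / sigma**2:.4f}")
print(f"GD robustness (exact):     {gd_exact / sigma**2:.4f}")
print(f"AG robustness (simulated): {ag / sigma**2:.4f}")
```

With these illustrative parameters the simulated GD value should match the Lyapunov-equation closed form closely; comparing the GD and AG proxies while also timing their transient decay gives a rough feel for the rate-versus-robustness trade-off that the paper characterizes exactly.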

MSC:

62L20 Stochastic approximation
90C15 Stochastic programming
90C25 Convex programming
90C30 Nonlinear programming
93C05 Linear systems in control theory
93C10 Nonlinear systems in control theory

Software:

CVX
