
Maximum principle based algorithms for deep learning. (English) Zbl 1467.68156

Summary: The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms. Training is recast as a control problem, which allows us to formulate necessary optimality conditions in continuous time using Pontryagin's maximum principle (PMP). A modification of the method of successive approximations is then used to solve the PMP, giving rise to an alternative training algorithm for deep learning. This approach has the advantage that rigorous error estimates and convergence results can be established. We also show that it may avoid some pitfalls of gradient-based methods, such as slow convergence on flat landscapes near saddle points. Furthermore, we demonstrate that it attains a favorable initial per-iteration convergence rate, provided the Hamiltonian maximization can be carried out efficiently, a step which is still in need of improvement. Overall, the approach opens up new avenues for attacking problems associated with deep learning, such as trapping in slow manifolds and the inapplicability of gradient-based methods to discrete trainable variables.
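
To make the training procedure concrete, the following is a minimal sketch of the basic method of successive approximations (MSA) on a toy problem, written in Python with NumPy. It is not the error-controlled extended MSA developed in the paper; the scalar residual-style dynamics, terminal loss, parameter bound and iteration count are illustrative assumptions, chosen so that the Hamiltonian maximization step has a closed form.

    # Basic MSA sketch (forward pass, adjoint pass, layer-wise Hamiltonian
    # maximization) for a toy scalar "network" x_{t+1} = x_t + theta_t*tanh(x_t)
    # with terminal loss Phi(x_T) = (x_T - y)^2 and |theta_t| <= theta_max.
    # Note: the basic MSA may oscillate or diverge; the paper's extended MSA
    # augments the Hamiltonian with penalty terms to control this.
    import numpy as np

    T = 10            # number of layers / time steps
    theta_max = 0.5   # box constraint on each trainable parameter
    x0, y = 0.3, 1.0  # input and target

    def f(x, th):          # layer map ("dynamics")
        return x + th * np.tanh(x)

    def df_dx(x, th):      # derivative of the layer map w.r.t. the state
        return 1.0 + th * (1.0 - np.tanh(x) ** 2)

    def phi(xT):           # terminal loss
        return (xT - y) ** 2

    theta = np.zeros(T)    # initial guess for the controls/weights
    for it in range(50):
        # 1) forward pass: propagate the state through all layers
        x = np.empty(T + 1); x[0] = x0
        for t in range(T):
            x[t + 1] = f(x[t], theta[t])
        # 2) backward pass: co-state from the discrete adjoint equation,
        #    p_T = -dPhi/dx, p_t = p_{t+1} * df/dx
        p = np.empty(T + 1)
        p[T] = -2.0 * (x[T] - y)
        for t in reversed(range(T)):
            p[t] = p[t + 1] * df_dx(x[t], theta[t])
        # 3) maximize the Hamiltonian H_t = p_{t+1} * f(x_t, theta) layer by
        #    layer; H is linear in theta here, so the maximizer lies on the
        #    boundary of the box constraint
        for t in range(T):
            theta[t] = theta_max * np.sign(p[t + 1] * np.tanh(x[t]))

    # final forward pass with the last iterate of theta
    xT = x0
    for t in range(T):
        xT = f(xT, theta[t])
    print("terminal loss:", phi(xT))

In this sketch the Hamiltonian maximization replaces the gradient step of back-propagation: because the update is an argmax over the admissible parameter set rather than a move along a gradient, it remains meaningful for discrete or bounded trainable variables, which is the point emphasized in the summary above.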

MSC:

68T07 Artificial neural networks and deep learning
49N90 Applications of optimal control and differential games
62M45 Neural nets and related approaches to inference from stochastic processes
Full Text: arXiv
