zbMATH — the first resource for mathematics

Nonconvex policy search using variational inequalities. (English) Zbl 07063522
Summary: Policy search is a class of reinforcement learning algorithms for finding optimal policies in control problems with limited feedback. These methods have been shown to be successful in high-dimensional problems such as robotics control. Though successful, current methods can lead to unsafe policy parameters that could potentially damage hardware units. Motivated by such constraints, projection-based methods for safe policies have been proposed. These methods, however, can handle only convex policy constraints. In this letter, we propose the first safe policy search reinforcement learner capable of operating under nonconvex policy constraints. This is achieved by observing, for the first time, a connection between nonconvex variational inequalities and policy search problems. We provide two algorithms, Mann and two-step iteration, to solve the above problems and prove convergence in the nonconvex stochastic setting. Finally, we demonstrate the performance of the algorithms on six benchmark dynamical systems and show that our new method outperforms previous methods under a variety of settings.
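The core of such projection-based schemes, a Mann-type averaged projected step applied to a variational inequality over a nonconvex but prox-regular set, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the operator F, the annulus constraint set, and all step sizes are assumptions chosen for the example, with the annulus standing in for a nonconvex policy-parameter set that admits a cheap projection.

```python
import numpy as np

def project_annulus(x, r_min, r_max):
    """Euclidean projection onto the nonconvex annulus {r_min <= ||x|| <= r_max}.

    The annulus is prox-regular, so the projection is single-valued near the set,
    which is the property nonconvex VI analyses typically rely on.
    """
    r = np.linalg.norm(x)
    if r < 1e-12:
        return np.array([r_min, 0.0])  # arbitrary nearest boundary point for the origin
    if r < r_min:
        return x * (r_min / r)
    if r > r_max:
        return x * (r_max / r)
    return x

def mann_iteration(F, project, x0, rho=0.1, alpha=0.5, n_iter=500):
    """Mann-type iteration for the VI: find x* in K with <F(x*), y - x*> >= 0 locally.

    Update: x_{k+1} = (1 - alpha) x_k + alpha * P_K(x_k - rho * F(x_k)),
    i.e. an averaged projected step (here with a constant averaging weight).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = (1.0 - alpha) * x + alpha * project(x - rho * F(x))
    return x

# Toy instance: F is the gradient of 0.5 * ||x - t||^2 with t outside the annulus,
# so the VI solution is the projection of t onto the annulus, i.e. (2, 0).
t = np.array([3.0, 0.0])
x_star = mann_iteration(lambda x: x - t,
                        lambda z: project_annulus(z, 1.0, 2.0),
                        x0=[0.5, 0.5])
print(x_star)  # ≈ [2.0, 0.0]
```

The two-step variant described in the letter interleaves an additional predictor projection per iteration; the structure of the update is otherwise the same.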
68 Computer science
Full Text: DOI
[1] Ammar, H., Tutunov, R., & Eaton, E. (2015). Safe policy search for lifelong reinforcement learning with sublinear regret. In Proceedings of the 32nd International Conference on Machine Learning (pp. 2361-2369). JMLR.org.
[2] Ansari, Q. H., & Balooee, J. (2013). Predictor-corrector methods for general regularized nonconvex variational inequalities. Journal of Optimization Theory and Applications, 159(2), 473-488. · Zbl 1280.49012
[3] Bertsekas, D. P. (2009). Projected equations, variational inequalities, and temporal difference methods (Lab. for Information and Decision Systems Report LIDS-P-2808). Cambridge, MA: MIT Press.
[4] Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., & Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45(11), 2471-2482. · Zbl 1183.93130
[5] Bou Ammar, H., Eaton, E., Luna, J. M., & Ruvolo, P. (2015). Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence. Cambridge, MA: AAAI Press.
[6] Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press. , · Zbl 1058.90049
[7] Clarke, F. H., Ledyaev, Y. S., Stern, R. J., & Wolenski, P. R. (2008). Nonsmooth analysis and control theory. New York: Springer Science & Business Media. · Zbl 1047.49500
[8] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121-2159. · Zbl 1280.68164
[9] Federer, H. (1959). Curvature measures. Transactions of the American Mathematical Society, 93(3), 418-491. · Zbl 0089.38402
[10] García, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16, 1437-1480. · Zbl 1351.68209
[11] Gordon, G. J. (2013). Galerkin methods for complementarity problems and variational inequalities. arXiv preprint:1306.4753
[12] Harker, P. T., & Pang, J.-S. (1990). Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications. Mathematical Programming, 48(1-3), 161-220. · Zbl 0734.90098
[13] Hendrix, E. M., & Boglárka, G. (2010). Introduction to nonlinear and global optimization. New York: Springer. · Zbl 1193.90001
[14] Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237-285.
[15] Khobotov, E. N. (1987). Modification of the extra-gradient method for solving variational inequalities and certain optimization problems. USSR Computational Mathematics and Mathematical Physics, 27(5), 120-127. · Zbl 0665.90078
[16] Kober, J., & Peters, J. R. (2009). Policy search for motor primitives in robotics. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 849-856). Cambridge, MA: MIT Press. · Zbl 1237.68229
[17] Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., & Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. Corvallis, OR: AUAI Press.
[18] Mahadevan, S., Liu, B., Thomas, P., Dabney, W., Giguere, S., Jacek, N., … Liu, J. (2014). Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. arXiv preprint:1405.6757
[19] Noor, M. A. (2000). New approximation schemes for general variational inequalities. Journal of Mathematical Analysis and Applications, 251(1), 217-229. · Zbl 0964.49007
[20] Noor, M. A. (2004). Some developments in general variational inequalities. Applied Mathematics and Computation, 152(1), 199-277. · Zbl 1134.49304
[21] Noor, M. A. (2009). Projection methods for nonconvex variational inequalities. Optimization Letters, 3(3), 411-418. · Zbl 1171.58307
[22] Noor, M. A., Al-Said, E., Noor, K. I., & Yao, Y. (2011). Extragradient methods for solving nonconvex variational inequalities. Journal of Computational and Applied Mathematics, 235(9), 3104-3108. · Zbl 1211.65082
[23] Peters, J., & Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7), 1180-1190.
[24] Poliquin, R., Rockafellar, R., & Thibault, L. (2000). Local differentiability of distance functions. Transactions of the American Mathematical Society, 352(11), 5231-5249. · Zbl 0960.49018
[25] Robbins, H., & Siegmund, D. (1985). A convergence theorem for nonnegative almost supermartingales and some applications. In Herbert Robbins selected papers (pp. 111-135). New York: Springer.
[26] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. · Zbl 1407.68009
[27] Thomas, P. S., Dabney, W. C., Giguere, S., & Mahadevan, S. (2013). Projected natural actor-critic. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26 (pp. 2337-2345). Red Hook, NY: Curran.
[28] Tobin, R. L. (1986). Sensitivity analysis for variational inequalities. Journal of Optimization Theory and Applications, 48(1), 191-204. · Zbl 0557.49004
[29] Torrey, L., & Taylor, M. (2013). Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems (pp. 1053-1060). N.p.: International Foundation for Autonomous Agents and Multiagent Systems.
[30] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256. · Zbl 0772.68076
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.