Stochastic primal-dual coordinate method for regularized empirical risk minimization.

*(English)*Zbl 1440.62314Summary: We consider a generic convex optimization problem associated with regularized empirical risk minimization of linear predictors. The problem structure allows us to reformulate it as a convex-concave saddle point problem. We propose a stochastic primal-dual coordinate (SPDC) method, which alternates between maximizing over a randomly chosen dual variable and minimizing over the primal variables. An extrapolation step on the primal variables is performed to obtain accelerated convergence rate. We also develop a mini-batch version of the SPDC method which facilitates parallel computing, and an extension with weighted sampling probabilities on the dual variables, which has a better complexity than uniform sampling on unnormalized data. Both theoretically and empirically, we show that the SPDC method has comparable or better performance than several state-of-the-art optimization methods.

##### MSC:

62L20 | Stochastic approximation |

62C20 | Minimax procedures in statistical decision theory |

90C15 | Stochastic programming |

47N10 | Applications of operator theory in optimization, convex analysis, mathematical programming, economics |

##### Keywords:

empirical risk minimization; randomized algorithms; convex-concave saddle point problems; primal-dual algorithms; computational complexity##### Software:

ElemStatLearn##### References:

[1] | Alekh Agarwal and L´eon Bottou. A lower bound for the optimization of finite sums. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 78–86, Lille, France, 2015. |

[2] | Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1200–1205, Montreal, Canada, June 2017. · Zbl 1369.68273 |

[3] | Amir Beck and Marc Teboulle. A fast iterative shrinkage-threshold algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009. · Zbl 1175.94009 |

[4] | Dimitri P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, Ser. B, 129:163–195, 2011. · Zbl 1229.90121 |

[5] | Dimitri P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, chapter 4. The MIT Press, 2012. |

[6] | Doron Blatt, Alfred Hero, and Hillel Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007. · Zbl 1154.90015 |

[7] | L´eon Bottou.Large-scale machine learning with stochastic gradient descent.In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), pages 177–187, Paris, France, August 2010. Springer. |

[8] | L´eon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 161–168. MIT Press, Cambridge, MA, 2008. |

[9] | Olivier Bousquet and Andr´e Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002. · Zbl 1007.68083 |

[10] | Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2010. · Zbl 1229.90122 |

[11] | Antonin Chambolle and Thomas Pock.A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40 (1):120–145, 2011. · Zbl 1255.68217 |

[12] | Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. Coordinate descent method for largescale l2-loss linear support vector machines. Journal of Machine Learning Research, 9: 1369–1398, 2008. · Zbl 1225.68157 |

[13] | Min-Te Chao. A general purpose unequal probability sampling plan. Biometrika, 69(3): 653–656, 1982. 38 · Zbl 0512.62018 |

[14] | Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654. 2014. |

[15] | John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2898, 2009. · Zbl 1235.62151 |

[16] | Rong-En Fan and Chih-Jen Lin. LIBSVM data: Classification, regression and multi-label. URL: http://www.csie.ntu.edu.tw/˜cjlin/libsvmtools/datasets, 2011. |

[17] | Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of The 32nd International Conference on Machine Learning (ICML), pages 2540– 2548. 2015. |

[18] | Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2nd edition, 2009. · Zbl 1273.62005 |

[19] | Jean-Baptiste Hiriart-Urruty and Claude Lemar´echal. Fundamentals of Convex Analysis. Springer, 2001. · Zbl 0998.49001 |

[20] | Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 408–415, 2008. |

[21] | Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323. 2013. |

[22] | Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Technical report, Department of Industrial and System Engineering, University of Florida, July 2015. · Zbl 1432.90115 |

[23] | John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009. · Zbl 1235.68167 |

[24] | Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2672–2680. 2012. |

[25] | Hongzhou Lin, Julien Mairal, and Zaid Harchaoui.A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28, pages 3384– 3392. 2015a. |

[26] | Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244–2273, 2015b. · Zbl 1329.65127 |

[27] | P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, December 1979. 39 · Zbl 0426.65050 |

[28] | Angelia Nedi´c and Dimitri P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109–138, 2001. · Zbl 0991.90099 |

[29] | Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 155 (1-2):549–573, 2016. · Zbl 1333.65070 |

[30] | Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro.Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. · Zbl 1189.90109 |

[31] | Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004. · Zbl 1086.90045 |

[32] | Yurii Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103:127–152, 2005. · Zbl 1079.90102 |

[33] | Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012. · Zbl 1257.90073 |

[34] | Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, Ser. B, 140:125–161, 2013. · Zbl 1287.90067 |

[35] | Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006. · Zbl 1104.65059 |

[36] | Hua Ouyang, Niao He, Long Tran, and Alexander Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 2013. |

[37] | John Platt. Fast training of support vector machine using sequential minimal optimization. In B. Sch¨olkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 185–208. MIT Press, Cambridge, MA, USA, 1999. |

[38] | Boris T. Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30:838–855, 1992. · Zbl 0762.62022 |

[39] | Zheng Qu, Peter Richt´arik, and Tong Zhang. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems 28, pages 865–873. 2015. |

[40] | Peter Richt´arik and Martin Tak´aˇc. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144 (1):1–38, 2014. · Zbl 1301.65051 |

[41] | Peter Richt´arik and Martin Tak´aˇc. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1):433–484, 2016. · Zbl 1342.90102 |

[42] | Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Technical Report HAL 00860051, INRIA, Paris, France, 2013. 40 · Zbl 1417.90099 |

[43] | Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013a. · Zbl 1307.68073 |

[44] | Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 378–385. 2013b. · Zbl 1307.68073 |

[45] | Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1):105–145, 2015. · Zbl 1342.90103 |

[46] | Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 392–400, Atlanta, GA, USA, 2013. |

[47] | Taiji Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 736–744, Beijing, 2014. |

[48] | Martin Tak´aˇc, Avleen Bijral, Peter Richt´arik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013. |

[49] | Paul Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsiz rule. SIAM Journal on Optimization, 8(2):506–531, 1998. · Zbl 0922.90131 |

[50] | Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. Unpublished manuscript, 2008. |

[51] | Huahua Wang and Arindam Banerjee. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1119–1126, Edinburgh, Scotland, UK, 2012. |

[52] | Blake Woodworth and Nathan Srebro. Tight complexity bounds for optimizing composite objectives. arXiv:1605.08003, 2016. |

[53] | Stephen J. Wright. Coordinate descent algorithms. Mathematical Programming, Series B, 151(1):3–34, 2015. · Zbl 1317.49038 |

[54] | Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2534–2596, 2010. · Zbl 1242.62011 |

[55] | Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014. · Zbl 1321.65016 |

[56] | Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, pages 629–637. 2013. 41 |

[57] | Adams Wei Yu, Qihang Lin, and Tianbao Yang.Double stochastic primal-dual coordinate method for regularized empirical risk minimization with factorized data. arXiv:1508.03390, 2015. |

[58] | Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning (ICML), pages 116–123, Banff, Alberta, Canada, 2004. |

[59] | Xiaoqun Zhang, Martin Burger, and Stanley Osher. A unifoed primal-dual algorithm framework based on Bregman iteration. Journal of Scientific Computing, 46(1):20–46, January 2011. · Zbl 1227.65052 |

[60] | Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of The 32nd International Conference on Machine Learning (ICML), pages 353–361. 2015. · Zbl 1440.62314 |

[61] | Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Proceedings, pages 1–9. JMLR.org, 2015. |

[62] | Leon Wenliang Zhong and James T. Kwok. Fast stochastic alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 46–54, Beijing, China, 2014. |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.