
ADMM-Softmax: an ADMM approach for multinomial logistic regression. (English) Zbl 07192774

Summary: We present ADMM-Softmax, an alternating direction method of multipliers (ADMM) for solving multinomial logistic regression (MLR) problems. Our method is geared toward supervised classification tasks with many examples and features. It decouples the nonlinear optimization problem in MLR into three steps that can be solved efficiently. In particular, each iteration of ADMM-Softmax consists of a linear least-squares problem, a set of independent small-scale smooth, convex problems, and a trivial dual variable update. The solution of the least-squares problem can be accelerated by precomputing a factorization or preconditioner, and the separability of the smooth, convex problems allows easy parallelization across examples. For two image classification problems, we demonstrate that ADMM-Softmax leads to improved generalization compared to a Newton-Krylov method, a quasi-Newton method, and stochastic gradient descent.
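The three-step iteration described in the summary can be made concrete in a short sketch. The Python/NumPy code below is a minimal illustration, assuming the standard scaled-form ADMM splitting Z = DW for a data matrix D and one-hot label matrix C with a cross-entropy/softmax objective; the variable names (W, Z, U, rho, alpha), the plain Tikhonov regularizer, and the per-example Newton inner solver are illustrative assumptions, not the authors' implementation (which is provided as a MATLAB package).

```python
# A minimal sketch of one ADMM-Softmax iteration, assuming the splitting
# Z = D @ W with scaled dual variable U. All names (D, C, W, Z, U, rho,
# alpha) and the inner Newton solver are illustrative choices, not the
# authors' code; the regularizer here is plain Tikhonov (alpha * I).
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def admm_softmax_step(D, C, W, Z, U, rho=1.0, alpha=1e-3, newton_iters=5):
    """D: n-by-f data, C: n-by-c one-hot labels; returns updated (W, Z, U)."""
    n, f = D.shape
    c = C.shape[1]

    # Step 1: linear least-squares update of W. The coefficient matrix is
    # iteration-independent, so a factorization or preconditioner could be
    # computed once and reused, as noted in the summary.
    A = rho * D.T @ D + alpha * np.eye(f)
    W = np.linalg.solve(A, rho * D.T @ (Z - U))

    # Step 2: one small smooth, strongly convex problem per example,
    # independent across rows and therefore easy to parallelize.
    V = D @ W + U
    for i in range(n):
        z = Z[i].copy()  # warm start from the previous iterate
        for _ in range(newton_iters):
            p = softmax(z)
            grad = p - C[i] + rho * (z - V[i])  # cross-entropy + proximal term
            H = np.diag(p) - np.outer(p, p) + rho * np.eye(c)
            z -= np.linalg.solve(H, grad)
        Z[i] = z

    # Step 3: trivial (scaled) dual variable update.
    U = U + D @ W - Z
    return W, Z, U
```

In a full implementation, the matrix A would be factorized (e.g., by Cholesky) once before the ADMM loop rather than re-solved each iteration, and the row-wise loop in step 2 is embarrassingly parallel across examples, matching the parallelization noted in the summary.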

MSC:

65J22 Numerical solution to inverse problems in abstract spaces
90C25 Convex programming
49M27 Decomposition methods
