
Majorization-minimization algorithms for nonsmoothly penalized objective functions. (English) Zbl 1267.65009

Summary: The use of penalization, or regularization, has become common in high-dimensional statistical analysis, where an increasingly frequent goal is to simultaneously select important variables and estimate their effects. It has been shown by several authors that these goals can be achieved by minimizing some parameter-dependent “goodness-of-fit” function (e.g., a negative log-likelihood) subject to a penalization that promotes sparsity. Penalty functions that are singular at the origin have received substantial attention, arguably beginning with the Lasso penalty. The current literature tends to focus on specific combinations of differentiable goodness-of-fit functions and penalty functions singular at the origin. One result of this combined specificity has been a proliferation of computational algorithms, each designed to solve a fairly narrow class of optimization problems whose objective functions are not everywhere continuously differentiable.
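In generic notation (added here for clarity, not quoted from the paper), the class of problems described above takes the form

\[
\hat{\beta} \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p}\; g(\beta) + \sum_{j=1}^{p} p_{\lambda}(|\beta_j|),
\]

where \(g\) is a differentiable goodness-of-fit function (e.g., a negative log-likelihood) and \(p_{\lambda}\) is a sparsity-promoting penalty that is singular at the origin; the Lasso corresponds to \(p_{\lambda}(t) = \lambda t\).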
In this paper, we propose a general class of algorithms for optimizing an extensive variety of nonsmoothly penalized objective functions that satisfy certain regularity conditions. The proposed framework uses the majorization-minimization (MM) algorithm as its core optimization engine. In the case of penalized regression models, the resulting algorithms employ iterated soft-thresholding, implemented componentwise, allowing for fast and stable updating that avoids the need to invert high-dimensional matrices. We establish convergence theory under weaker assumptions than previously considered in the statistical literature. We also demonstrate the exceptional effectiveness of new acceleration methods, originally proposed for the expectation-maximization (EM) algorithm, in this class of problems. Simulation results and a microarray data example are provided to demonstrate the algorithm’s capabilities and versatility.
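To illustrate the kind of update the summary describes (a sketch under simplifying assumptions, not the authors' exact algorithm), the following Python code applies a quadratic-majorization MM step to Lasso-penalized least squares: the smooth loss is majorized by a quadratic whose curvature bound is L, and minimizing the surrogate reduces to a componentwise soft-threshold, so no high-dimensional matrix needs to be inverted. The function names and the choice of squared-error loss are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft-thresholding: S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def mm_lasso(X, y, lam, max_iter=1000, tol=1e-8):
    """MM iteration for (1/2)||y - X b||^2 + lam * ||b||_1 (illustrative sketch).

    Each iteration majorizes the smooth loss at the current iterate by a
    quadratic whose curvature L bounds the largest eigenvalue of X'X; the
    surrogate is then minimized exactly by a componentwise soft-threshold.
    """
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2          # squared largest singular value >= eigmax(X'X)
    beta = np.zeros(p)
    for _ in range(max_iter):
        grad = X.T @ (X @ beta - y)        # gradient of the smooth part at beta
        beta_new = soft_threshold(beta - grad / L, lam / L)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Because each iteration is a simple fixed-point map, such an update can also be wrapped in an off-the-shelf extrapolation scheme of the SQUAREM type (Varadhan and Roland, 2008), in the spirit of the EM-acceleration methods the summary refers to.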

MSC:

65C60 Computational problems in statistics (MSC2010)
62J07 Ridge regression; shrinkage estimators (Lasso)
62J05 Linear regression; mixed models
62J12 Generalized linear models (logistic models)

References:

[1] Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, J., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O. and Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling., Nature 403 503-511.
[2] Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. (1993)., Statistical Models Based on Counting Processes. Springer-Verlag, New York. · Zbl 0769.62061
[3] Ando, T., Suguro, M., Kobayashi, T., Seto, M. and Honda, H. (2003). Multiple fuzzy neural network system for outcome prediction and classification of 220 lymphoma patients on the basis of molecular profiling., Cancer Sci. 94 906-913.
[4] Annest, A., Bumgarner, R., Raftery, A. and Yeung, K. Y. (2009). Iterative Bayesian Model Averaging: a method for the application of survival analysis to high-dimensional microarray data., BMC Bioinformatics 10 72.
[5] Antoniadis, A., Gijbels, I. and Nikolova, M. (2009). Penalized likelihood regression for generalized linear models with nonquadratic penalties., Ann. Inst. Statist. Math. 1-31. · Zbl 1333.62113 · doi:10.1007/s10463-009-0242-4
[6] Becker, M. P., Yang, I. and Lange, K. (1997). EM algorithms without missing data., Stat. Methods Med. Res. 6 38-54.
[7] Binder, H. and Schumacher, M. (2008). Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models., BMC Bioinformatics 9 14. · Zbl 1276.62060 · doi:10.2202/1544-6115.1346
[8] Binder, H. and Schumacher, M. (2009). Incorporating pathway information into boosting estimation of high-dimensional risk prediction models., BMC Bioinformatics 10 18.
[9] Böhning, D. and Lindsay, B. (1988). Monotonicity of Quadratic-Approximation Algorithms., Ann. Inst. Statist. Math. 40 641-663. · Zbl 0723.65150 · doi:10.1007/BF00049423
[10] Boyd, S. and Vandenberghe, L. (2004)., Convex Optimization. Cambridge University Press. · Zbl 1058.90049
[11] Chrétien, S. and Hero, A. O. (2008). On EM algorithms and their proximal generalizations., ESAIM Journ. on Probability and Statistics 12 308-326. · Zbl 1187.65006 · doi:10.1051/ps:2007041
[12] Clarke, F. H. (1990)., Optimization and Nonsmooth Analysis. SIAM, Philadelphia. · Zbl 0696.49002
[13] Combettes, P. L. and Wajs, V. R. (2005). Signal Recovery by Proximal Forward-Backward Splitting., Multiscale Model. Simul. 4 1168-1200. · Zbl 1179.94031 · doi:10.1137/050626090
[14] Cox, D. R. (1972). Regression models and life-tables (with Discussion)., J. Roy. Statist. Soc. Ser. B 34 187-220. · Zbl 0243.62041
[15] Cox, D. R. (1975). Partial likelihood., Biometrika 62 269-276. · Zbl 0312.62002 · doi:10.1093/biomet/62.2.269
[16] Daubechies, I., Defrise, M. and De Mol, C. (2004). An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint., Commun. Pure Appl. Math. 57 1413-1457. · Zbl 1077.65055 · doi:10.1002/cpa.20042
[17] De Mol, C., De Vito, E. and Rosasco, L. (2008). Elastic-Net Regularization in Learning Theory., arXiv . · Zbl 1319.62087
[18] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion)., J. Roy. Statist. Soc. Ser. B 39 1-38. · Zbl 0364.62022
[19] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression., Ann. Statist. 32 407-452. · Zbl 1091.62054 · doi:10.1214/009053604000000067
[20] Engler, D. and Li, Y. (2009). Survival analysis with high-dimensional covariates: an application in microarray studies., Statist. Appl. Gen. Mol. Biol. 8 Article 14. · Zbl 1276.62067 · doi:10.2202/1544-6115.1423
[21] Fan, J. and Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties., J. Amer. Statist. Assoc. 96 1348-1360. · Zbl 1073.62547 · doi:10.1198/016214501753382273
[22] Fan, J., Feng, Y., Samworth, R. and Wu, Y. (2009a). SIS: Sure Independence Screening. R package version 0.2.
[23] Fan, J., Samworth, R., and Wu, Y. (2009b). Ultrahigh dimensional variable selection: beyond the linear model., Journal of Machine Learning Research 10 1829-1853. · Zbl 1235.62089
[24] Friedman, J., Hastie, T. and Tibshirani, R. (2008). Regularization Paths for Generalized Linear Models via Coordinate Descent. Technical Report, Dept. of Statistics, Stanford University.
[25] Fygenson, M. and Ritov, Y. (1994). Monotone estimating equations for censored data., Ann. Statist. 22 732-746. · Zbl 0807.62032 · doi:10.1214/aos/1176325493
[26] Geman, D. and Reynolds, G. (1992). Constrained restoration and the recovery of discontinuities., IEEE Trans. Patt. Anal. Mach. Intell. 14 367-383.
[27] Gui, J. and Li, H. (2005a). Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data., Bioinformatics 21 3001-3008.
[28] Gui, J. and Li, H. (2005b). Threshold Gradient Descent Method for Censored Data Regression with Applications in Pharmacogenomics., Pacific Symposium on Biocomputing 10 272-283.
[29] Hale, E. T., Yin, W. and Zhang, Y. (2008). Fixed-point Continuation for l1-minimization: Methodology and Convergence., SIAM J. Optim. 19 1107-1130. · Zbl 1180.65076 · doi:10.1137/070698920
[30] Hastie, T. and Efron, B. (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-7.
[31] Hiriart-Urruty, J. B. and Lemaréchal, C. (1996)., Convex Analysis and Minimization Algorithms I: Fundamentals. Springer.
[32] Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms., Amer. Statist. 58 30-37. · Zbl 05680564 · doi:10.1198/0003130042836
[33] Hunter, D. and Li, R. (2005). Variable Selection Using MM Algorithms., Ann. Statist. 33 1617-1643. · Zbl 1078.62028 · doi:10.1214/009053605000000200
[34] Johnson, B. A., Lin, D.-Y. and Zeng, D. (2008). Penalized estimating functions and variable selection in semiparametric regression problems., J. Amer. Statist. Assoc. 103 672-680. · Zbl 1471.62330 · doi:10.1198/016214508000000184
[35] Johnson, L. M. and Strawderman, R. L. (2009). Induced smoothing for the semiparametric accelerated failure time model: asymptotics and extensions to clustered data., Biometrika 96 577-590. · Zbl 1170.62069 · doi:10.1093/biomet/asp025
[36] Kim, Y., Choi, H. and Oh, H.-S. (2008). Smoothly Clipped Absolute Deviation on High Dimensions., J. Amer. Statist. Assoc. 103 1665-1673. · Zbl 1286.62062 · doi:10.1198/016214508000001066
[37] Lange, K. (1995). A Gradient Algorithm Locally Equivalent to the EM Algorithm., J. Roy. Statist. Soc. Ser. B 57 425-437. · Zbl 0813.62021
[38] Lange, K. (2004)., Optimization. Springer, New York, USA. · Zbl 1140.90004
[39] Lange, K., Hunter, D. and Yang, I. (2000). Optimization Transfer Using Surrogate Objective Functions., J. of Comp. Graph. Statist. 9 1-20.
[40] Li, H. and Gui, J. (2004). Partial Cox regression analysis for high-dimensional microarray gene expression data., Bioinformatics 20 Suppl. 1, i208-215.
[41] Li, H. and Luan, Y. (2005). Boosting proportional hazards models using smoothing splines, with applications to high-dimensional microarray data., Bioinformatics 21 2403-2409.
[42] Luenberger, D. G. and Ye, Y. (2008)., Linear and Nonlinear Programming. Third ed. International Series in Operations Research & Management Science, 116. Springer, New York. · Zbl 1207.90003
[43] Ma, S. and Huang, J. (2007). Additive risk survival model with microarray data., BMC Bioinformatics 8 192.
[44] Martinez-Climent, J. A., Alizadeh, A. A., Segraves, R., Blesa, D., Rubio-Moscardo, F., Albertson, D. G., Garcia-Conde, J., Dyer, M. J., Levy, R., Pinkel, D. and Lossos, I. S. (2003). Transformation of follicular lymphoma to diffuse large cell lymphoma is associated with a heterogeneous set of DNA copy number and gene expression alterations., Blood 101 3109-3117.
[45] Mazumder, R., Friedman, J. and Hastie, T. (2009). SparseNet: Coordinate Descent with Non-Convex Penalties. Technical Report, Department of Statistics, Stanford University. · Zbl 1229.62091
[46] McLachlan, G. J. and Krishnan, T. (2008)., The EM Algorithm and Extensions. 2nd ed. Wiley-Interscience. · Zbl 1165.62019
[47] Meyer, R. R. (1976). Sufficient conditions for the convergence of monotonic mathematical programming algorithms., J. Comput. System. Sci. 12 108-121. · Zbl 0337.65037 · doi:10.1016/S0022-0000(76)80021-9
[48] Meyer, R. R. (1977). A comparison of the forcing function and point-to-set mapping approaches to convergence analysis., SIAM J. Control Optimization 15 699-715. · Zbl 0363.90099 · doi:10.1137/0315045
[49] Nikolova, M. (2000). Local strong homogeneity of a regularized estimator., SIAM J. Appl. Math. 61 633-658. · Zbl 0991.94015 · doi:10.1137/S0036139997327794
[50] Ortega, J. M. and Rheinboldt, W. C. (1970)., Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York. · Zbl 0241.65046
[51] Park, M. Y. and Hastie, T. (2007). L1-regularization path algorithm for generalized linear models., J. Roy. Statist. Soc. Ser. B 69 659-677. · doi:10.1111/j.1467-9868.2007.00607.x
[52] Polak, E. (1987). On the mathematical foundations of nondifferentiable optimization in engineering design., SIAM Rev. 29 21-89. · doi:10.1137/1029002
[53] Roland, C. and Varadhan, R. (2005). New iterative schemes for nonlinear fixed point problems, with applications to problems with bifurcations and incomplete-data problems., Appl. Numer. Math. 55 215-226. · Zbl 1119.65335 · doi:10.1016/j.apnum.2005.02.006
[54] Rosenwald, A., Wright, G., Chan, W. C., Connors, J. M., Campo, E., Fisher, R. I., Gascoyne, R. D., Muller-Hermelink, H. K., Smeland, E. B., Giltnane, J. M., Hurt, E. M., Zhao, H., Averett, L., Yang, L., Wilson, W. H., Jaffe, E. S., Simon, R., Klausner, R. D., Powell, J., Duffey, P. L., Longo, D. L., Greiner, T. C., Weisenburger, D. D., Sanger, W. G., Dave, B. J., Lynch, J. C., Vose, J., Armitage, J. O., Montserrat, E., López-Guillermo, A., Grogan, T. M., Miller, T. P., LeBlanc, M., Ott, G., Kvaloy, S., Delabie, J., Holte, H., Krajci, P., Stokke, T. and Staudt, L. M. (2002). The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma., N. Engl. J. Med. 346 1937-1947.
[55] Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths., Ann. Statist. 35 1012-1030. · Zbl 1194.62094 · doi:10.1214/009053606000001370
[56] Saeki, K., Miura, Y., Aki, D., Kurosaki, T. and Yoshimura, A. (2003). The B cell-specific major raft protein, Raftlin, is necessary for the integrity of lipid raft and BCR signal transduction., EMBO J. 22 3015-3026.
[57] Schifano, E. D. (2010). Topics in Penalized Estimation. PhD thesis, Cornell University.
[58] Sha, N., Tadesse, M. G. and Vannucci, M. (2006). Bayesian variable selection for the analysis of microarray data with censored outcomes., Bioinformatics 22 2262-2268.
[59] She, Y. (2009). Thresholding-based Iterative Selection Procedures for Model Selection and Shrinkage., Electronic Journal of Statistics 3 384-415. · Zbl 1326.62158 · doi:10.1214/08-EJS348
[60] Sinisi, S. E., Neugebauer, R. and van der Laan, M. J. (2006). Cross-validated bagged prediction of survival., Statist. Appl. Gen. Mol. Biol. 5 Article 12. · Zbl 1166.62326 · doi:10.2202/1544-6115.1180
[61] Sohn, I., Kim, J., Jung, S. H. and Park, C. (2009). Gradient lasso for Cox proportional hazards model., Bioinformatics 25 1775-1781.
[62] Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso., J. Roy. Statist. Soc. Ser. B 58 267-288. · Zbl 0850.62538
[63] Tibshirani, R. J. (2009). Univariate shrinkage in the Cox model for high dimensional data., Statist. Appl. Gen. Mol. Biol. 8 Article 21. · Zbl 1276.62096 · doi:10.2202/1544-6115.1438
[64] Tseng, P. (2004). An analysis of the EM algorithm and entropy-like proximal point methods., Mathematics of Operations Research 29 27-44. · Zbl 1082.90092 · doi:10.1287/moor.1030.0073
[65] Tseng, P. and Yun, S. (2009). A coordinate gradient descent method for nonsmooth separable minimization., Mathematical Programming B 117 387-423. · Zbl 1166.90016 · doi:10.1007/s10107-007-0170-0
[66] Vaida, F. (2005). Parameter convergence for EM and MM algorithms., Statist. Sinica 15 831-840. · Zbl 1087.62035
[67] Varadhan, R. and Roland, C. (2008). Simple and Globally Convergent Methods for Accelerating the Convergence of Any EM Algorithm., Scand. J. Statist. 35 335-353. · Zbl 1164.65006 · doi:10.1111/j.1467-9469.2007.00585.x
[68] Wu, C. F. J. (1983). On the convergence properties of the EM algorithm., Ann. Statist. 11 95-103. · Zbl 0517.62035 · doi:10.1214/aos/1176346060
[69] Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression., Ann. Appl. Stat. 2 224-244. · Zbl 1137.62045 · doi:10.1214/07-AOAS174SUPP
[70] Zangwill, W. I. (1969)., Nonlinear Programming; a Unified Approach. Prentice-Hall International Series in Management, Englewood Cliffs, N.J. · Zbl 0195.20804
[71] Zhang, C.-H. (2008). Discussion of “One-step sparse estimates in nonconcave penalized likelihood models” by H. Zou and R. Li., Ann. Statist. 36 1553-1560. · Zbl 1282.62110 · doi:10.1214/07-AOS0316C
[72] Zhang, C.-H. (2010). Nearly Unbiased Variable Selection Under Minimax Concave Penalty., Ann. Statist. 38 894-942. · Zbl 1183.62120 · doi:10.1214/09-AOS729
[73] Zhang, D. and Zhang, M. (2007). Bayesian profiling of molecular signatures to predict event times., Theor. Biol. Med. Model. 4 3.
[74] Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties., J. Amer. Statist. Assoc. 101 1418-1429. · Zbl 1171.62326 · doi:10.1198/016214506000000735
[75] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net., J. Roy. Statist. Soc. Ser. B 67 301-320. · Zbl 1069.62054 · doi:10.1111/j.1467-9868.2005.00503.x
[76] Zou, H. and Hastie, T. (2008). elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA. R package version 1.0-5.
[77] Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models., Ann. Statist. 36 1509-1533. · Zbl 1142.62027 · doi:10.1214/009053607000000802
[78] Zou, H. and Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters., Ann. Statist. 37 1733-1751. · Zbl 1168.62064 · doi:10.1214/08-AOS625
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.