zbMATH — the first resource for mathematics

Least angle regression. (With discussion). (English) Zbl 1091.62054
Summary: The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived:
(1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principle choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.

MSC:
 62J05 Linear regression; mixed models 62J07 Ridge regression; shrinkage estimators (Lasso) 65C60 Computational problems in statistics (MSC2010)
Software:
 [1] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees . Wadsworth, Belmont, CA. · Zbl 0541.62042 [2] Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc. 81 461–470. · Zbl 0621.62073 · doi:10.2307/2289236 [3] Efron, B. and Tibshirani, R. (1997). Improvements on cross-validation: The $$.632+$$ bootstrap method. J. Amer. Statist. Assoc. 92 548–560. · Zbl 0887.62044 · doi:10.2307/2965703 [4] Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of online learning and an application to boosting. J. Comput. System Sci. 55 119–139. · Zbl 0880.68103 · doi:10.1006/jcss.1997.1504 [5] Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 1189–1232. · Zbl 1043.62034 · doi:10.1214/aos/1013203451 [6] Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). Ann. Statist. 28 337–407. · Zbl 1106.62323 · doi:10.1214/aos/1016218223 · euclid:aos/1016218223 [7] Golub, G. and Van Loan, C. (1983). Matrix Computations . Johns Hopkins Univ. Press, Baltimore, MD. · Zbl 0559.65011 [8] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning : Data Mining , Inference and Prediction . Springer, New York. · Zbl 0973.62007 [9] Lawson, C. and Hanson, R. (1974). Solving Least Squares Problems . Prentice-Hall, Englewood Cliffs, NJ. · Zbl 0860.65028 [10] Mallows, C. (1973). Some comments on $$C_p$$. Technometrics 15 661–675. · Zbl 0269.62061 · doi:10.2307/1267380 [11] Meyer, M. and Woodroofe, M. (2000). On the degrees of freedom in shape-restricted regression. Ann. Statist. 28 1083–1104. · Zbl 1105.62340 · doi:10.1214/aos/1015956708 [12] Osborne, M., Presnell, B. and Turlach, B. (2000a). A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20 389–403. · Zbl 0962.65036 · doi:10.1093/imanum/20.3.389 [13] Osborne, M. R., Presnell, B. and Turlach, B. (2000b). On the LASSO and its dual. J. Comput. Graph. Statist. 9 319–337. [14] Rao, C. R. (1973). Linear Statistical Inference and Its Applications , 2nd ed. Wiley, New York. · Zbl 0256.62002 [15] Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9 1135–1151. JSTOR: · Zbl 0476.62035 · doi:10.1214/aos/1176345632 · links.jstor.org [16] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B. 58 267–288. · Zbl 0850.62538 [17] Weisberg, S. (1980). Applied Linear Regression . Wiley, New York. · Zbl 0529.62054 [18] Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. J. Amer. Statist. Assoc. 93 120–131. · Zbl 0920.62056 · doi:10.2307/2669609 [19] Breiman, L. (1992). The little bootstrap and other methods for dimensionality selection in regression: $$X$$-fixed prediction error. J. Amer. Statist. Assoc. 87 738–754. · Zbl 0850.62518 · doi:10.2307/2290212 [20] George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88 881–889. [21] Ishwaran, H. and Rao, J. S. (2000). Bayesian nonparametric MCMC for large variable selection problems. Unpublished manuscript. [22] Ishwaran, H. and Rao, J. S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. J. Amer. Statist. Assoc. 98 438–455. · Zbl 1041.62090 · doi:10.1198/016214503000224 [23] Mallows, C. (1973). Some comments on $$C_p$$. Technometrics 15 661–675. · Zbl 0269.62061 · doi:10.2307/1267380 [24] Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression (with discussion). J. Amer. Statist. Assoc. 83 1023–1036. · Zbl 0673.62051 · doi:10.2307/2290129 [25] Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88 486–494. · Zbl 0773.62051 · doi:10.2307/2290328 [26] Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140. · Zbl 0858.68080 [27] Bühlmann, P. and Yu, B. (2002). Analyzing bagging. Ann. Statist. 30 927–961. · Zbl 1029.62037 · doi:10.1214/aos/1031689014 · euclid:aos/1031689014 [28] Abramovich, F., Benjamini, Y., Donoho, D. and Johnstone, I. (2000). Adapting to unknown sparsity by controlling the false discovery rate. Technical Report 2000–19, Dept. Statistics, Stanford Univ. · Zbl 1092.62005 [29] Akaike, H. (1973). Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika 60 255–265. · Zbl 0318.62075 · doi:10.1093/biomet/60.2.255 [30] Birgé, L. and Massart, P. (2001a). Gaussian model selection. J. Eur. Math. Soc. 3 203–268. · Zbl 1037.62001 · doi:10.1007/s100970100031 [31] Birgé, L. and Massart, P. (2001b). A generalized $$C_p$$ criterion for Gaussian model selection. Technical Report 647, Univ. Paris 6 & 7. · Zbl 1037.62001 [32] Foster, D. and George, E. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947–1975. JSTOR: · Zbl 0829.62066 · doi:10.1214/aos/1176325766 · links.jstor.org [33] Knight, K. and Fu, B. (2000). Asymptotics for Lasso-type estimators. Ann. Statist. 28 1356–1378. · Zbl 1105.62357 · doi:10.1214/aos/1015957397 [34] Loubes, J.-M. and van de Geer, S. (2002). Adaptive estimation with soft thresholding penalties. Statist. Neerlandica 56 453–478. · Zbl 1090.62534 · doi:10.1111/1467-9574.00212 [35] Mallows, C. (1973). Some comments on $$C_p$$. Technometrics 15 661–675. · Zbl 0269.62061 · doi:10.2307/1267380 [36] van de Geer, S. (2001). Least squares estimation with complexity penalties. Math. Methods Statist. 10 355–374. · Zbl 1005.62043 [37] Breiman, L. (2001). Random forests. Available at ftp://ftp.stat.berkeley.edu/pub/users/breiman/ randomforest2001.pdf. · Zbl 1007.68152 [38] Fu, W. J. (1998). Penalized regressions: The Bridge versus the Lasso. J. Comput. Graph. Statist. 7 397–416. [39] Osborne, M. R., Presnell, B. and Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20 389–403. · Zbl 0962.65036 · doi:10.1093/imanum/20.3.389 [40] Ridgeway, G. (2003). GBM 0.7-2 package manual. Available at http://cran.r-project.org/doc/ packages/gbm.pdf. [41] Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 11 1493–1517. [42] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139. · Zbl 0880.68103 · doi:10.1006/jcss.1997.1504 [43] Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 1189–1232. · Zbl 1043.62034 · doi:10.1214/aos/1013203451 [44] Mason, L., Baxter, J., Bartlett, P. and Frean, M. (2000). Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems 12 512–518. MIT Press, Cambridge, MA. [45] Rosset, S. and Zhu, J. (2004). Piecewise linear regularized solution paths. Advances in Neural Information Processing Systems 16 . · Zbl 1194.62094 · doi:10.1214/009053606000001370 [46] Rosset, S., Zhu, J. and Hastie, T. (2003). Boosting as a regularized path to a maximum margin classifier. Technical report, Dept. Statistics, Stanford Univ. · Zbl 1222.68290 [47] Zhu, J., Rosset, S., Hastie, T. and Tibshirani, R. (2004). 1-norm support vector machines. Neural Information Processing Systems 16 . · Zbl 1222.68213 [48] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300. · Zbl 0809.62014 [49] Blake, C. and Merz, C. (1998). UCI repository of machine learning databases. Technical report, School Information and Computer Science, Univ. California, Irvine. Available at www.ics.uci.edu/ mlearn/MLRepository.html. [50] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425–455. · Zbl 0815.62019 · doi:10.1093/biomet/81.3.425 [51] Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947–1975. JSTOR: · Zbl 0829.62066 · doi:10.1214/aos/1176325766 · links.jstor.org [52] Foster, D. P. and Stine, R. A. (1996). Variable selection via information theory. Technical Report Discussion Paper 1180, Center for Mathematical Studies in Economics and Management Science, Northwestern Univ. [53] Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88 486–494. · Zbl 0773.62051 · doi:10.2307/2290328 [54] Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics 37 373–384. · Zbl 0862.62059 · doi:10.2307/1269730 [55] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models , 2nd ed. Chapman and Hall, London. · Zbl 0588.62104 [56] Moore, D. S. and McCabe, G. P. (1999). Introduction to the Practice of Statistics , 3rd ed. Freeman, New York. · Zbl 0701.62002 [57] Nelder, J. A. (1977). A reformulation of linear models (with discussion). J. Roy. Statist. Soc. Ser. A 140 48–76. [58] Nelder, J. A. (1994). The statistics of linear models: Back to basics. Statist. Comput. 4 221–234. [59] Cook, R. D. (1998). Regression Graphics . Wiley, New York. · Zbl 0903.62001 [60] Cook, R. D. and Weisberg, S. (1999a). Applied Regression Including Computing and Graphics. Wiley, New York. · Zbl 0928.62045 [61] Cook, R. D. and Weisberg, S. (1999b). Graphs in statistical analysis: Is the medium the message? Amer. Statist. 53 29–37. [62] Efron, B. (2001). Discussion of “Statistical modeling: The two cultures,” by L. Breiman. Statist. Sci. 16 218–219. · Zbl 1059.62505 · doi:10.1214/ss/1009213726 [63] Li, K. C. (1991). Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc. 86 316–342. · Zbl 0742.62044 · doi:10.2307/2290563 [64] Weisberg, S. (1981). A statistic for allocating $$C_p$$ to individual cases. Technometrics 23 27–31. [65] Weisberg, S. (2002). Dimension reduction regression in R. J. Statistical Software 7 . (On-line journal available at www.jstatsoft.org. The software is available from cran.r-project.org.) [66] Abramovich, F., Benjamini, Y., Donoho, D. and Johnstone, I. (2000). Adapting to unknown sparsity by controlling the false discovery rate. Technical Report 2000-19, Dept. Statistics, Stanford Univ. · Zbl 1092.62005 [67] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. 3 203–268. · Zbl 1037.62001 · doi:10.1007/s100970100031 [68] Efron, B. (2004). The estimation of prediction error: Covariance penalties and cross-validation. J. Amer. Statist. Assoc. · Zbl 1117.62324 · doi:10.1198/016214504000000692 · masetto.asa.catchword.org [69] Foster, D. and Stine, R. (1997). An information theoretic comparison of model selection criteria. Technical report, Dept. Statistics, Univ. Pennsylvania. [70] George, E. I. and Foster, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika 87 731–747. · Zbl 1029.62008 · doi:10.1093/biomet/87.4.731 [71] Leblanc, M. and Tibshirani, R. (1998). Monotone shrinkage of trees. J. Comput. Graph. Statist. 7 417–433.