zbMATH — the first resource for mathematics

Persistene in high-dimensional linear predictor-selection and the virtue of overparametrization. (English) Zbl 1055.62078
Summary: Let \(Z^i=(Y^i,X_1^i,\dots,X^i_m)\), \(i=1,\dots,n\), be independent and identically distributed random vectors, \(Z^i\sim F\), \(F\in {\mathcal F}\). It is desired to predict \(Y\) by \(\sum\beta_jX_j\), where \((\beta_1,\dots,\beta_m)\in B^n\subseteq\mathbb{R}^m\), under a prediction loss. Suppose that \(m=n^\alpha\), \(\alpha>1\), that is, there are many more explanatory variables than observations. We consider sets \(B^n\) restricted by the maximal number of non-zero coefficients of their members, or by their \(I_1\) radius. We study the following asymptotic question: how ‘large’ may the set \(B^n\) be, so that it is still possible to select empirically a predictor whose risk under \(F\) is close to that of the best predictor in the set?
Sharp bounds for orders of magnitudes are given under various assumptions on \({\mathcal F}\). Algorithmic complexity of the ensuing procedures is also studied. The main message of this paper and the implications of the orders derived are that under various sparsity assumptions on the optimal predictor there is ‘asymptotically no harm’ in introducing many more explanatory variables than observations. Furthermore, such practice can be beneficial in comparison with a procedure that screens in advance a small subset of explanatory variables. Another main result is that ‘lasso’ procedures, that is, optimization under \(l_1\) constraints, could be efficient in finding optimal sparse predictors in high dimensions.

62H99 Multivariate analysis
62A01 Foundations and philosophical topics in statistics
62F30 Parametric inference under constraints
Full Text: DOI
[1] Bickel, P. and Levina, E. (2004) Some theory for Fisherś linear discriminant function, ’naive Bayes\', and some alternatives where there are more variables than observations. Bernoulli, 10, 989-1010. · Zbl 1064.62073 · doi:10.3150/bj/1106314847
[2] Billingsley, P. (1995) Probability and Measure. (3rd edition). New York: Wiley. · Zbl 0822.60002
[3] Breiman, L. (2001) Statistical modeling: the two cultures. Statist. Sci., 16, 199-231. Abstract can also be found in the ISI/STMA publication URL: · Zbl 1059.62505 · doi:10.1214/ss/1009213726 · isi.netserv.nl
[4] Breiman, L. and Freedman, D. (1983) How many variables should be entered in a regression equation? J. Amer. Statist. Assoc., 78, 131-136. JSTOR: · Zbl 0513.62068 · doi:10.2307/2287119 · links.jstor.org
[5] Chen, S., Donoho, D. and Saunders, M. (2001) Atomic decomposition by basis pursuit. SIAM Rev., 43, 129-159. JSTOR: · Zbl 0979.94010 · doi:10.1137/S003614450037906X · links.jstor.org
[6] Donoho, D.L. and Johnstone, I.M. (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425-455. Abstract can also be found in the ISI/STMA publication URL: JSTOR: links.jstor.org · Zbl 0815.62019 · doi:10.1093/biomet/81.3.425 · isi.netserv.nl
[7] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least angle regression. Ann. Statist., 32, 407-499. Abstract can also be found in the ISI/STMA publication URL: · Zbl 1091.62054 · doi:10.1214/009053604000000067 · isi.netserv.nl
[8] Emery, M., Nemirovski, A. and Voiculescu, D. (2000) Lectures on Probability Theory and Statistics. École d\' Été de Probabilités de Saint-Flour XXVIII - 1998 (P. Bernard, ed.), Lecture Notes in Math, 1738. Berlin: Springer-Verlag.
[9] Foster, D.P. and George, E.L. (1994) The risk inflation criterion for multiple regression. Ann. Statist., 22, 1947-1975. Abstract can also be found in the ISI/STMA publication URL: · Zbl 0829.62066 · doi:10.1214/aos/1176325766 · isi.netserv.nl
[10] Huber, P. (1973) Robust regression: asymptotics, conjectures, and Monte Carlo. Ann. Statist., 1, 799-821. · Zbl 0289.62033 · doi:10.1214/aos/1176342503
[11] Juditsky, A. and Nemirovski, A. (2000) Functional aggregation for nonparametric regression. Ann. Statist., 28, 681-712. Abstract can also be found in the ISI/STMA publication URL: · Zbl 1105.62338 · doi:10.1214/aos/1015951994 · euclid:aos/1015951994
[12] Le Cam, L. and Yang, G.L. (1990) Asymptotics in Statistics. New York: Springer-Verlag. · Zbl 0719.62003
[13] Lee, W.S., Bartlett, P.L. and Williamson, R.C. (1996) Efficient agnostics learning of neural networks with bounded fan in. IEEE Trans. Inform. Theory, 42, 2118-2132. · Zbl 0874.68253 · doi:10.1109/18.556601
[14] Nemirovski, A. and Yudin, D. (1983) Problem Complexity and Method Efficiency in Optimization. Chichester: Wiley. · Zbl 0501.90062
[15] Portnoy, S. (1984) Asymptotic behavior of M-estimators of p regression parameters when p2=n is large, I. Consistency. Ann. Statist., 12, 1298-1309. · Zbl 0584.62050 · doi:10.1214/aos/1176346793
[16] Silverstein, J.W. (1985) The smallest eigen value of a large dimensional Wishart matrix. Ann. Probab., 13, 1364-1368. · Zbl 0591.60025 · doi:10.1214/aop/1176992819
[17] Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58, 267-288. Abstract can also be found in the ISI/STMA publication URL: JSTOR: links.jstor.org · Zbl 0850.62538 · isi.netserv.nl
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.