Robust rank correlation based screening. (English) Zbl 1257.62067

Summary: Independence screening is a variable selection method that uses a ranking criterion to select significant variables, particularly for statistical models with nonpolynomial dimensionality or “large \(p\), small \(n\)” paradigms, where \(p\) can grow exponentially with the sample size \(n\). We propose a robust rank correlation screening (RRCS) method to deal with ultra-high dimensional data. The new procedure is based on the Kendall \(\tau\) correlation coefficient between response and predictor variables rather than the Pearson correlation used by existing methods. The new method has four desirable features compared with existing independence screening methods.
First, the sure independence screening property holds under only the existence of a second-order moment of the predictor variables, rather than exponential tails or similar conditions, even when the number of predictor variables grows exponentially with the sample size. Second, it can handle semiparametric models such as transformation regression models and single-index models with a monotonic constraint on the link function, without involving nonparametric estimation even when there are nonparametric functions in the models. Third, the procedure is robust against outliers and influential points in the observations. Last, the use of indicator functions in rank correlation screening greatly simplifies the theoretical derivation, owing to the boundedness of the resulting statistics, compared with previous studies on variable screening. Simulations are carried out for comparison with existing methods, and a real data example is analyzed.
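The screening idea in the summary can be illustrated with a minimal sketch. It ranks predictors by the absolute sample Kendall \(\tau\) against the response and keeps the top \(d\). The summary does not give the exact RRCS statistic or threshold rule, so the tau-a formula, the function names `kendall_tau` and `rank_screen`, the cutoff `d`, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def kendall_tau(x, y):
    """Sample Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)  # sign of the product
    return s / (n * (n - 1) / 2)


def rank_screen(X, y, d):
    """Indices of the d columns of X with the largest |tau| against y.

    Hypothetical helper: keeping a fixed number d of predictors is an
    assumption; the summary does not state how the cutoff is chosen.
    """
    p = len(X[0])
    scores = [abs(kendall_tau([row[k] for row in X], y)) for k in range(p)]
    return sorted(range(p), key=lambda k: scores[k], reverse=True)[:d]


# Toy data (assumed): p > n, response is a monotone transform of a linear
# index in columns 0 and 1, plus one gross outlier in the response.
rng = random.Random(0)
n, p = 60, 200
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [math.exp(2 * row[0] - 2 * row[1] + 0.3 * rng.gauss(0, 1)) for row in X]
y[0] = 1e9  # outlier: changes only the rank of one observation

selected = rank_screen(X, y, d=10)
print(selected)
```

Because the statistic depends only on the signs of pairwise differences, it is bounded, is unchanged by strictly increasing transformations of the response, and moves only slightly when a single observation is replaced by an outlier, which is the intuition behind the second, third, and fourth features listed above.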

MSC:

62H20 Measures of association (correlation, canonical correlation, etc.)
62G08 Nonparametric regression and quantile regression
62G35 Nonparametric robustness
62J12 Generalized linear models (logistic models)
62F35 Robustness and adaptive procedures (parametric inference)

Software:

Excel

References:

[1] Albright, S. C., Winston, W. L. and Zappe, C. J. (1999). Data Analysis and Decision Making with Microsoft Excel. Duxbury, Pacific Grove, CA.
[2] Bickel, P. J. and Doksum, K. A. (1981). An analysis of transformations revisited. J. Amer. Statist. Assoc. 76 296-311. · Zbl 0464.62058 · doi:10.2307/2287831
[3] Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 26 211-252. · Zbl 0156.40104
[4] Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when \(p\) is much larger than \(n\). Ann. Statist. 35 2313-2351. · Zbl 1139.62019 · doi:10.1214/009053606000001523
[5] Cario, M. C. and Nelson, B. L. (1997). Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical report, Dept. Industrial Engineering and Management Sciences, Northwestern Univ., Evanston, IL.
[6] Carroll, R. J. and Ruppert, D. (1988). Transformation and Weighting in Regression. Chapman & Hall, New York. · Zbl 0666.62062
[7] Channouf, N. and L’Ecuyer, P. (2009). Fitting a normal copula for a multivariate distribution with both discrete and continuous marginals. In Proceedings of the 2009 Winter Simulation Conference 352-358.
[8] Cook, R. D. and Weisberg, S. (1991). Discussion of “Sliced inverse regression for dimension reduction,” by K. C. Li. J. Amer. Statist. Assoc. 86 328-332. · Zbl 0742.62044 · doi:10.2307/2290563
[9] Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. In Aide-Memoire of a Lecture at AMS Conference on Math Challenges of 21st Century.
[10] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407-451. · Zbl 1091.62054 · doi:10.1214/009053604000000067
[11] Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Amer. Statist. Assoc. 106 544-557. · Zbl 1232.62064 · doi:10.1198/jasa.2011.tm09779
[12] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348-1360. · Zbl 1073.62547 · doi:10.1198/016214501753382273
[13] Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In International Congress of Mathematicians. Vol. III (M. Sanz-Sole, J. Soria, J. L. Varona and J. Verdera, eds.) 595-622. Eur. Math. Soc., Zürich. · Zbl 1117.62137
[14] Fan, J. and Lv, J. (2008). Sure independence screening for ultra-high dimensional feature space (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 70 849-911. · doi:10.1111/j.1467-9868.2008.00674.x
[15] Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20 101-148. · Zbl 1180.62080
[16] Fan, J. and Lv, J. (2011). Non-concave penalized likelihood with NP-dimensionality. IEEE Trans. Inform. Theory 57 5467-5484. · Zbl 1365.62277 · doi:10.1109/TIT.2011.2158486
[17] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928-961. · Zbl 1092.62031 · doi:10.1214/009053604000000256
[18] Fan, J., Samworth, R. and Wu, Y. (2009). Ultrahigh dimensional variable selection: Beyond the linear model. J. Mach. Learn. Res. 10 1829-1853. · Zbl 1235.62089
[19] Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist. 38 3567-3604. · Zbl 1206.68157 · doi:10.1214/10-AOS798
[20] Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35 109-148. · Zbl 0775.62288 · doi:10.2307/1269656
[21] Ghosh, S. and Henderson, S. G. (2003). Behavior of the NORTA method for correlated random vector generation as the dimension increases. ACM Transactions on Modeling and Computer Simulation 13 276-294. · Zbl 1390.65009
[22] Hall, P. and Miller, H. (2009). Using generalized correlation to effect variable selection in very high dimensional problems. J. Comput. Graph. Statist. 18 533-550. · doi:10.1198/jcgs.2009.08041
[23] Han, A. K. (1987). Nonparametric analysis of a generalized regression model. The maximum rank correlation estimator. J. Econometrics 35 303-316. · Zbl 0638.62063 · doi:10.1016/0304-4076(87)90030-3
[24] Huang, J., Horowitz, J. L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 36 587-613. · Zbl 1133.62048 · doi:10.1214/009053607000000875
[25] Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics, 2nd ed. Wiley, Hoboken, NJ. · Zbl 1276.62022
[26] Kendall, M. G. (1938). A new measure of rank correlation. Biometrika 30 81-93. · Zbl 0019.13001 · doi:10.1093/biomet/30.1-2.81
[27] Kendall, M. G. (1949). Rank and product-moment correlation. Biometrika 36 177-193. · Zbl 0035.21602 · doi:10.1093/biomet/36.1-2.177
[28] Kendall, M. G. (1962). Rank Correlation Methods, 3rd ed. Griffin & Co, London. · Zbl 0032.17602
[29] Klaassen, C. A. J. and Wellner, J. A. (1997). Efficient estimation in the bivariate normal copula model: Normal margins are least favourable. Bernoulli 3 55-77. · Zbl 0877.62055 · doi:10.2307/3318652
[30] Li, K.-C. (1991). Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc. 86 316-342. · Zbl 0742.62044 · doi:10.2307/2290563
[31] Li, G., Peng, H. and Zhu, L. (2011). Nonconcave penalized \(M\)-estimation with a diverging number of parameters. Statist. Sinica 21 391-419. · Zbl 1206.62036
[32] Li, G. R., Peng, H., Zhang, J. and Zhu, L. X. (2012). Supplement to “Robust rank correlation based screening.” · Zbl 1257.62067
[33] Lin, H. and Peng, H. (2013). Smoothed rank correlation of the linear transformation regression model. Comput. Statist. Data Anal. 57 615-630. · Zbl 1365.62280
[34] Lv, J. and Fan, Y. (2009). A unified approach to model selection and sparse recovery using regularized least squares. Ann. Statist. 37 3498-3528. · Zbl 1369.62156 · doi:10.1214/09-AOS683
[35] Nelsen, R. B. (2006). An Introduction to Copulas, 2nd ed. Springer, New York. · Zbl 1152.62030
[36] Pitt, M., Chan, D. and Kohn, R. (2006). Efficient Bayesian inference for Gaussian copula regression models. Biometrika 93 537-554. · Zbl 1108.62027 · doi:10.1093/biomet/93.3.537
[37] Sen, P. K. (1968). Estimates of the regression coefficient based on Kendall’s tau. J. Amer. Statist. Assoc. 63 1379-1389. · Zbl 0167.47202 · doi:10.2307/2285891
[38] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 267-288. · Zbl 0850.62538
[39] van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614-645. · Zbl 1138.62323 · doi:10.1214/009053607000000929
[40] Wackerly, D. D., Mendenhall, W. and Scheaffer, R. L. (2002). Mathematical Statistics with Applications. Duxbury, Pacific Grove, CA. · Zbl 0681.62001
[41] Wang, H. (2012). Factor profiled sure independence screening. Biometrika 99 15-28. · Zbl 1234.62108 · doi:10.1093/biomet/asr074
[42] Xu, P. R. and Zhu, L. X. (2010). Sure independence screening for marginal longitudinal generalized linear models. Unpublished manuscript.
[43] Zhu, L. P., Li, L. X., Li, R. Z. and Zhu, L. X. (2011). Model-free feature screening for ultrahigh-dimensional data. J. Amer. Statist. Assoc. 106 1464-1474. · Zbl 1233.62195 · doi:10.1198/jasa.2011.tm10563
[44] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418-1429. · Zbl 1171.62326 · doi:10.1198/016214506000000735
[45] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301-320. · Zbl 1069.62054 · doi:10.1111/j.1467-9868.2005.00503.x
[46] Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Ann. Statist. 36 1509-1566. · Zbl 1282.62112 · doi:10.1214/009053607000000802