zbMATH — the first resource for mathematics

Feature screening via distance correlation learning. (English) Zbl 1443.62184
Summary: This article is concerned with screening features in ultrahigh-dimensional data analysis, which has become increasingly important in diverse scientific fields. We develop a sure independence screening procedure based on the distance correlation (DC-SIS). The DC-SIS can be implemented as easily as the sure independence screening (SIS) procedure based on the Pearson correlation proposed by J. Fan and J. Lv [J. R. Stat. Soc., Ser. B, Stat. Methodol. 70, No. 5, 849–911 (2008; Zbl 1411.62187)], yet it can significantly improve on the SIS. Fan and Lv established the sure screening property for the SIS under linear models, whereas the sure screening property holds for the DC-SIS under more general settings that include linear models. Furthermore, implementing the DC-SIS does not require specifying a model (e.g., a linear model or a generalized linear model) for the responses or the predictors, which is a very appealing property in ultrahigh-dimensional data analysis. Moreover, the DC-SIS can be used directly to screen grouped predictor variables and to handle multivariate response variables. We establish the sure screening property for the DC-SIS and conduct simulations to examine its finite-sample performance. A numerical comparison indicates that the DC-SIS performs much better than the SIS in various models. We also illustrate the DC-SIS through a real-data example.
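
The screening idea described in the summary can be illustrated with a minimal sketch. It assumes the standard sample distance correlation built from double-centered pairwise distance matrices, and it ranks predictors by that statistic. The function names `distance_correlation` and `dc_sis` and the cutoff parameter `d` are hypothetical choices made for this illustration; they are not taken from the paper or from any library.

```python
import numpy as np


def _centered_distances(v):
    """Double-centered Euclidean distance matrix of a sample (rows = observations)."""
    v = np.asarray(v, dtype=float)
    if v.ndim == 1:
        v = v[:, None]  # treat a vector as n observations of one variable
    diff = v[:, None, :] - v[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    row_mean = dist.mean(axis=1, keepdims=True)
    col_mean = dist.mean(axis=0, keepdims=True)
    return dist - row_mean - col_mean + dist.mean()


def distance_correlation(x, y):
    """Sample distance correlation between x and y (each may be multivariate)."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2_xy = (a * b).mean()  # squared sample distance covariance
    dvar2_x = (a * a).mean()
    dvar2_y = (b * b).mean()
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom <= 0:
        return 0.0  # a constant sample carries no dependence information
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))


def dc_sis(X, y, d):
    """Rank the columns of X by distance correlation with y and keep the top d.

    The cutoff d is an illustrative assumption supplied by the caller.
    """
    scores = np.array(
        [distance_correlation(X[:, j], y) for j in range(X.shape[1])]
    )
    keep = np.argsort(-scores)[:d]  # indices in order of decreasing score
    return keep, scores
```

As a usage example, calling `dc_sis(X, y, d=5)` on an `n × p` array `X` and a response `y` returns the indices of the five highest-ranked columns together with all `p` scores. Because the statistic needs no model for the response, a predictor that enters only through a nonlinear term such as `X[:, 0] ** 2` can still rank highly, which is the kind of dependence a Pearson-correlation ranking tends to miss. Since `distance_correlation` accepts multivariate inputs, passing a block of columns or a multi-column `y` covers the grouped-predictor and multivariate-response cases mentioned in the summary.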

MSC:
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H20 Measures of association (correlation, canonical correlation, etc.)
62P10 Applications of statistics to biology and medical sciences; meta analysis
62J07 Ridge regression; shrinkage estimators (Lasso)
62J12 Generalized linear models (logistic models)