
Covariance-insured screening. (English) Zbl 1507.62073

Summary: Modern bio-technologies have produced a vast amount of high-throughput data with the number of predictors far greater than the sample size. In order to identify more novel biomarkers and understand biological mechanisms, it is vital to detect signals weakly associated with outcomes among ultrahigh-dimensional predictors. However, existing screening methods, which typically ignore correlation information, are likely to miss weak signals. By incorporating the inter-feature dependence, a covariance-insured screening approach is proposed to identify predictors that are jointly informative but marginally weakly associated with outcomes. The validity of the method is examined via extensive simulations and a real data study for selecting potential genetic factors related to the onset of multiple myeloma.
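
The idea of a predictor that is "jointly informative but marginally weakly associated" can be illustrated with a small population-level calculation. This is a minimal sketch under assumed values, not the covariance-insured screening procedure from the paper. The two-predictor linear model, the correlation `rho`, the coefficients `beta`, and the helper `marginal_covariances` are all hypothetical choices made for illustration. With y = X1*beta1 + X2*beta2 + noise, the covariance of predictor j with y is the j-th entry of Sigma times beta, so a nonzero joint effect can cancel out marginally when the predictors are correlated.

```python
def marginal_covariances(sigma, beta):
    """Population covariance of each predictor with y = X @ beta + noise,
    where the noise is independent of X. Entry j equals (sigma @ beta)[j]."""
    p = len(beta)
    return [sum(sigma[j][k] * beta[k] for k in range(p)) for j in range(p)]


# Assumed setup (hypothetical values, chosen only to show the cancellation).
rho = 0.5
sigma = [[1.0, rho], [rho, 1.0]]  # covariance matrix of (X1, X2)
beta = [1.0, -rho]                # joint effects; X2 has a nonzero effect

marginal = marginal_covariances(sigma, beta)

# X2 matters jointly (beta[1] != 0) but its marginal covariance with y is zero,
# so a screening rule that ranks predictors by marginal association would drop it.
print("joint effects:       ", beta)
print("marginal covariances:", marginal)
```

In this toy setting, ranking by marginal covariance keeps X1 and discards X2, even though X2 has a nonzero joint effect. Using the dependence between X1 and X2 is what lets such a signal be recovered.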

MSC:

62-08 Computational methods for problems pertaining to statistics
62P10 Applications of statistics to biology and medical sciences; meta analysis
