zbMATH — the first resource for mathematics

Screening and clustering of sparse regressions with finite non-Gaussian mixtures. (English) Zbl 1372.62012
Summary: This article proposes a method for regression settings in which the covariates are not Gaussian, which may give rise to approximately mixture-distributed errors, or in which the data were produced by a true mixture of regressions. The method begins with non-Gaussian mixture-based marginal variable screening, followed by fitting a full but relatively smaller mixture regression model to the selected data with the help of a new penalization scheme. Under certain regularity conditions, the new screening procedure is shown to possess a sure screening property even when the population is heterogeneous. We further prove that there exists an elbow point in the associated scree plot which results in a consistent estimator of the set of active covariates in the model. By simulations, we demonstrate that the new procedure can substantially improve on the performance of existing procedures in the context of variable screening and data clustering. By applying the proposed procedure to motif data analysis in molecular biology, we demonstrate that the new method holds promise in practice.
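The screen-then-select idea in the summary can be illustrated with a minimal sketch. It is not the authors' procedure: it swaps in the absolute marginal Pearson correlation (the utility used in plain sure independence screening) for the paper's non-Gaussian mixture-based marginal utility, and it uses a largest-gap heuristic as a stand-in for the elbow point of the scree plot. The function names `marginal_utilities` and `elbow_select`, and the synthetic data, are assumptions made for illustration only.

```python
import numpy as np


def marginal_utilities(X, y):
    """Absolute marginal Pearson correlation of each column of X with y.

    Assumption: a simple stand-in for the paper's mixture-based utility.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)


def elbow_select(utilities):
    """Keep the covariates ranked above the largest drop in sorted utilities.

    Assumption: a largest-gap heuristic, not the paper's elbow criterion.
    """
    order = np.argsort(utilities)[::-1]      # indices, largest utility first
    sorted_u = utilities[order]
    gaps = sorted_u[:-1] - sorted_u[1:]      # drop between consecutive ranks
    k = int(np.argmax(gaps)) + 1             # number of covariates kept
    return np.sort(order[:k])


# Synthetic example: 3 active covariates out of 50 (hypothetical data).
rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 2.0 * X[:, 2] + 0.5 * rng.standard_normal(n)

selected = elbow_select(marginal_utilities(X, y))
print(selected)
```

In the two-stage scheme the summary describes, the indices kept by the screening step would then be passed to a penalized mixture regression fit on the reduced design; that second stage is omitted here.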
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P10 Applications of statistics to biology and medical sciences; meta analysis
[1] Bühlmann, Statistics for High-Dimensional Data: Methods, Theory and Applications (2010)
[2] Conlon, Integrating regulatory motif discovery and genome-wide expression analysis, Proceedings of the National Academy of Sciences 100 pp 3339– (2003) · doi:10.1073/pnas.0630591100
[3] Fan, Nonparametric independence screening in sparse ultra-high-dimensional additive models, Journal of the American Statistical Association 106 pp 544– (2011) · Zbl 1232.62064 · doi:10.1198/jasa.2011.tm09779
[4] Fan, Challenges of Big Data analysis, National Science Review 1 pp 293– (2014) · doi:10.1093/nsr/nwt032
[5] Fan, Sure independence screening in generalized linear models with NP-dimensionality, Annals of Statistics 38 pp 3567– (2010) · Zbl 1206.68157 · doi:10.1214/10-AOS798
[6] Fan, Sure independence screening for ultra-high dimensional feature space (with discussion), Journal of the Royal Statistical Society, Series B 70 pp 849– (2008) · doi:10.1111/j.1467-9868.2008.00674.x
[7] Gupta, Variable selection in regression mixture modeling for the discovery of gene regulatory networks, Journal of the American Statistical Association 102 pp 867– (2007) · Zbl 05564417 · doi:10.1198/016214507000000068
[8] Khalili, Feature selection in finite mixture of sparse normal linear models in high-dimensional feature space, Biostatistics 12 pp 156– (2011) · doi:10.1093/biostatistics/kxq048
[9] McLachlan, G.; Peel, D. (2000)
[10] Städler, ℓ1-penalization for mixture regression models, Test 19 pp 209– (2010) · Zbl 1203.62128 · doi:10.1007/s11749-010-0197-z
[11] Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B 58 pp 267– (1996) · Zbl 0850.62538
[12] Zhang, A Bayesian model for biclustering with applications, Journal of the Royal Statistical Society, Series C (Applied Statistics) 59 pp 635– (2010)
[13] Zhang, Robust clustering using exponential power mixtures, Biometrics 66 pp 1078– (2010) · Zbl 1233.62192 · doi:10.1111/j.1541-0420.2010.01389.x