Variable selection in model-based clustering and discriminant analysis with a regularization approach. (English) Zbl 1474.62216

Summary: Several methods for variable selection have been proposed in model-based clustering and classification. These make use of backward or forward procedures to define the roles of the variables. Unfortunately, such stepwise procedures are slow, and the resulting algorithms are inefficient when analyzing large data sets with many variables. In this paper, we propose an alternative regularization approach for variable selection in model-based clustering and classification. In our approach, the variables are first ranked using a lasso-like procedure in order to avoid slow stepwise algorithms. Thus, the variable selection methodology of C. Maugis et al. [Comput. Stat. Data Anal. 53, No. 11, 3872–3882 (2009; Zbl 1453.62154)] can be efficiently applied to high-dimensional data sets.
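
The summary does not spell out the lasso-like ranking step, so the following Python sketch only illustrates the general idea under stated assumptions and is not the authors' procedure. It assumes known class labels (the classification setting), globally standardized variables treated as having unit variance, and an L1 penalty on the class means. Under those assumptions, class mean k of a variable is soft-thresholded by lambda / n_k, so the variable is dropped once lambda reaches max_k n_k * |mean_k|. Variables are ranked by that threshold. The function names `standardize`, `drop_out_penalty` and `rank_variables` are hypothetical.

```python
from statistics import mean, pstdev


def standardize(columns):
    """Center each variable to mean 0 and scale it to unit population variance."""
    out = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        # A constant variable carries no information; map it to zeros.
        out.append([0.0 if s == 0 else (x - m) / s for x in col])
    return out


def drop_out_penalty(col, labels):
    """Smallest penalty at which all soft-thresholded class means are zero.

    Assumption (not from the source): penalty lambda * sum_k |mu_k| on the
    class means of a unit-variance variable, so mean k shrinks by lambda / n_k
    and vanishes when lambda >= n_k * |mean_k|.
    """
    groups = {}
    for x, g in zip(col, labels):
        groups.setdefault(g, []).append(x)
    return max(len(v) * abs(mean(v)) for v in groups.values())


def rank_variables(columns, labels):
    """Return variable indices ordered from most to least relevant."""
    z = standardize(columns)
    scores = [drop_out_penalty(col, labels) for col in z]
    # Variables that survive the largest penalties come first.
    return sorted(range(len(z)), key=lambda j: scores[j], reverse=True)


if __name__ == "__main__":
    # Toy data: each inner list is one variable observed on six units.
    columns = [
        [0.0, 0.1, -0.1, 5.0, 5.1, 4.9],  # strongly separates the two classes
        [1.0, -1.0, 0.0, 1.0, -1.0, 0.0],  # identical class means
        [0.0, 1.0, 2.0, 1.0, 2.0, 3.0],    # weak separation
    ]
    labels = [0, 0, 0, 1, 1, 1]
    print(rank_variables(columns, labels))
```

A ranking of this kind would let a role-based selection procedure examine the variables in a single ordered pass instead of repeated forward or backward sweeps, which is the efficiency argument made in the summary.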


62H30 Classification and discrimination; cluster analysis (statistical aspects)
91C20 Clustering in the social and behavioral sciences




[1] Banfield, JD; Raftery, AE, Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821, (1993) · Zbl 0794.62034
[2] Biernacki, C.; Celeux, G.; Govaert, G., Assessing a mixture model for clustering with the integrated completed likelihood, IEEE Trans Pattern Anal Mach Intell, 22, 719-725, (2000)
[3] Bouveyron, C.; Brunet, C., Discriminative variable selection for clustering with the sparse Fisher-EM algorithm, Comput Stat, 29, 489-513, (2014) · Zbl 1306.65033
[4] Celeux, G.; Govaert, G., Gaussian parsimonious clustering models, Pattern Recognit, 28, 781-793, (1995)
[5] Celeux, G.; Maugis, C.; Martin-Magniette, ML; Raftery, AE, Comparing model selection and regularization approaches to variable selection in model-based clustering, J Fr Stat Soc, 155, 57-71, (2014) · Zbl 1316.62083
[6] Dempster, AP; Laird, NM; Rubin, DB, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J Roy Stat Soc B, 39, 1-38, (1977) · Zbl 0364.62022
[7] Fraiman, R.; Justel, A.; Svarc, M., Selection of variables for cluster analysis and classification rules, J Am Stat Assoc, 103, 1294-1303, (2008) · Zbl 1205.62077
[8] Friedman, J.; Hastie, T.; Tibshirani, R., Sparse inverse covariance estimation with the graphical lasso, Biostatistics, 9, 432-441, (2007) · Zbl 1143.62076
[9] Friedman, J.; Hastie, T.; Tibshirani, R., glasso: graphical lasso - estimation of Gaussian graphical models, https://CRAN.R-project.org/package=glasso, accessed 22 July 2014, (2014)
[10] Gagnot, S.; Tamby, JP; Martin-Magniette, ML; Bitton, F.; Taconnat, L.; Balzergue, S.; Aubourg, S.; Renou, JP; Lecharny, A.; Brunaud, V., CATdb: a public access to Arabidopsis transcriptome data from the URGV-CATMA platform, Nucleic Acids Res, 36, D986-D990, (2008)
[11] Galimberti, G.; Montanari, A.; Viroli, C., Penalized factor mixture analysis for variable selection in clustered data, Comput Stat Data Anal, 53, 4301-4310, (2009) · Zbl 1453.62094
[12] Kim, S.; Song, DKH; DeSarbo, WS, Model-based segmentation featuring simultaneous segment-level variable selection, J Mark Res, 49, 725-736, (2012)
[13] Law, MH; Figueiredo, MAT; Jain, AK, Simultaneous feature selection and clustering using mixture models, IEEE Trans Pattern Anal Mach Intell, 26, 1154-1166, (2004)
[14] Lebret, R.; Iovleff, S.; Langrognet, F.; Biernacki, C.; Celeux, G.; Govaert, G., Rmixmod: the R package of the model-based unsupervised, supervised and semi-supervised classification mixmod library, J Stat Softw, 67, 241-270, (2015)
[15] Lee, H.; Li, J., Variable selection for clustering by separability based on ridgelines, J Comput Graph Stat, 21, 315-337, (2012)
[16] Maugis, C.; Celeux, G.; Martin-Magniette, ML, Variable selection for clustering with Gaussian mixture models, Biometrics, 65, 701-709, (2009) · Zbl 1172.62021
[17] Maugis, C.; Celeux, G.; Martin-Magniette, ML, Variable selection in model-based clustering: a general variable role modeling, Comput Stat Data Anal, 53, 3872-3882, (2009) · Zbl 1453.62154
[18] Maugis, C.; Celeux, G.; Martin-Magniette, ML, Variable selection in model-based discriminant analysis, J Multivar Anal, 102, 1374-1387, (2011) · Zbl 1219.62103
[19] Meinshausen, N.; Bühlmann, P., High-dimensional graphs and variable selection with the Lasso, Ann Stat, 34, 1436-1462, (2006) · Zbl 1113.62082
[20] Murphy, TB; Dean, N.; Raftery, AE, Variable selection and updating in model-based discriminant analysis for high-dimensional data with food authenticity applications, Ann Appl Stat, 4, 396-421, (2010) · Zbl 1189.62105
[21] Nia, VP; Davison, AC, High-dimensional Bayesian clustering with variable selection: the R package bclust, J Stat Softw, 47, 1-22, (2012)
[22] Pan, W.; Shen, X., Penalized model-based clustering with application to variable selection, J Mach Learn Res, 8, 1145-1164, (2007) · Zbl 1222.68279
[23] Raftery, AE; Dean, N., Variable selection for model-based clustering, J Am Stat Assoc, 101, 168-178, (2006) · Zbl 1118.62339
[24] Schwarz, G., Estimating the dimension of a model, Ann Stat, 6, 461-464, (1978) · Zbl 0379.62005
[25] Scrucca, L.; Raftery, AE, clustvarsel: a package implementing variable selection for model-based clustering in R, arXiv:1411.0606, (2014)
[26] Scrucca, L.; Fop, M.; Murphy, TB; Raftery, AE, mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, R J, 8, 289, (2016)
[27] Sun, W.; Wang, J.; Fang, Y., Regularized k-means clustering of high dimensional data and its asymptotic consistency, Electron J Stat, 6, 148-167, (2012) · Zbl 1335.62109
[28] Tadesse, MG; Sha, N.; Vannucci, M., Bayesian variable selection in clustering high-dimensional data, J Am Stat Assoc, 100, 602-617, (2005) · Zbl 1117.62433
[29] Wang, S.; Zhu, J., Variable selection for model-based high-dimensional clustering and its application to microarray data, Biometrics, 64, 440-448, (2008) · Zbl 1137.62041
[30] Xie, B.; Pan, W.; Shen, X., Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables, Electron J Stat, 2, 168-212, (2008) · Zbl 1135.62055
[31] Zhou, H.; Pan, W.; Shen, X., Penalized model-based clustering with unconstrained covariance matrices, Electron J Stat, 3, 1473-1496, (2009) · Zbl 1326.62143