Variable selection for clustering with Gaussian mixture models. (English) Zbl 1172.62021

Summary: This article is concerned with variable selection for cluster analysis. The problem is treated as a model selection problem in the model-based cluster analysis context. A model generalizing that of A. E. Raftery and N. Dean [J. Am. Stat. Assoc. 101, No. 473, 168–178 (2006; Zbl 1118.62339)] is proposed to specify the role of each variable. This model does not need any prior assumptions about the linear link between the selected and discarded variables. The models are compared using the Bayesian information criterion. The role of each variable is determined by an algorithm that embeds two backward stepwise algorithms, one for variable selection in clustering and one for variable selection in linear regression. The identifiability of the model is established, and the consistency of the resulting criterion is proved under regularity conditions. Numerical experiments on simulated data sets and a genomic application illustrate the value of the procedure.
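The kind of BIC comparison mentioned in the summary can be illustrated on a toy regression problem: deciding whether one discarded variable is better explained by a linear regression on one selected variable or by an intercept alone. The sketch below is not the criterion of the paper. It assumes the convention BIC = 2 log L - k log n (larger is better), and all function names are hypothetical.

```python
import math

def gaussian_bic(residuals, n_params):
    """BIC = 2 * max log-likelihood - n_params * log(n); larger is better (assumed convention)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # ML estimate of the residual variance
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return 2 * loglik - n_params * math.log(n)

def bic_intercept_only(y):
    """Model: y is Gaussian and independent of x (parameters: mean, variance)."""
    mean = sum(y) / len(y)
    return gaussian_bic([v - mean for v in y], n_params=2)

def bic_simple_regression(x, y):
    """Model: y = a + b * x + Gaussian noise (parameters: a, b, variance)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return gaussian_bic(residuals, n_params=3)

# Deterministic toy data: x plays the role of a selected variable.
x = [i / 10 for i in range(50)]
y_linked = [1 + 2 * a + 0.1 * math.sin(i) for i, a in enumerate(x)]  # depends linearly on x
y_unlinked = [(-1) ** i for i in range(50)]                          # essentially unrelated to x

# The regression model should win for y_linked, the intercept-only model for y_unlinked.
linked_prefers_regression = bic_simple_regression(x, y_linked) > bic_intercept_only(y_linked)
unlinked_prefers_intercept = bic_intercept_only(y_unlinked) > bic_simple_regression(x, y_unlinked)
print(linked_prefers_regression, unlinked_prefers_intercept)
```

A stepwise procedure would repeat comparisons of this kind while adding or removing candidate variables; the penalty term k log n is what keeps a nearly useless regressor from being retained.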

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62J05 Linear regression; mixed models
65C60 Computational problems in statistics (MSC2010)
62P10 Applications of statistics to biology and medical sciences; meta analysis

Citations:

Zbl 1118.62339

Software:

mclust; Mixmod; UCI-ml

References:

[1] Anderson, An Introduction to Multivariate Statistical Analysis (2003) · Zbl 1039.62044
[2] Banfield, Model-based Gaussian and non-Gaussian clustering, Biometrics 49 pp 803– (1993) · Zbl 0794.62034
[3] Biernacki, Model-based cluster and discriminant analysis with the mixmod software, Computational Statistics and Data Analysis 51 pp 587– (2006) · Zbl 1157.62431
[4] Blake, C.; Keogh, E.; Merz, C., UCI Repository of Machine Learning Algorithms Databases (1999), http://www.ics.uci.edu/~mlearn/MLRepository.html
[5] Bouveyron, High-dimensional data clustering, Computational Statistics and Data Analysis 52 pp 502– (2007) · Zbl 1452.62433
[6] Breiman, Classification and Regression Trees (1984)
[7] Brusco, A variable selection heuristic for k-means clustering, Psychometrika 66 pp 249– (2001) · Zbl 1293.62237
[8] Celeux, Gaussian parsimonious clustering models, Pattern Recognition 28 pp 781– (1995) · Zbl 05480211
[9] Dash, Proceedings of the Second IEEE International Conference on Data Mining pp 115– (2002)
[10] Dempster, Maximum likelihood from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society, Series B 39 pp 1– (1977) · Zbl 0364.62022
[11] Devaney, Machine Learning: Proceedings of the Fourteenth International Conference pp 92– (1997)
[12] Fowlkes, Variable selection in clustering, Journal of Classification 5 pp 205– (1988)
[13] Fraley, Enhanced software for model-based clustering, density estimation, and discriminant analysis: mclust, Journal of Classification 20 pp 263– (2003) · Zbl 1055.62071
[14] Friedman, Clustering objects on subsets of attributes (with discussion), Journal of the Royal Statistical Society, Series B 66 pp 815– (2004) · Zbl 1060.62064
[15] Gagnot, CATdb: A public access to Arabidopsis transcriptome data from the URGV-CATMA platform, Nucleic Acids Research 36 pp 986– (2008) · Zbl 05438441
[16] Guyon, An introduction to variable and feature selection, Journal of Machine Learning Research 3 pp 1157– (2003) · Zbl 1102.68556
[17] Jammes, Genome-wide expression profiling of the host response to root-knot nematode infection in Arabidopsis, The Plant Journal 44 pp 447– (2005)
[18] Jiang, Cluster analysis for gene expression data: A survey, IEEE Transactions on Knowledge and Data Engineering 16 pp 1370– (2004) · Zbl 05110054
[19] Jouve, Proceedings of International Symposium on Methodologies for Intelligent Systems pp 583– (2005)
[20] Kass, Bayes factors, Journal of the American Statistical Association 90 pp 773– (1995) · Zbl 0846.62028
[21] Kim, Variable selection in clustering via Dirichlet process mixture models, Biometrika 93 pp 877– (2006) · Zbl 1436.62266
[22] Kohavi, Wrappers for feature subset selection, Artificial Intelligence 97 pp 273– (1997) · Zbl 0904.68143
[23] Law, Simultaneous feature selection and clustering using mixture models, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 pp 1154– (2004) · Zbl 05112235
[24] Maugis, Variable selection for clustering with Gaussian mixture models (2007)
[25] McLachlan, Finite Mixture Models (2000) · Zbl 0963.62061
[26] McLachlan, A mixture model-based approach to the clustering of microarray expression data, Bioinformatics 18 pp 413– (2002)
[27] Miller, Subset Selection in Regression (1990) · Zbl 0702.62057
[28] Raftery, Variable selection for model-based clustering, Journal of the American Statistical Association 101 pp 168– (2006) · Zbl 1118.62339
[29] Schwarz, Estimating the dimension of a model, The Annals of Statistics 6 pp 461– (1978) · Zbl 0379.62005
[30] Sharan, Ernst Schering Workshop on Bioinformatics and Genome Analysis (2002)
[31] Tadesse, Bayesian variable selection in clustering high-dimensional data, Journal of the American Statistical Association 100 pp 602– (2005) · Zbl 1117.62433
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.