Modelling the role of variables in model-based cluster analysis. (English) Zbl 1384.62195

Summary: In the framework of cluster analysis based on Gaussian mixture models, it is usually assumed that all the variables provide information about the clustering of the sample units. Several variable selection procedures are available to detect the structure of interest for the clustering when this structure is contained in a variable sub-vector. Currently, in these procedures a variable is assumed to play one of (up to) three roles: (1) informative, (2) uninformative and correlated with some informative variables, (3) uninformative and uncorrelated with any informative variable. A more general approach for modelling the role of a variable is proposed by taking into account the possibility that the variable vector provides information about more than one structure of interest for the clustering. This approach is developed by assuming that such information is given by non-overlapping and possibly correlated sub-vectors of variables; it is also assumed that the model for the variable vector is equal to a product of conditionally independent Gaussian mixture models (one for each variable sub-vector). Details about model identifiability, parameter estimation and model selection are provided. The usefulness and effectiveness of the described methodology are illustrated using simulated and real datasets.
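
As a rough illustration of the model structure described in the summary, the simplest special case can be written out under the added assumption that the sub-vectors are mutually independent (the summary allows them to be correlated, and that general form is not reproduced here). The notation is hypothetical rather than the authors': \(\mathbf{x}_{S_1},\dots,\mathbf{x}_{S_G}\) are the \(G\) non-overlapping sub-vectors of \(\mathbf{x}\), \(K_g\) is the number of mixture components for sub-vector \(g\), and \(\phi\) is the multivariate Gaussian density.

\[
f(\mathbf{x}) = \prod_{g=1}^{G} \sum_{k=1}^{K_g} \pi_{gk}\, \phi\bigl(\mathbf{x}_{S_g}; \boldsymbol{\mu}_{gk}, \boldsymbol{\Sigma}_{gk}\bigr), \qquad \pi_{gk} > 0, \quad \sum_{k=1}^{K_g} \pi_{gk} = 1 .
\]

In this sketch each factor defines its own partition of the sample units, so \(G > 1\) corresponds to more than one cluster structure; with \(G = 1\) it reduces to an ordinary Gaussian mixture on all the variables.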

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62J05 Linear regression; mixed models

References:

[1] Anderson, T.: An Introduction to Multivariate Statistical Analysis, 3rd edn. Wiley, New York (2003) · Zbl 1039.62044
[2] Andrews, J.L., McNicholas, P.D.: Variable selection for clustering and classification. J. Classif. 31, 136-153 (2014) · Zbl 1360.62310 · doi:10.1007/s00357-013-9139-2
[3] Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803-821 (1993) · Zbl 0794.62034 · doi:10.2307/2532201
[4] Belitskaya-Levy, I.: A generalized clustering problem, with application to DNA microarrays. Stat. Appl. Genet. Mol. Biol. 5, Article 2 (2006) · Zbl 1166.62331
[5] Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22, 719-725 (2000) · doi:10.1109/34.865189
[6] Biernacki, C., Govaert, G.: Choosing models in model-based clustering and discriminant analysis. J. Stat. Comput. Simul. 64, 49-71 (1999) · Zbl 1156.62335 · doi:10.1080/00949659908811966
[7] Bozdogan, H.; Bozdogan, H. (ed.), Intelligent statistical data mining with information complexity and genetic algorithms, 15-56 (2004), London
[8] Browne, R.P., ElSherbiny, A., McNicholas, P.D.: mixture: mixture models for clustering and classification. R package version 1.4 (2015) · Zbl 1146.62101
[9] Brusco, M.J., Cradit, J.D.: A variable-selection heuristic for k-means clustering. Psychometrika 66, 249-270 (2001) · Zbl 1293.62237 · doi:10.1007/BF02294838
[10] Campbell, N.A., Mahon, R.J.: A multivariate study of variation in two species of rock crab of the genus Leptograpsus. Aust. J. Zool. 22, 417-425 (1974) · doi:10.1071/ZO9740417
[11] Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recognit. 28, 781-793 (1995) · doi:10.1016/0031-3203(94)00125-6
[12] Celeux, G., Martin-Magniette, M.-L., Maugis, C., Raftery, A.E.: Letter to the editor. J. Am. Stat. Assoc. 106, 383 (2011) · Zbl 1430.62126 · doi:10.1198/jasa.2011.tm10681
[13] Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C., Raftery, A.E.: Comparing model selection and regularization approaches to variable selection in model-based clustering. J. Soc. Fr. Statistique 155, 57-71 (2014) · Zbl 1316.62083
[14] Chatterjee, S., Laudato, M., Lynch, L.A.: Genetic algorithms and their statistical applications: an introduction. Comput. Stat. Data Anal. 22, 633-651 (1996) · Zbl 0900.62336 · doi:10.1016/0167-9473(96)00011-4
[15] Dang, X.H., Bailey, J.: A framework to uncover multiple alternative clusterings. Mach. Learn. 98, 7-30 (2015) · Zbl 1321.68399 · doi:10.1007/s10994-013-5338-7
[16] Dang, UJ; McNicholas, PD; Morlini, I. (ed.); Minerva, T. (ed.); Vichi, M. (ed.), Families of parsimonious finite mixtures of regression models, 73-84 (2015), Berlin · doi:10.1007/978-3-319-17377-1_9
[17] DeSarbo, W.S., Cron, W.L.: A maximum likelihood methodology for clusterwise linear regression. J. Classif. 5, 249-282 (1988) · Zbl 0692.62052 · doi:10.1007/BF01897167
[18] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1-22 (1977) · Zbl 0364.62022
[19] Dy, J.G., Brodley, C.E.: Feature selection for unsupervised learning. J. Mach. Learn. Res. 5, 845-889 (2004) · Zbl 1222.68187
[20] Fowlkes, E.B., Gnanadesikan, R., Kettenring, J.R.: Variable selection in clustering. J. Classif. 5, 205-228 (1988) · doi:10.1007/BF01897164
[21] Fraiman, R., Justel, A., Svarc, M.: Selection of variables for cluster analysis and classification rules. J. Am. Stat. Assoc. 103, 1294-1303 (2008) · Zbl 1205.62077 · doi:10.1198/016214508000000544
[22] Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97, 611-631 (2002) · Zbl 1073.62545 · doi:10.1198/016214502760047131
[23] Fraley, C., Raftery, A.E., Murphy, T.B., Scrucca, L.: mclust version 4 for R: normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report No. 597, Department of Statistics, University of Washington (2012) · Zbl 1520.62002
[24] Friedman, J.H., Meulman, J.J.: Clustering objects on subsets of attributes (with discussion). J. R. Stat. Soc. Ser. B 66, 815-849 (2004) · Zbl 1060.62064 · doi:10.1111/j.1467-9868.2004.02059.x
[25] Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer, New York (2006) · Zbl 1108.62002
[26] Galimberti, G., Montanari, A., Viroli, C.: Penalized factor mixture analysis for variable selection in clustered data. Comput. Stat. Data Anal. 53, 4301-4310 (2009) · Zbl 1453.62094 · doi:10.1016/j.csda.2009.05.025
[27] Galimberti, G., Scardovi, E., Soffritti, G.: Using mixtures in seemingly unrelated linear regression models with non-normal errors. Stat. Comput. 26, 1025-1038 (2016) · Zbl 1505.62150 · doi:10.1007/s11222-015-9587-0
[28] Galimberti, G., Soffritti, G.: Model-based methods to identify multiple cluster structures in a data set. Comput. Stat. Data Anal. 52, 520-536 (2007) · Zbl 1452.62442 · doi:10.1016/j.csda.2007.02.019
[29] Galimberti, G., Soffritti, G.: Using conditional independence for parsimonious model-based Gaussian clustering. Stat. Comput. 23, 625-638 (2013) · Zbl 1322.62167 · doi:10.1007/s11222-012-9336-6
[30] Gnanadesikan, R., Kettenring, J.R., Tsao, S.L.: Weighting and selection of variables for cluster analysis. J. Classif. 12, 113-136 (1995) · Zbl 0825.62540 · doi:10.1007/BF01202271
[31] Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989) · Zbl 0721.68056
[32] Gordon, A.D.: Classification, 2nd edn. Chapman & Hall, Boca Raton (1999) · Zbl 0929.62068
[33] Grün, B., Leisch, F.: Bootstrapping finite mixture models. In: Antoch, J. (ed.) Compstat 2004. Proceedings in computational statistics, pp. 1115-1122. Physica-Verlag/Springer, Heidelberg (2004)
[34] Guo, J., Levina, E., Michailidis, G., Zhu, J.: Pairwise variable selection for high-dimensional model-based clustering. Biometrics 66, 793-804 (2010) · Zbl 1203.62190 · doi:10.1111/j.1541-0420.2009.01341.x
[35] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York (2009) · Zbl 1273.62005 · doi:10.1007/978-0-387-84858-7
[36] Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2, 193-218 (1985) · Zbl 0587.62128 · doi:10.1007/BF01908075
[37] Kass, R.E., Raftery, A.E.: Bayes factors. J. Am. Stat. Assoc. 90, 773-795 (1995) · Zbl 0846.62028 · doi:10.1080/01621459.1995.10476572
[38] Keribin, C.: Consistent estimation of the order of mixture models. Sankhyā Ser. A 62, 49-66 (2000) · Zbl 1081.62516
[39] Law, M.H.C., Figueiredo, M.A.T., Jain, A.K.: Simultaneous feature selection and clustering using mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1154-1166 (2004) · doi:10.1109/TPAMI.2004.71
[40] Liu, T.-F., Zhang, N.L., Chen, P., Liu, A.H., Poon, L.K.M., Wang, Y.: Greedy learning of latent tree models for multidimensional clustering. Mach. Learn. 98, 301-330 (2015) · Zbl 1321.68408 · doi:10.1007/s10994-013-5393-0
[41] Malsiner-Walli, G., Frühwirth-Schnatter, S., Grün, B.: Model-based clustering based on sparse finite Gaussian mixtures. Stat. Comput. 26, 303-324 (2016) · Zbl 1342.62109 · doi:10.1007/s11222-014-9500-2
[42] Maugis, C., Celeux, G., Martin-Magniette, M.-L.: Variable selection for clustering with Gaussian mixture models. Biometrics 65, 701-709 (2009a) · Zbl 1172.62021 · doi:10.1111/j.1541-0420.2008.01160.x
[43] Maugis, C., Celeux, G., Martin-Magniette, M.-L.: Variable selection in model-based clustering: a general variable role modeling. Comput. Stat. Data Anal. 53, 3872-3882 (2009b) · Zbl 1453.62154 · doi:10.1016/j.csda.2009.04.013
[44] McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, Chichester (2000) · Zbl 0963.62061 · doi:10.1002/0471721182
[45] McLachlan, G.J., Peel, D., Bean, R.W.: Modelling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal. 41, 379-388 (2003) · Zbl 1256.62036 · doi:10.1016/S0167-9473(02)00183-4
[46] McNicholas, P.D., Murphy, T.B.: Parsimonious Gaussian mixture models. Stat. Comput. 18, 285-296 (2008) · doi:10.1007/s11222-008-9056-0
[47] McNicholas, P.D., Murphy, T.B., McDaid, A.F., Frost, D.: Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput. Stat. Data Anal. 54, 711-723 (2010) · Zbl 1464.62131 · doi:10.1016/j.csda.2009.02.011
[48] Melnykov, V., Maitra, R.: Finite mixture models and model-based clustering. Stat. Surv. 4, 80-116 (2010) · Zbl 1190.62121 · doi:10.1214/09-SS053
[49] Montanari, A., Lizzani, L.: A projection pursuit approach to variable selection. Comput. Stat. Data Anal. 35, 463-473 (2001) · Zbl 1080.62527 · doi:10.1016/S0167-9473(00)00026-8
[50] Pan, W., Shen, X.: Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 8, 1145-1164 (2007) · Zbl 1222.68279
[51] Poon, L.K.M., Zhang, N.L., Liu, T.-F., Liu, A.H.: Model-based clustering of high-dimensional data: variable selection versus facet determination. Int. J. Approx. Reason. 54, 196-215 (2013) · Zbl 1266.68160 · doi:10.1016/j.ijar.2012.08.001
[52] Quandt, R.E., Ramsey, J.B.: Estimating mixtures of normal distributions and switching regressions. J. Am. Stat. Assoc. 73, 730-738 (1978) · Zbl 0401.62024 · doi:10.1080/01621459.1978.10480085
[53] R Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL:http://www.R-project.org (2015) · Zbl 1266.68160
[54] Raftery, A.E., Dean, N.: Variable selection for model-based cluster analysis. J. Am. Stat. Assoc. 101, 168-178 (2006) · Zbl 1118.62339 · doi:10.1198/016214506000000113
[55] Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461-464 (1978) · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[56] Scrucca, L.: GA: a package for genetic algorithms in R. J. Stat. Softw. 53(4), 1-37 (2013)
[57] Scrucca, L.; Celebi, ME (ed.); Aydin, K. (ed.), Genetic algorithms for subset selection in model-based clustering, 55-70 (2016), Berlin · doi:10.1007/978-3-319-24211-8_3
[58] Scrucca, L., Raftery, A.E.: Improved initialisation of model-based clustering using Gaussian hierarchical partitions. Adv. Data Anal. Classif. 9, 447-460 (2015) · Zbl 1414.62272 · doi:10.1007/s11634-015-0220-z
[59] Scrucca, L., Raftery, A.E.: clustvarsel: a package implementing variable selection for model-based clustering in R (2014). Pre-print available at arxiv:1411.0606 · Zbl 1172.62021
[60] Soffritti, G.: Identifying multiple cluster structures in a data matrix. Commun. Stat. Simul. 32, 1151-1177 (2003) · Zbl 1100.62581 · doi:10.1081/SAC-120023883
[61] Soffritti, G., Galimberti, G.: Multivariate linear regression with non-normal errors: a solution based on mixture models. Stat. Comput. 21, 523-536 (2011) · Zbl 1221.62106 · doi:10.1007/s11222-010-9190-3
[62] Srivastava, M.S.: Methods of Multivariate Statistics. Wiley, New York (2002) · Zbl 1006.62048
[63] Steinley, D., Brusco, M.J.: A new variable weighting and selection procedure for k-means cluster analysis. Multivar. Behav. Res. 43, 77-108 (2008a) · doi:10.1080/00273170701836695
[64] Steinley, D., Brusco, M.J.: Selection of variables in cluster analysis: an empirical comparison of eight procedures. Psychometrika 73, 125-144 (2008b) · Zbl 1143.62327 · doi:10.1007/s11336-007-9019-y
[65] Tadesse, M.G., Sha, N., Vannucci, M.: Bayesian variable selection in clustering high-dimensional data. J. Am. Stat. Assoc. 100, 602-617 (2005) · Zbl 1117.62433 · doi:10.1198/016214504000001565
[66] Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S, 4th edn. Springer, New York (2002) · Zbl 1006.62003 · doi:10.1007/978-0-387-21706-2
[67] Viroli, C.: Dimensionally reduced model-based clustering through mixtures of factor mixture analyzers. J. Classif. 27, 363-388 (2010) · Zbl 1337.62141 · doi:10.1007/s00357-010-9063-7
[68] Wang, S., Zhu, J.: Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 64, 440-448 (2008) · Zbl 1137.62041 · doi:10.1111/j.1541-0420.2007.00922.x
[69] Witten, D.M., Tibshirani, R.: A framework for feature selection in clustering. J. Am. Stat. Assoc. 105, 713-726 (2010) · Zbl 1392.62194
[70] Xie, B., Pan, W., Shen, X.: Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics 64, 921-930 (2008) · Zbl 1146.62101 · doi:10.1111/j.1541-0420.2007.00955.x
[71] Zeng, H., Cheung, Y.-M.: A new feature selection method for Gaussian mixture clustering. Pattern Recognit. 42, 243-250 (2009) · Zbl 1181.68261 · doi:10.1016/j.patcog.2008.05.030
[72] Zhou, H., Pan, W., Shen, X.: Penalized model-based clustering with unconstrained covariance matrices. Electron. J. Stat. 3, 1473-1496 (2009) · Zbl 1326.62143 · doi:10.1214/09-EJS487
[73] Zhu, X., Melnykov, V.: Manly transformation in finite mixture modeling. Comput. Stat. Data Anal. (2016) · Zbl 1469.62184 · doi:10.1016/j.csda.2016.01.015
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.