×

Model-based clustering of high-dimensional data: a review. (English) Zbl 1471.62032

Summary: Model-based clustering is a popular tool which is renowned for its probabilistic foundations and its flexibility. However, high-dimensional data are nowadays more and more frequent and, unfortunately, classical model-based clustering techniques show a disappointing behavior in high-dimensional spaces. This is mainly due to the fact that model-based clustering methods are dramatically over-parametrized in this case. However, high-dimensional spaces have specific characteristics which are useful for clustering and recent techniques exploit those characteristics. After having recalled the bases of model-based clustering, dimension reduction approaches, regularization-based techniques, parsimonious modeling, subspace clustering methods and clustering methods based on variable selection are reviewed. Existing softwares for model-based clustering of high-dimensional data will be also reviewed and their practical use will be illustrated on real-world data sets.

MSC:

62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1998. Automatic subspace clustering of high-dimensional data for data mining application. In: ACM SIGMOD International Conference on Management of Data, pp. 94-105.
[2] Andrews, J. L.; McNicholas, P. D., Extending mixtures of multivariate \(t\)-factor analyzers, Statistics and Computing, 21, 3, 361-373, (2011) · Zbl 1255.62175
[3] Andrews, J. L.; McNicholas, P. D., Model-based clustering, classification, and discriminant analysis via mixtures of multivariate \(t\)-distributions, Statistics and Computing, 22, 5, 1021-1029, (2012) · Zbl 1252.62062
[4] Baek, J.; McLachlan, G. J.; Flack, L., Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualisation of high-dimensional data, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-13, (2009)
[5] Banfield, J.; Raftery, A. E., Model-based gaussian and non-Gaussian clustering, Biometrics, 49, 803-821, (1993) · Zbl 0794.62034
[6] Bellman, R., Dynamic programming, (1957), Princeton University Press · Zbl 0077.13605
[7] Bergé, L.; Bouveyron, C.; Girard, S., Hdclassif: an R package for model-based clustering and discriminant analysis of high-dimensional data, Journal of Statistical Software, 42, 6, 1-29, (2012)
[8] Bickel, P. J.; Levina, E., Covariance regularization by thresholding, The Annals of Statistics, 36, 2577-2604, (2008) · Zbl 1196.62062
[9] Bickel, P. J.; Levina, E., Regularized estimation of large covariance matrices, The Annals of Statistics, 36, 199-227, (2008) · Zbl 1132.62040
[10] Biernacki, C.; Celeux, G.; Govaert, G., Assessing a mixture model for clustering with the integrated completed likelihood, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 7, 719-725, (2001)
[11] Biernacki, C.; Celeux, G.; Govaert, G.; Langrognet, F., Model-based cluster and discriminant analysis with the mixmod software, Computational Statistics and Data Analysis, 51, 587-600, (2006) · Zbl 1157.62431
[12] Biernacki, C.; Jacques, J., A generative model for rank data based on insertion sort algorithm, Computational Statistics and Data Analysis, 58, 0, 162-176, (2013) · Zbl 1365.62167
[13] Bishop, C. M., Pattern recognition and machine learning, (2006), Springer New York · Zbl 1107.68072
[14] Bouchard, G., Bouveyron, C., 2007. The statlearn toolbox: statistical learning tools for Matlab. http://statlearn.free.fr/.
[15] Bouchard, G.; Celeux, G., Model selection in supervised classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 4, 544-554, (2005)
[16] Bouveyron, C.; Brunet, C., On the estimation of the latent discriminative subspace in the Fisher-EM algorithm, Journal de la Société Francaise de Statistique, 152, 3, 98-115, (2011) · Zbl 1316.62082
[17] Bouveyron, C., Brunet, C., 2012a. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report Preprint HAL 00685183, Laboratoire SAMM, Université Paris 1 Panthéon-Sorbonne. · Zbl 1306.65033
[18] Bouveyron, C.; Brunet, C., Simultaneous model-based clustering and visualization in the Fisher discriminative subspace, Statistics and Computing, 22, 1, 301-324, (2012) · Zbl 1322.62162
[19] Bouveyron, C.; Brunet, C., Theoretical and practical considerations on the convergence properties of the Fisher-EM algorithm, Journal of Multivariate Analysis, 109, 29-41, (2012) · Zbl 1352.62099
[20] Bouveyron, C.; Celeux, G.; Girard, S., Intrinsic dimension estimation by maximum likelihood in isotropic probabilistic PCA, Pattern Recognition Letters, 32, 14, 1706-1713, (2011)
[21] Bouveyron, C.; Girard, S.; Schmid, C., High-dimensional data clustering, Computational Statistics and Data Analysis, 52, 1, 502-519, (2007) · Zbl 1452.62433
[22] Bouveyron, C.; Girard, S.; Schmid, C., High dimensional discriminant analysis, Communications in Statistics: Theory and Methods, 36, 14, 2607-2623, (2007) · Zbl 1128.62072
[23] Campbell, N.; Mahon, R. J., A multivariate study of variation in two species of rock crabs of genus leptograpsus, Australian Journal of Zoology, 22, 417-425, (1974)
[24] Cattell, R., The scree test for the number of factors, Multivariate Behavioral Research, 1, 2, 145-276, (1966)
[25] Celeux, G.; Govaert, G., Gaussian parsimonious clustering models, Pattern Recognition, 28, 781-793, (1995)
[26] Celeux, G.; Martin-Magniette, M.-L.; Maugis, C.; Raftery, A. E., Letter to the editor, Journal of the American Statistical Association, 106, 493, (2011)
[27] Chang, W. C., On using principal component before separating a mixture of two multivariate normal distributions, Journal of the Royal Statistical Society, Series C, 32, 3, 267-275, (1983) · Zbl 0538.62050
[28] Chen, W. C.; Ostrouchov, G., Parallel model-based clustering, (2012), Oak Ridge National Laboratory Oak Ridge, TN, USA
[29] Dempster, A.; Laird, N.; Robin, D., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39, 1, 1-38, (1977) · Zbl 0364.62022
[30] Duda, R.; Hart, P.; Stork, D., Pattern classification, (2000), John Wiley & Sons
[31] Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R., Least angle regression, The Annals of Statistics, 32, 407-499, (2004) · Zbl 1091.62054
[32] El Karoui, N., 2007. Operator norm consistent estimation of large dimensional sparse covariance matrices. Technical report 734, UC Berkeley, Department of Statistics. · Zbl 1196.62064
[33] Fisher, R. A., The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7, 179-188, (1936)
[34] Foley, D. H.; Sammon, J. W., An optimal set of discriminant vectors, IEEE Transactions on Computers, 24, 281-289, (1975) · Zbl 0296.68106
[35] Fraley, C., Algorithms for model-based Gaussian hierarchical clustering, SIAM Journal on Scientific Computing, 20, 270-281, (1998) · Zbl 0911.62052
[36] Fraley, C.; Raftery, A. E., MCLUST: software for model-based cluster analysis, Journal of Classification, 16, 297-306, (1999) · Zbl 0951.91500
[37] Fraley, C.; Raftery, A. E., Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association, 97, 458, (2002) · Zbl 1073.62545
[38] Franczak, B.C., Browne, R.P., McNicholas, P.D., 2012. Mixtures of shifted asymmetric Laplace distributions. Preprint arXiv:1207.1727v2.
[39] Frank, A., Asuncion, A., 2010. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.
[40] Friedman, J. H., Regularized discriminant analysis, The Journal of the American Statistical Association, 84, 165-175, (1989)
[41] Friedman, J. H.; Hastie, T.; Tibshirani, R., Sparse inverse covariance estimation with the graphical lasso, Journal of the American Statistical Association, 104, 177-186, (2008) · Zbl 1143.62076
[42] Fukunaga, K., Introduction to statistical pattern recognition, (1990), Academic. Press San Diego · Zbl 0711.62052
[43] Galimberti, G.; Montanari, A.; Viroli, C., Penalized factor mixture analysis for variable selection in clustered data, Computational Statistics and Data Analysis, 53, 12, 4301-4310, (2009) · Zbl 1453.62094
[44] Galimberti, G.; Soffritti, G., Using conditional independence for parsimonious model-based Gaussian clustering, Statistics and Computing, (2012) · Zbl 1322.62167
[45] Ghahramani, Z., Hinton, G.E., 1997. The EM algorithm for factor analyzers. Technical report, University of Toronto.
[46] Hall, P.; Marron, J.; Neeman, A., Geometric representation of high dimension, low sample size data, Journal of the Royal Statistical Society, Serie B, 67, 3, 427-444, (2005) · Zbl 1069.62097
[47] Hastie, T.; Buja, A.; Tibshirani, R., Penalized discriminant analysis, The Annals of Statistics, 23, 73-102, (1995) · Zbl 0821.62031
[48] Hotelling, H., Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24, 417-441, (1933)
[49] Huber, P., Projection pursuit, The Annals of Statistics, 13, 2, 435-525, (1985) · Zbl 0595.62059
[50] Law, M.; Figueiredo, M.; Jain, A., Simultaneous feature selection and clustering using mixture models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 9, 1154-1166, (2004)
[51] Ledoit, O.; Wolf, M., A well-conditioned estimator for large-dimensional covariance matrices, Journal of Multivariate Analysis, 88, 365-411, (2003) · Zbl 1032.62050
[52] Lee, -J. C.; Lin, T. I.; Hsieh, W. J., Robust mixture modeling using the skew \(t\)-distribution, Statistics and Computing, 17, 81-92, (2007)
[53] Lee, S.; McLachlan, G. J., Finite mixtures of multivariate skew \(t\)-distributions: some recent and new results, Statistics and Computing, (2013)
[54] Lee, G.; Scott, C., EM algorithms for multivariate Gaussian mixture models with truncated and censored data, Computational Statistics and Data Analysis, 56, 9, 2816-2829, (2012) · Zbl 1255.62308
[55] Lin, T. I., Robust mixture modeling using multivariate skew t distribution, Statistics and Computing, 20, 343-356, (2010)
[56] Lindsay, B. G., (Mixture Models: Theory, Geometry and Applications, NSF-CBMS Regional Conference Series in Probability and Statistics, vol. 5, (1995), Institute of Mathematical Statistics)
[57] Liu, J.; Zhang, J. L.; Palumbo, M. J.; Lawrence, C. E., Bayesian clustering with variable and transformation selection, Bayesian Statistics, 7, 249-276, (2003)
[58] MacQueen, J., Some methods for classification and analysis of multivariate observations, (Cam, L. M.; Neyman, J., Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, (1967), University of California Press), 281-297 · Zbl 0214.46201
[59] Manolopoulou, I.; Kepler, T. B.; Merl, D. M., Mixtures of Gaussian wells: theory, computation, and application, Computational Statistics and Data Analysis, 56, 12, 3809-3820, (2012) · Zbl 1255.62178
[60] Maugis, C., 2009. The selvarclust software. http://www.math.univ-toulouse.fr/ maugis/SelvarClustHomepage.html.
[61] Maugis, C.; Celeux, G.; Martin-Magniette, M.-L., Variable selection for clustering with Gaussian mixture models, Biometrics, 65, 3, 701-709, (2009) · Zbl 1172.62021
[62] Maugis, C.; Celeux, G.; Martin-Magniette, M.-L., Variable selection in model-based clustering: a general variable role modeling, Computational Statistics and Data Analysis, 53, 3872-3882, (2009) · Zbl 1453.62154
[63] McLachlan, G.J., 2003. The EMMIX-MFA software. http://www.maths.uq.edu.au/ gjm/mix_soft/mfa/.
[64] McLachlan, G.J., 2010a. The EMMIX software. http://www.maths.uq.edu.au/ gjm/mix_soft/EMMIX_R/index.html.
[65] McLachlan, G.J., 2010b. The mcfa function for the R software. http://www.maths.uq.edu.au/ gjm/mix_soft/mcfa/.
[66] McLachlan, G. J.; Basford, K. E., Mixture models: inference and applications to clustering, (1988), Marcel Dekker New York · Zbl 0697.62050
[67] McLachlan, G. J.; Bean, R. W.; Ben-Tovim Jones, L., Extension of the mixture of factor analyzers model to incorporate the multivariate \(t\)-distribution, Computational Statistics and Data Analysis, 51, 5327-5338, (2011) · Zbl 1445.62053
[68] McLachlan, G. J.; Krishnan, T., The EM algorithm and extensions, (1997), Wiley Interscience New York · Zbl 0882.62012
[69] McLachlan, G. J.; Peel, D., Robust cluster analysis via mixtures of multivariate \(t\)-distributions, Lecture Notes in Computer Science, 1451, 658-666, (1998)
[70] McLachlan, G. J.; Peel, D., Finite mixture models, (2000), Wiley Interscience New York · Zbl 0963.62061
[71] McLachlan, G. J.; Peel, D.; Basford, K. E.; Adams, P., The emmix software for the Fitting of mixtures of normal \(t\)-components, Journal of Statistical Software, 4, 2, 1-14, (1999)
[72] McLachlan, G. J.; Peel, D.; Bean, R., Modelling high-dimensional data by mixtures of factor analyzers, Computational Statistics and Data Analysis, 41, 379, (2003) · Zbl 1256.62036
[73] McNicholas, P. D.; Murphy, T. B., Parsimonious Gaussian mixture models, Statistics and Computing, 18, 3, 285-296, (2008)
[74] McNicholas, P. D.; Murphy, T. B., Model-based clustering of microarray expression data via latent Gaussian mixture models, Bioinformatics, 26, 21, 2705-2712, (2010)
[75] McNicholas, P.D., Murphy, T.B., Jampani, K.R., McDaid, A.F., Banks, L., 2011. Pgmm Version 1.0 for R: Model-based clustering and classification via latent Gaussian mixture models. Technical Report 320, Department of Mathematics and Statistics, University of Guelph.
[76] Melnykov, V.; Melnykov, I., Initializing the EM algorithm in Gaussian mixture models with an unknown number of components, Computational Statistics and Data Analysis, 56, 6, 1381-1395, (2012) · Zbl 1246.65025
[77] Meng, X-L.; Van Dyk, D., The EM algorithm — an old folk song sung to a fast new tune, Journal of the Royal Statistical Society, Series B, 59, 3, 511-567, (1997) · Zbl 1090.62518
[78] Mkhadri, A.; Celeux, G.; Nasrollah, A., Regularization in discriminant analysis: a survey, Computational Statistics and Data Analysis, 23, 403-423, (1997) · Zbl 0875.62266
[79] Mo, C., 2009. emgm: EM algorithm for Gaussian mixture model. http://www.mathworks.com/matlabcentral/fileexchange/26184.
[80] Montanari, A.; Viroli, C., Heteroscedastic factor mixture analysis, Statistical Modelling, 10, 4, 441-460, (2010)
[81] Murtagh, F., The remarkable simplicity of very high dimensional data: application of model-based clustering, Journal of Classification, 26, 249-277, (2009) · Zbl 1337.62136
[82] Murtagh, F.; Raftery, A. E., Fitting straight lines to point patterns, Pattern Recognition, 17, 479-483, (1984)
[83] O’Hagan, A.; Murphy, T. B.; Gormley, I. C., Computational aspects of Fitting mixture models via the expectation-maximization algorithm, Computational Statistics and Data Analysis, 56, 12, 3843-3864, (2012) · Zbl 1255.62180
[84] Pan, W.; Shen, X., Penalized model-based clustering with application to variable selection, Journal of Machine Learning Research, 8, 1145-1164, (2007) · Zbl 1222.68279
[85] Parsons, L.; Haque, E.; Liu, H., Subspace clustering for high-dimensional data: a review, SIGKDD Exploration Newsletter, 6, 1, 69-76, (1998)
[86] Partovi Nia, V.; Davison, A. C., High-dimensional Bayesian clustering with variable selection: the R package bclust, Journal of Statistical Software, 47, 5, 1-22, (2012)
[87] Pavlenko, T., On feature selection, curse of dimensionality and error probability in discriminant analysis, Journal of Statistical Planning and Inference, 115, 565-584, (2003) · Zbl 1015.62066
[88] Pavlenko, T.; Von Rosen, D., Effect of dimensionality on discrimination, Statistics, 35, 3, 191-213, (2001) · Zbl 0980.62050
[89] Pearson, K., On lines and planes of closest fit to systems of points in space, Philosophical Magazine, 6, 2, 559-572, (1901) · JFM 32.0246.07
[90] Raftery, A. E.; Dean, N., Variable selection for model-based clustering, Journal of the American Statistical Association, 101, 473, 168-178, (2006) · Zbl 1118.62339
[91] Rubin, D.; Thayer, D., EM algorithms for ML factor analysis, Psychometrika, 47, 1, 69-76, (1982) · Zbl 0483.62046
[92] Sanguinetti, G., Dimensionality reduction of clustered datasets, IEEE Transactions On Pattern Analysis And Machine Intelligence, 30, 3, 1-29, (2008)
[93] Schwarz, G., Estimating the dimension of a model, The Annals of Statistics, 6, 461-464, (1978) · Zbl 0379.62005
[94] Scott, A. J.; Symons, M. J., Clustering methods based on likelihood ratio criteria, Biometrics, 27, 387-397, (1971)
[95] Scott, D., Thompson, J., 1983. Probability density estimation in higher dimensions, In: Fifteenth Symposium in the Interface, pp. 173-179.
[96] Scrucca, L., Dimension reduction for model-based clustering, Statistics and Computing, 20, 4, 471-484, (2010)
[97] Spearman, C., The proof and measurement of association between two things, American Journal of Psychology, 15, 72-101, (1904)
[98] Steiner, P. M.; Hudec, M., Classification of large data sets with mixture models via sufficient EM, Computational Statistics and Data Analysis, 51, 11, 5416-5428, (2007) · Zbl 1445.62153
[99] Tipping, M.E., Bishop, C.M., 1997. Probabilistic principal component analysis. Technical Report NCRG-97-010, Neural Computing Research Group, Aston University. · Zbl 0924.62068
[100] Tipping, M. E.; Bishop, C. M., Mixtures of probabilistic principal component analysers, Neural Computation, 11, 2, 443-482, (1999)
[101] Tran, T. N.; Wehrens, R.; Buydens, L. M.C., Knn-kernel density-based clustering for high-dimensional multivariate data, Computational Statistics and Data Analysis, 51, 2, 513-525, (2006) · Zbl 1157.62448
[102] Tritchler, D.; Fallah, S.; Beyene, J., A spectral clustering method for microarray data, Computational Statistics and Data Analysis, 49, 1, 63-76, (2005) · Zbl 1429.62266
[103] Venables, W. N.; Ripley, B. D., Modern applied statistics with S, (2002), Springer · Zbl 1006.62003
[104] Viroli, C., 2010a. The hmfa function for the R software. http://www2.stat.unibo.it/viroli/Cinzia_Viroli/Software_&_Data.html.
[105] Viroli, C., 2010b. The mmfa function for the R software. http://www2.stat.unibo.it/viroli/Software/MFMA_1.0.tar.gz.
[106] von Borries, G.; Wang, H., Partition clustering of high dimensional low sample size data based on \(p\)-values, Computational Statistics and Data Analysis, 53, 12, 3987-3998, (2009) · Zbl 1453.62233
[107] Vrbik, I.; McNicholas, P. D., Analytic calculations for the EM algorithm for multivariate skew-\(t\) mixture models, Statistics & Probability Letters, 82, 1169-1174, (2012) · Zbl 1244.65012
[108] Wang, S.; Zhou, J., Variable selection for model-based high dimensional clustering and its application to microarray data, Biometrics, 64, 440-448, (2008) · Zbl 1137.62041
[109] Ward, J. H., Hierarchical groupings to optimize an objective function, Journal of the American Statistical Association, 58, 234-244, (1963)
[110] Witten, D. M.; Tibshirani, R., A framework for feature selection in clustering, Journal of the American Statistical Association, 105, 490, 713-726, (2010) · Zbl 1392.62194
[111] Wolfe, J.H., 1963. Object cluster analysis of social areas. Master’s thesis, University of California, Berkeley.
[112] Wu, C., On the convergence properties of the EM algorithm, The Annals of Statistics, 11, 95-103, (1983) · Zbl 0517.62035
[113] Xie, B.; Pan, W.; Shen, X., Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables, Electrical Journal of Statistics, 2, 168-212, (2008)
[114] Xie, B.; Pan, W.; Shen, X., Penalized mixtures of factor analyzers with application to clustering high-dimensional microarray data, Bioinformatics, 26, 4, 501-508, (2010)
[115] Yoshida, R.; Higuchi, T.; Imoto, S., A mixed factor model for dimension reduction and extraction of a group structure in gene expression data, IEEE Computational Systems Bioinformatics Conference, 8, 161-172, (2004)
[116] Yoshida, R.; Higuchi, T.; Imoto, S.; Miyano, S., Array cluster: an analytic tool for clustering, data visualization and model finder on gene expression profiles, Bioinformatics, 22, 1538-1539, (2006)
[117] Zhang, Z., Dai, G., Jordan, M.I., 2009. A flexible and efficient algorithm for regularized fisher discriminant analysis, In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 632-647.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.