## Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models.(English)Zbl 1464.62131

Summary: Model-based clustering using a family of Gaussian mixture models, with parsimonious factor analysis like covariance structure, is described and an efficient algorithm for its implementation is presented. This algorithm uses the alternating expectation-conditional maximization (AECM) variant of the expectation-maximization (EM) algorithm. Two central issues around the implementation of this family of models, namely model selection and convergence criteria, are discussed. These central issues also have implications for other model-based clustering techniques and for the implementation of techniques like the EM algorithm, in general. The Bayesian information criterion (BIC) is used for model selection and Aitken’s acceleration, which is shown to outperform the lack of progress criterion, is used to determine convergence. A brief introduction to parallel computing is then given before the implementation of this algorithm in parallel is facilitated within the master-slave paradigm. A simulation study is then carried out to confirm the effectiveness of this parallelization. The resulting software is applied to two datasets to demonstrate its effectiveness when compared to existing software.

### MSC:

 62-08 Computational methods for problems pertaining to statistics 62H30 Classification and discrimination; cluster analysis (statistical aspects) 65Y05 Parallel numerical computation

### Software:

clustvarsel; MASS (R); mclust; MPI; PGMM; AutoClass; S-PLUS; R
Full Text:

### References:

 [1] Banfield, J.D.; Raftery, A.E., Model-based gaussian and non-Gaussian clustering, Biometrics, 49, 3, 803-821, (1993) · Zbl 0794.62034 [2] Böhning, D.; Dietz, E.; Schaub, R.; Schlattmann, P.; Lindsay, B., The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family, Annals of the institute of statistical mathematics, 46, 373-388, (1994) · Zbl 0802.62017 [3] Campbell, N.A.; Mahon, R.J., A multivariate study of variation in two species of rock crab of genus leptograpsus, Australian journal of zoology, 22, 417-425, (1974) [4] Celeux, G.; Govaert, G., Gaussian parsimonious clustering models, Pattern recognition, 28, 781-793, (1995) [5] Cheeseman, P.; Stutz, J., Bayesian classification (autoclass): theory and results, (1996), American Association for Artificial Intelligence Menlo Park, CA, USA, pp. 153-180 [6] Dean, N., Raftery, A.E., 2006. The clustvarsel package. R Package Version 0.2-4 [7] Dempster, A.P.; Laird, N.M.; Rubin, D.B., Maximum likelihood from incomplete data via the EM algorithm, Journal of the royal statistical society. series B, 39, 1, 1-38, (1977) · Zbl 0364.62022 [8] Forina, M.; Armanino, C.; Castino, M.; Ubigli, M., Multivariate data analysis as a discriminating method of the origin of wines, Vitis, 25, 189-201, (1986) [9] Fraley, C.; Raftery, A.E., How many clusters? which clustering methods? answers via model-based cluster analysis, The computer journal, 41, 8, 578-588, (1998) · Zbl 0920.68038 [10] Fraley, C.; Raftery, A.E., Model-based clustering, discriminant analysis, and density estimation, Journal of the American statistical association, 97, 611-631, (2002) · Zbl 1073.62545 [11] Fraley, C.; Raftery, A.E., Enhanced software for model-based clustering, density estimation, and discriminant analysis: MCLUST, Journal of classification, 20, 263-286, (2003) · Zbl 1055.62071 [12] Gatu, C.; Yanev, P.; Kontoghiorghes, E., A graph approach to generate all possible regression submodels, Computational statistics and data analysis, 52, 2, 799-815, (2007) · Zbl 1452.62061 [13] Ghahramani, Z., Hinton, G.E., 1997. The EM algorithm for factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, Toronto [14] Gropp, W.; Lusk, E.; Skjellum, A., Using MPI: portable parallel programming with the message-passing interface, (1999), MIT Press Cambridge, MA [15] Hubert, L.; Arabie, P., Comparing partitions, Journal of classification, 2, 193-218, (1985) [16] Kass, R.E.; Raftery, A.E., Bayes factors, Journal of the American statistical association, 90, 773-795, (1995) · Zbl 0846.62028 [17] Keribin, C., Consistent estimation of the order of mixture models, sankhyā, The Indian journal of statistics. series A, 62, 1, 49-66, (2000) · Zbl 1081.62516 [18] Kontoghiorghes, E.J., Parallel algorithms for linear models: numerical methods and estimation problems, () [19] Leroux, B.G., Consistent estimation of a mixing distribution, The annals of statistics, 20, 1350-1360, (1992) · Zbl 0763.62015 [20] Lindsay, B.G., Mixture models: theory, geometry and applications, () · Zbl 1163.62326 [21] Lindstrom, M.J.; Bates, D.M., Newton – raphson and EM algorithms for linear mixed-effects models for repeated-measures data, Journal of the American statistical association, 83, 404, 1014-1022, (1988) · Zbl 0671.65119 [22] Lopes, H.F.; West, M., Bayesian model assessment in factor analysis, Statistica sinica, 14, 41-67, (2004) · Zbl 1035.62060 [23] McLachlan, G.J.; Krishnan, T., The EM algorithm and extensions, (2008), Wiley New York · Zbl 1165.62019 [24] McLachlan, G.J.; Peel, D., Robust cluster analysis via mixtures of multivariate $$t$$-distributions, (), 658-666 [25] McLachlan, G.J.; Peel, D., Finite mixture models, (2000), John Wiley & Sons New York · Zbl 0963.62061 [26] McLachlan, G.J.; Peel, D., Mixtures of factor analyzers, (), 599-606 [27] McLachlan, G.J.; Peel, D.; Bean, R.W., Modelling high-dimensional data by mixtures of factor analyzers, Computational statistics & data analysis, 41, 3-4, 379-388, (2003) · Zbl 1256.62036 [28] McNicholas, P.D., Murphy, T.B., 2005. Parsimonious Gaussian mixture models. Technical Report 05/11, Department of Statistics, Trinity College, Dublin [29] McNicholas, P.D.; Murphy, T.B., Parsimonious Gaussian mixture models, Statistics and computing, 18, 3, 285-296, (2008) [30] Meng, X.L.; Rubin, D.B., Maximum likelihood estimation via the ECM algorithm: A general framework, Biometrika, 80, 267-278, (1993) · Zbl 0778.62022 [31] Meng, X.L.; van Dyk, The EM algorithm — an old folk song sung to the fast tune (with discussion), Journal of the royal statistical society. series B, 59, 511-567, (1997) · Zbl 1090.62518 [32] Milidiú, R.L.; Rentera, R.P., Dpls and ppls: two pls algorithms for large data sets, Computational statistics and data analysis, 48, 1, 125-138, (2005) · Zbl 1429.62028 [33] Pizzuti, C.; Talia, D., P-autoclass: scalable parallel clustering for mining large data sets, IEEE transactions on knowledge and data engineering, 15, 3, 629-641, (2003) [34] Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P., Numerical recipes in C — the art of scientific computation, (1992), Cambridge University Press · Zbl 0845.65001 [35] R Development Core Team, R: A language and environment for statistical computing, (2008), R Foundation for Statistical Computing Vienna, Austria [36] Racine, J., Parallel distributed kernel estimation, Computational statistics & data analysis, 40, 2, 293-302, (2002) · Zbl 0993.62033 [37] Raftery, A.E.; Dean, N., Variable selection for model-based clustering, Journal of the American statistical association, 101, 473, 168-178, (2006) · Zbl 1118.62339 [38] Rand, W.M., Objective criteria for the evaluation of clustering methods, Journal of the American statistical association, 66, 846-850, (1971) [39] Ripley, B.D., Pattern recognition and neural networks, (1996), Cambridge University Press Cambridge, UK · Zbl 0853.62046 [40] Schwartz, G., Estimating the dimension of a model, The annals of statistics, 6, 31-38, (1978) [41] Spearman, C., The proof and measurement of association between two things, American journal of psychology, 15, 72-101, (1904) [42] Tipping, T.E.; Bishop, C.M., Mixtures of probabilistic principal component analysers, Neural computation, 11, 2, 443-482, (1999) [43] Titterington, D.M.; Smith, A.F.M.; Makov, U.E., Statistical analysis of finite mixture distributions, (1985), John Wiley & Sons Chichester · Zbl 0646.62013 [44] Venables, W.N.; Ripley, B.D., Modern applied statistics with S-PLUS, (2002), Springer New York · Zbl 1006.62003 [45] Yanev, P.; Kontoghiorghes, E.J., Efficient algorithms for estimating the general linear model, Parallel computing, 32, 2, 195-204, (2006)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.