Parsimonious mixtures of multivariate contaminated normal distributions. (English) Zbl 1353.62124

Summary: A mixture of multivariate contaminated normal distributions is developed for model-based clustering. In addition to the parameters of the classical normal mixture, our contaminated mixture has, for each cluster, a parameter controlling the proportion of mild outliers and one specifying the degree of contamination. Crucially, these parameters do not have to be specified a priori, which adds flexibility to our approach. Parsimony is introduced via eigen-decomposition of the component covariance matrices, and sufficient conditions for the identifiability of all the members of the resulting family are provided. An expectation-conditional maximization algorithm is outlined for parameter estimation, and various implementation issues are discussed. Using a large-scale simulation study, the behavior of the proposed approach is investigated and a comparison with well-established finite mixtures is provided. The performance of this novel family of models is also illustrated on artificial and real data.
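
To make the two extra per-cluster parameters concrete, the following is a minimal sketch of a single contaminated normal component in its standard form, written with NumPy only. It is not the authors' implementation. The symbols `alpha` (proportion of typical points, so `1 - alpha` is the proportion of mild outliers) and `eta` (covariance inflation factor, greater than 1), along with all function names, are assumptions introduced here for illustration. The last function sketches the eigen-decomposition of a covariance matrix into volume, orientation, and shape, on which the parsimonious family is built.

```python
import numpy as np


def mvn_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma) evaluated at each row of x."""
    x = np.atleast_2d(x)
    p = mu.shape[0]
    chol = np.linalg.cholesky(sigma)
    # Solving chol @ z = (x - mu).T gives squared Mahalanobis distances as sum(z**2).
    z = np.linalg.solve(chol, (x - mu).T)
    maha = np.sum(z ** 2, axis=0)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (p * np.log(2.0 * np.pi) + log_det + maha)


def contaminated_normal_pdf(x, mu, sigma, alpha, eta):
    """alpha * N(mu, sigma) + (1 - alpha) * N(mu, eta * sigma), with eta > 1 assumed."""
    good = alpha * np.exp(mvn_logpdf(x, mu, sigma))
    bad = (1.0 - alpha) * np.exp(mvn_logpdf(x, mu, eta * sigma))
    return good + bad


def prob_good(x, mu, sigma, alpha, eta):
    """Posterior probability that each row of x is a typical point, not a mild outlier."""
    good = alpha * np.exp(mvn_logpdf(x, mu, sigma))
    bad = (1.0 - alpha) * np.exp(mvn_logpdf(x, mu, eta * sigma))
    return good / (good + bad)


def covariance_from_eigen(lam, gamma, delta):
    """Rebuild sigma = lam * gamma @ diag(delta) @ gamma.T.

    lam is a positive scalar (volume), gamma an orthogonal matrix (orientation),
    and delta a vector of positive entries (shape, conventionally with product 1).
    """
    return lam * gamma @ np.diag(delta) @ gamma.T


if __name__ == "__main__":
    mu = np.zeros(2)
    sigma = covariance_from_eigen(1.0, np.eye(2), np.array([2.0, 0.5]))
    points = np.array([[0.0, 0.0], [6.0, 6.0]])
    print(contaminated_normal_pdf(points, mu, sigma, alpha=0.9, eta=10.0))
    print(prob_good(points, mu, sigma, alpha=0.9, eta=10.0))
```

In this sketch, a point whose `prob_good` value falls below 0.5 would be flagged as a mild outlier for that component. In a full mixture, these quantities would be computed per cluster and combined with the cluster membership probabilities.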

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F10 Point estimation
62-07 Data analysis (statistics) (MSC2010)