
Multiple scaled contaminated normal distribution and its application in clustering. (English) Zbl 07381254

Summary: The multivariate contaminated normal (MCN) distribution is a simple heavy-tailed generalization of the multivariate normal (MN) distribution. It models elliptically contoured scatters in the presence of mild outliers (also referred to as ‘bad’ points herein) and detects bad points automatically. The price of these advantages is two additional parameters: the proportion of good observations and the degree of contamination. In a multivariate setting, however, a single proportion of good observations and a single degree of contamination may be limiting. To overcome this limitation, we propose a multiple scaled contaminated normal (MSCN) distribution. Among its parameters is an orthogonal matrix \(\Gamma\). In the space spanned by the vectors (principal components) of \(\Gamma\), there is a proportion of good observations and a degree of contamination for each component. Moreover, each observation has a posterior probability of being good with respect to each principal component. Thanks to this probability, the method provides directional robust estimates of the parameters of the nested MN distribution and automatic directional detection of bad points. The term ‘directional’ specifies that the method works separately for each principal component. Mixtures of MSCN distributions are also proposed, and an expectation-maximization algorithm is used for parameter estimation. Real and simulated data are considered to show the usefulness of our mixture with respect to well-established mixtures of symmetric distributions with heavy tails.
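The difference between the two models in the summary can be made concrete with a small numerical sketch. It assumes the usual MCN density, \(\alpha\,\phi(x;\mu,\Sigma) + (1-\alpha)\,\phi(x;\mu,\eta\Sigma)\) with \(\eta > 1\), and, as one reading of the construction described above, an MSCN density formed as a product of univariate contaminated normals along the columns of \(\Gamma\), where \(\Sigma = \Gamma\Lambda\Gamma^\top\). The function and parameter names (`alpha`, `eta`, `lam`, `gamma`) are illustrative assumptions, not the authors' code.

```python
import math

def normal_pdf(y, var):
    """Univariate normal density with mean 0 and variance var."""
    return math.exp(-0.5 * y * y / var) / math.sqrt(2.0 * math.pi * var)

def project(x, mu, gamma):
    """Coordinates of x - mu along the columns of the orthogonal matrix gamma."""
    d = len(x)
    centered = [x[i] - mu[i] for i in range(d)]
    return [sum(gamma[i][h] * centered[i] for i in range(d)) for h in range(d)]

def mcn_pdf(x, mu, gamma, lam, alpha, eta):
    """MCN density: one proportion alpha and one inflation eta for all directions.

    Sigma = gamma diag(lam) gamma^T, so each MN density factorizes over the
    projected coordinates.
    """
    y = project(x, mu, gamma)
    good, bad = 1.0, 1.0
    for yh, lh in zip(y, lam):
        good *= normal_pdf(yh, lh)
        bad *= normal_pdf(yh, eta * lh)
    return alpha * good + (1.0 - alpha) * bad

def mscn_pdf(x, mu, gamma, lam, alphas, etas):
    """Assumed MSCN density: a separate (alpha_h, eta_h) per principal direction."""
    y = project(x, mu, gamma)
    dens = 1.0
    for yh, lh, a, e in zip(y, lam, alphas, etas):
        dens *= a * normal_pdf(yh, lh) + (1.0 - a) * normal_pdf(yh, e * lh)
    return dens

def mscn_prob_good(x, mu, gamma, lam, alphas, etas):
    """Posterior probability of being good, one value per principal direction."""
    y = project(x, mu, gamma)
    probs = []
    for yh, lh, a, e in zip(y, lam, alphas, etas):
        good = a * normal_pdf(yh, lh)
        bad = (1.0 - a) * normal_pdf(yh, e * lh)
        probs.append(good / (good + bad))
    return probs

# Hypothetical two-dimensional example with identity Gamma.
mu = [0.0, 0.0]
gamma = [[1.0, 0.0], [0.0, 1.0]]
lam = [1.0, 1.0]
# A point that is central along the first direction but extreme along the second.
point = [0.0, 5.0]
probs = mscn_prob_good(point, mu, gamma, lam, alphas=[0.9, 0.9], etas=[4.0, 4.0])
```

In this sketch the point `[0.0, 5.0]` receives a high probability of being good along the first direction and a low one along the second, which is the directional detection the summary refers to; the single-parameter MCN model can only label the whole observation as good or bad.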

MSC: 62-XX Statistics
