Asymmetric clusters and outliers: mixtures of multivariate contaminated shifted asymmetric Laplace distributions. (English) Zbl 1507.62136

Summary: Mixtures of multivariate contaminated shifted asymmetric Laplace distributions are developed for handling asymmetric clusters in the presence of outliers (also referred to as bad points herein). In addition to the parameters of the related non-contaminated mixture, for each (asymmetric) cluster, our model has one parameter controlling the proportion of outliers and another specifying the degree of contamination. Crucially, these parameters do not have to be specified a priori, adding a flexibility to our approach that is absent from other approaches such as trimming. Moreover, each observation is given an a posteriori probability of belonging to a particular cluster, and of being an outlier or not; advantageously, this allows for the automatic detection of outliers. An expectation-conditional maximization algorithm is outlined for parameter estimation and various implementation issues are discussed. The behavior of the proposed model is investigated, and compared with well-established finite mixture approaches, on artificial and real data.
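As a rough illustration of the automatic outlier detection described in the summary, the sketch below applies Bayes' rule to density values that are assumed to have been evaluated already. It does not implement the shifted asymmetric Laplace density or the ECM algorithm. The names `good_dens`, `bad_dens`, `pi`, `rho`, `posterior_probs` and `classify`, as well as the 0.5 cutoff, are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def posterior_probs(good_dens, bad_dens, pi, rho):
    """Posterior probabilities for a contaminated mixture (illustrative sketch).

    good_dens, bad_dens : (n, G) arrays holding, for each observation and
        cluster, the density of the "good" part and of the contaminated
        ("bad") part. How these are computed is left to the caller.
    pi  : (G,) mixing proportions (assumed to sum to 1).
    rho : (G,) assumed proportion of good points within each cluster.
    """
    good = rho * good_dens
    bad = (1.0 - rho) * bad_dens
    comp = pi * (good + bad)                    # weighted cluster densities
    z = comp / comp.sum(axis=1, keepdims=True)  # P(cluster g | x)
    v = good / (good + bad)                     # P(good | x, cluster g)
    return z, v

def classify(z, v, cutoff=0.5):
    """Assign each point to its most probable cluster and flag it as an
    outlier when its probability of being good there is below `cutoff`
    (the cutoff value is an assumption of this sketch)."""
    cluster = z.argmax(axis=1)
    is_outlier = v[np.arange(len(cluster)), cluster] < cutoff
    return cluster, is_outlier
```

Because the contamination parameters enter only through `rho` and the bad-part densities, the same two-step rule applies whatever values an estimation procedure returns for them; nothing has to be fixed in advance.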

MSC:

62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F10 Point estimation
62H10 Multivariate distribution of statistics

References:

[1] Abreu, N. G., Analise do perfil do cliente recheio e desenvolvimento de um sistema promocional (2011), Mestrado em Marketing, ISCTE-IUL, Lisbon, (Ph.D. thesis)
[2] Aitken, A. C., On Bernoulli’s numerical solution of algebraic equations, Proceedings of the Royal Society of Edinburgh, 46, 289-305 (1926) · JFM 52.0098.05
[3] Aitkin, M.; Wilson, G. T., Mixture models, outliers, and the EM algorithm, Technometrics, 22, 3, 325-331 (1980) · Zbl 0466.62034
[4] Altman, E. I., Financial ratios, discriminant analysis and the prediction of corporate bankruptcy, J. Finance, 23, 4, 589-609 (1968)
[5] Andrews, J. L.; McNicholas, P. D., Extending mixtures of multivariate \(t\)-factor analyzers, Stat. Comput., 21, 3, 361-373 (2011) · Zbl 1255.62171
[6] Andrews, J. L.; McNicholas, P., Model-based clustering, classification, and discriminant analysis via mixtures of multivariate \(t\)-distributions: the \(t\) eigen family, Stat. Comput., 22, 5, 1021-1029 (2012) · Zbl 1252.62062
[7] Azzalini, A., The skew-normal distribution and related multivariate families, Scand. J. Stat., 32, 2, 159-188 (2005) · Zbl 1091.62046
[8] Azzalini, A.; Capitanio, A., The skew-normal and related families, IMS Monographs, vol. 3 (2014), Cambridge University Press · Zbl 0924.62050
[9] Bagnato, L.; Punzo, A., Finite mixtures of unimodal beta and gamma densities and the \(k\)-bumps algorithm, Comput. Stat., 28, 4, 1571-1597 (2013) · Zbl 1306.65024
[10] Bagnato, L.; Punzo, A.; Zoia, M. G., The multivariate leptokurtic-normal distribution and its application in model-based clustering, Canad. J. Statist., 45, 1, 95-119 (2017) · Zbl 1462.62308
[11] Banfield, J. D.; Raftery, A. E., Model-Based Gaussian and Non-Gaussian Clustering, Biometrics, 49, 3, 803-821 (1993) · Zbl 0794.62034
[12] Basso, R. M.; Lachos, V. H.; Cabral, C. R.B.; Ghosh, P., Robust mixture modeling based on scale mixtures of skew-normal distributions, Comput. Statist. Data Anal., 54, 12, 2926-2941 (2010) · Zbl 1284.62193
[13] Berger, J.; Berliner, L., Robust Bayes and empirical Bayes analysis with \(\varepsilon \)-contaminated priors, Ann. Statist., 14, 2, 461-486 (1986) · Zbl 0602.62004
[14] Berkane, M.; Bentler, P. M., Estimation of contamination parameters and identification of outliers in multivariate data, Sociol. Methods Res., 17, 1, 55-64 (1988)
[15] Biernacki, C.; Celeux, G.; Govaert, G., Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models, Comput. Statist. Data Anal., 41, 3-4, 561-575 (2003) · Zbl 1429.62235
[16] Böhning, D.; Dietz, E.; Schaub, R.; Schlattmann, P.; Lindsay, B. G., The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family, Ann. Inst. Statist. Math., 46, 2, 373-388 (1994) · Zbl 0802.62017
[17] Brazier, S.; Sparks, R. S.J.; Carey, S. N.; Sigurdsson, H.; Westgate, J. A., Bimodal Grain Size Distribution and Secondary Thickening in Air-Fall Ash Layers, Nature, 301, 115-119 (1983)
[18] Browne, R. P.; McNicholas, P. D., A mixture of generalized hyperbolic distributions, Canad. J. Statist., 43, 2, 176-198 (2015) · Zbl 1320.62144
[19] Cabral, C. R.B.; Lachos, V. H.; Prates, M. O., Multivariate mixture modelling using skew-normal independent distributions, Comput. Statist. Data Anal., 56, 126-142 (2012) · Zbl 1239.62058
[20] Celeux, G.; Govaert, G., Gaussian parsimonious clustering models, Pattern Recognit., 28, 5, 781-793 (1995)
[21] Dang, U. J.; Browne, R. P.; McNicholas, P. D., Mixtures of multivariate power exponential distributions, Biometrics, 71, 4, 1081-1089 (2015) · Zbl 1419.62330
[22] Dang, U. J.; Punzo, A.; McNicholas, P. D.; Ingrassia, S.; Browne, R. P., Multivariate response and parsimony for Gaussian cluster-weighted models, J. Classification, 34, 1, 4-34 (2017) · Zbl 1364.62149
[23] Dasgupta, A.; Raftery, A. E., Detecting features in spatial point processes with clutter via model-based clustering, J. Amer. Statist. Assoc., 93, 441, 294-302 (1998) · Zbl 0906.62105
[24] Dharmadhikari, S.; Joag-Dev, K., Unimodality, convexity, and applications, Probability and Mathematical Statistics (1988), Elsevier Science · Zbl 0646.62008
[25] Fraley, C.; Raftery, A. E., Model-based clustering, discriminant analysis, and density estimation, J. Amer. Statist. Assoc., 97, 458, 611-631 (2002) · Zbl 1073.62545
[26] Franczak, B. C., Mixtures of shifted asymmetric Laplace distributions (2014), University of Guelph, (Ph.D. thesis)
[27] Franczak, B. C.; Browne, R. P.; McNicholas, P. D., Mixtures of shifted asymmetric Laplace distributions, IEEE Trans. Pattern Anal. Mach. Intell., 36, 6, 1149-1157 (2014)
[28] Gallaugher, M. P.B.; McNicholas, P. D., Finite mixtures of skewed matrix variate distributions, Pattern Recognit., 80, 83-93 (2018)
[29] Gallaugher, M. P.B.; McNicholas, P. D., Three skewed matrix variate distributions, Statist. Probab. Lett., 145, 3, 103-109 (2019) · Zbl 1414.62173
[30] Gallegos, M. T.; Ritter, G., Trimmed ML estimation of contaminated mixtures, Sankhyā, 71, 2, 164-220 (2009) · Zbl 1193.62021
[31] Gutierrez, R. G.; Carroll, R. J.; Wang, N.; Lee, G. H.; Taylor, B. H., Analysis of tomato root initiation using a normal mixture distribution, Biometrics, 51, 4, 1461-1468 (1995) · Zbl 0875.62505
[32] Hall, B.; Hall, M., LaplacesDemon: complete environment for Bayesian inference, Version 16.1.0 (2017)
[33] Hubert, L.; Arabie, P., Comparing partitions, J. Classification, 2, 1, 193-218 (1985)
[34] Karlis, D.; Santourian, A., Model-based clustering with non-elliptically contoured distributions, Stat. Comput., 19, 1, 73-83 (2009)
[35] Karlis, D.; Xekalaki, E., Choosing initial values for the EM algorithm for finite mixtures, Comput. Statist. Data Anal., 41, 3-4, 577-590 (2003) · Zbl 1429.62082
[36] Kass, R. E.; Raftery, A. E., Bayes factors, J. Amer. Statist. Assoc., 90, 773-795 (1995) · Zbl 0846.62028
[37] Keribin, C., Consistent estimation of the order of mixture models, Sankhyā, 62, 1, 49-66 (2000) · Zbl 1081.62516
[38] Kotz, S.; Kozubowski, T.; Podgorski, K., The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance (2012), Birkhäuser Boston · Zbl 0977.62003
[39] Lachos, V. H.; Ghosh, P.; Arellano-Valle, R. B., Likelihood based inference for skew-normal independent linear mixed models, Statist. Sinica, 20, 1, 303-322 (2010) · Zbl 1186.62071
[40] Lachos, V. H.; Labra, F. V., Multivariate skew-normal/independent distributions: properties and inference, Pro Mathematica, 28, 56, 11-53 (2014)
[41] Lee, S. X.; McLachlan, G. J., Finite mixtures of multivariate skew \(t\)-distributions: some recent and new results, Stat. Comput., 24, 2, 181-202 (2014) · Zbl 1325.62107
[42] Lin, T. I., Maximum likelihood estimation for multivariate skew normal mixture models, J. Multivariate Anal., 100, 2, 257-265 (2009) · Zbl 1152.62034
[43] Lin, T. I., Robust mixture modeling using multivariate skew \(t\)-distributions, Stat. Comput., 20, 343-356 (2010)
[44] Lin, T. I., Learning from incomplete data via parameterized \(t\) mixture models through eigenvalue decomposition, Comput. Statist. Data Anal., 71, 183-195 (2014) · Zbl 1471.62120
[45] Lin, T. I.; Ho, H. J.; Shen, P. S., Computationally efficient learning of multivariate \(t\) mixture models with missing information, Comput. Stat., 24, 3, 375-392 (2009) · Zbl 1189.62095
[46] Lin, T. I.; Lee, J. C.; Ho, H. J., On fast supervised learning for normal mixture models with missing information, Pattern Recognit., 39, 6, 1177-1187 (2006) · Zbl 1096.68723
[47] Lin, T. I.; Lee, J. C.; Yen, S. Y., Finite mixture modelling using the skew normal distribution, Statist. Sinica, 17, 3, 909-927 (2007) · Zbl 1133.62012
[48] Lin, T. I.; Wang, W. L.; McLachlan, G. J.; Lee, S. X., Robust mixtures of factor analysis models using the restricted multivariate skew-\(t\) distribution, Stat. Model., 18, 1, 50-72 (2018) · Zbl 07289498
[49] Lo, K.; Gottardo, R., Flexible mixture modeling via the multivariate \(t\) distribution with the Box-Cox transformation: an alternative to the skew-\(t\) distribution, Stat. Comput., 22, 1, 33-52 (2012) · Zbl 1322.62173
[50] Maruotti, A.; Punzo, A., Model-based time-varying clustering of multivariate longitudinal data with covariates and outliers, Comput. Statist. Data Anal., 113, 475-496 (2017) · Zbl 1464.62128
[51] Mazza, A.; Punzo, A., Mixtures of multivariate contaminated normal regression models, Statist. Papers (2018) · Zbl 1435.62238
[52] McLachlan, G. J.; Basford, K. E., Mixture Models - Inference and Applications to Clustering, 254 (1988), Marcel Dekker: New York · Zbl 0697.62050
[53] McLachlan, G. J.; Bean, R. W.; Jones, L. B.T., Extension of the mixture of factor analyzers model to incorporate the multivariate \(t\)-distribution, Comput. Statist. Data Anal., 51, 11, 5327-5338 (2007) · Zbl 1445.62053
[54] McLachlan, G.; Krishnan, T., The EM algorithm and extensions (2008), Wiley: Hoboken, New Jersey · Zbl 1165.62019
[55] McLachlan, G. J.; Peel, D., Finite Mixture Models, 419 (2000), John Wiley & Sons: New York · Zbl 0963.62061
[56] McLachlan, G. J.; Peel, D.; Bean, R. W., Modelling high-dimensional data by mixtures of factor analyzers, Comput. Statist. Data Anal., 41, 3-4, 379-388 (2003) · Zbl 1256.62036
[57] McNicholas, P. D., Mixture model-based classification (2016), Chapman & Hall/CRC Press: Boca Raton
[58] McNicholas, P. D., Model-Based clustering, J. Classification, 33, 3, 331-373 (2016) · Zbl 1364.62155
[59] McNicholas, S. M.; McNicholas, P. D.; Browne, R. P., A mixture of variance-gamma factor analyzers, (Ahmed, S. E., Big and Complex Data Analysis: Methodologies and Applications (2017), Springer International Publishing: Cham), 369-385 · Zbl 1381.62187
[60] McNicholas, P. D.; Murphy, T. B., Parsimonious Gaussian mixture models, Stat. Comput., 18, 285-296 (2008)
[61] McNicholas, P. D.; Murphy, T. B.; McDaid, A. F.; Frost, D., Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models, Comput. Statist. Data Anal., 54, 3, 711-723 (2010) · Zbl 1464.62131
[62] Melnykov, V.; Melnykov, I., Initializing the EM algorithm in Gaussian mixture models with an unknown number of components, Comput. Statist. Data Anal., 56, 6, 1381-1395 (2012) · Zbl 1246.65025
[63] Meng, X. L.; Rubin, D. B., Maximum likelihood estimation via the ECM algorithm: A general framework, Biometrika, 80, 2, 267-278 (1993) · Zbl 0778.62022
[64] Murphy, E. A., One Cause? Many Causes? The Argument from the Bimodal Distribution, J. Chronic Dis., 17, 4, 301-324 (1964)
[65] Murray, P. M.; Browne, R. P.; McNicholas, P. D., Mixtures of skew-t factor analyzers, Comput. Statist. Data Anal., 77, 326-335 (2014) · Zbl 1506.62132
[66] Murray, P. M.; Browne, R. P.; McNicholas, P. D., Hidden truncation hyperbolic distributions, finite mixtures thereof, and their application for clustering, J. Multivariate Anal., 161, 141-156 (2017) · Zbl 1403.62028
[67] Murray, P. M.; Browne, R. P.; McNicholas, P. D., A mixture of SDB skew-\(t\) factor analyzers, Econom. Stat., 3, 160-168 (2017)
[68] O’Hagan, A.; Murphy, T. B.; Gormley, I. C.; McNicholas, P. D.; Karlis, D., Clustering with the multivariate normal inverse Gaussian distribution, Comput. Statist. Data Anal., 93, 18-30 (2016) · Zbl 1468.62151
[69] Peel, D.; McLachlan, G. J., Robust mixture modelling using the \(t\) distribution, Stat. Comput., 10, 4, 339-348 (2000)
[70] Prates, M.; Lachos, V.; Cabral, C. B., mixsmsn: fitting finite mixture of scale mixture of skew-normal distributions, J. Stat. Softw., 54, 12, 1-20 (2013)
[71] Punzo, A., A new look at the inverse Gaussian distribution with applications to insurance and economic data, J. Appl. Stat. (2018)
[72] Punzo, A.; Bagnato, L.; Maruotti, A., Compound unimodal distributions for insurance losses, Insurance Math. Econom., 81, 4, 95-107 (2018) · Zbl 1416.91217
[73] Punzo, A.; Browne, R. P.; McNicholas, P. D., Hypothesis testing for mixture model selection, J. Stat. Comput. Simul., 86, 14, 2797-2818 (2016) · Zbl 07184768
[74] Punzo, A.; Maruotti, A., Clustering multivariate longitudinal observations: the contaminated Gaussian hidden Markov model, J. Comput. Graph. Statist., 25, 4, 1097-1116 (2016)
[75] Punzo, A.; Mazza, A.; Maruotti, A., Fitting insurance and economic data with outliers: a flexible approach based on finite mixtures of contaminated gamma distributions, J. Appl. Stat., 45, 14, 2563-2584 (2018) · Zbl 1516.62555
[76] Punzo, A.; Mazza, A.; McNicholas, P. D., ContaminatedMixt: model-based clustering and classification with the multivariate contaminated normal distribution, Version 1.1 (2017)
[77] Punzo, A.; Mazza, A.; McNicholas, P. D., ContaminatedMixt: an R package for fitting parsimonious mixtures of multivariate contaminated normal distributions, J. Stat. Softw., 85, 10, 1-25 (2018)
[78] Punzo, A.; McNicholas, P. D., Robust high-dimensional modeling with the contaminated Gaussian distribution (2014), arXiv.org e-print 1408.2128. Available at:
[79] Punzo, A.; McNicholas, P. D., Parsimonious mixtures of multivariate contaminated normal distributions, Biom. J., 58, 6, 1506-1537 (2016) · Zbl 1353.62124
[80] Punzo, A.; McNicholas, P. D., Robust clustering in regression analysis via the contaminated Gaussian cluster-weighted model, J. Classification, 34, 2, 249-293 (2017) · Zbl 1373.62316
[81] Pyne, S.; Hu, X.; Wang, K.; Rossin, E.; Lin, T. I.; Maier, L. M.; Baecher-Allan, C.; McLachlan, G. J.; Tamayo, P.; Hafler, D. A.; De Jager, P. L.; Mesirov, J. P., Automated high-dimensional flow cytometric data analysis, Proc. Natl. Acad. Sci., 106, 21, 8519-8524 (2009)
[82] R: A Language and Environment for Statistical Computing (2017), R Foundation for Statistical Computing: Vienna, Austria
[83] Raftery, A. E., Bayesian model selection in social research, Sociol. Methodol., 25, 111-163 (1995)
[84] Rand, W., Objective criteria for the evaluation of clustering methods, J. Amer. Statist. Assoc., 66, 336, 846-850 (1971)
[85] Ruwet, C.; García-Escudero, L. A.; Gordaliza, A.; Mayo-Iscar, A., The influence function of the tclust robust clustering procedure, Adv. Data Anal. Classif., 6, 2, 107-130 (2012) · Zbl 1255.62182
[86] Schork, N. J.; Schork, M. A., Skewness and mixtures of normal distributions, Comm. Statist. Theory Methods, 17, 11, 3951-3969 (1988) · Zbl 0696.62062
[87] Schwarz, G., Estimating the dimension of a model, Ann. Statist., 6, 2, 461-464 (1978) · Zbl 0379.62005
[88] Scrucca, L.; Fop, M.; Murphy, T. B.; Raftery, A. E., mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, R J., 8, 1, 289-317 (2016)
[89] da Silva Ferreira, C.; Bolfarine, H.; Lachos, V. H., Skew scale mixtures of normal distributions: properties and estimation, Stat. Methodol., 8, 2, 154-171 (2011) · Zbl 1213.62023
[90] Steinley, D., Properties of the Hubert-Arabie adjusted Rand index, Psychol. Methods, 9, 3, 386-396 (2004)
[91] Subedi, S.; McNicholas, P. D., Variational Bayes approximations for clustering via mixtures of normal inverse Gaussian distributions, Adv. Data Anal. Classif., 8, 2, 167-193 (2014) · Zbl 1459.62122
[92] Subedi, S.; Punzo, A.; Ingrassia, S.; McNicholas, P. D., Clustering and classification via cluster-weighted factor analyzers, Adv. Data Anal. Classif., 7, 1, 5-40 (2013) · Zbl 1271.62137
[93] Subedi, S.; Punzo, A.; Ingrassia, S.; McNicholas, P. D., Cluster-weighted \(t\)-factor analyzers for robust model-based clustering and dimension reduction, Stat. Methods Appl., 24, 4, 623-649 (2015) · Zbl 1416.62362
[94] Tang, Y.; Browne, R.; McNicholas, P., Flexible clustering of high-dimensional data via mixtures of joint generalized hyperbolic distributions, Stat, 7, 1, Article e177 pp. (2018)
[95] Titterington, D. M.; Smith, A. F.M.; Makov, U. E., Statistical analysis of finite mixture distributions, 237 (1985), John Wiley & Sons: New York · Zbl 0646.62013
[96] Vrbik, I.; McNicholas, P. D., Analytic calculations for the EM algorithm for multivariate skew-mixture models, Statist. Probab. Lett., 82, 6, 1169-1174 (2012) · Zbl 1244.65012
[97] Vrbik, I.; McNicholas, P. D., Parsimonious skew mixture models for model-based clustering and classification, Comput. Statist. Data Anal., 71, 196-210 (2014) · Zbl 1471.62202
[98] Wand, M., KernSmooth: Functions for Kernel Smoothing Supporting Wand & Jones (1995), Version 2.23-15 (2015)
[99] Wang, W. L.; Castro, L. M.; Chang, Y. T.; Lin, T. I., Mixtures of restricted skew-\(t\) factor analyzers with common factor loadings, Adv. Data Anal. Classif. (2018)
[100] Wang, W. L.; Lin, T. I., Flexible clustering via extended mixtures of common \(t\)-factor analyzers, AStA Adv. Stat. Anal., 101, 3, 227-252 (2017) · Zbl 1443.62177
[101] Wang, W. L.; Liu, M.; Lin, T. I., Robust skew-\(t\) factor analysis models for handling missing data, Stat. Methods Appl., 26, 4, 649-672 (2017) · Zbl 1441.62161
[102] Wang, K.; Ng, S. K.; McLachlan, G. J., Multivariate skew \(t\) mixture models: applications to fluorescence-activated cell sorting data, (Digital Image Computing: Techniques and Applications (2009), IEEE: Los Alamitos, California)
[103] Wei, Y.; Tang, Y.; McNicholas, P. D., Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data, Comput. Statist. Data Anal., 130, 18-41 (2019) · Zbl 1469.62162
[104] Zhang, J.; Liang, F., Robust clustering using exponential power mixtures, Biometrics, 66, 4, 1078-1086 (2010) · Zbl 1233.62192
[105] Zhu, X.; Melnykov, V., ManlyMix: Manly mixture modeling and model-based clustering, Version 0.1.11 (2017)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.