Bayesian clustering of skewed and multimodal data using geometric skewed normal distributions. (English) Zbl 07345888

Summary: Model-based clustering approaches generally assume that the observations to be clustered are generated from a mixture of distributions, each component of the mixture corresponding to a particular parametric distribution. Most commonly, the underlying distribution is assumed to be normal, which is inadequate in many situations, for example when skewness or multimodality is present within the components. The problem intensifies as the data dimension increases, leading to inaccurate groupings and incorrect inference. A new Bayesian model-based clustering approach that can handle a variety of complexities in the data is proposed, based on a recently introduced family of geometric skew normal distributions. The performance of this methodology is illustrated through several simulation studies and through applications to datasets from genomics and medicine.
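The summary names the geometric skew normal family without defining it. As a rough illustration only, the sketch below assumes the univariate construction from Kundu's 2014 paper "Geometric skew normal distribution": a geometric number N of i.i.d. normal variables is summed, so the density is a geometric-weighted sum of normal densities. The function names `sample_gsn` and `gsn_pdf`, the truncation limit `max_terms`, and all parameter values are hypothetical choices made for this example. It is not the authors' Bayesian clustering procedure, only a way to see how a single component can be skewed or multimodal.

```python
import math
import random


def sample_gsn(mu, sigma, p, rng):
    """Draw one value as a sum of N iid Normal(mu, sigma^2) terms,
    with N ~ Geometric(p) on {1, 2, ...} (assumed construction)."""
    n = 1
    while rng.random() > p:  # each trial stops with probability p
        n += 1
    # A sum of n iid normals is Normal(n * mu, n * sigma^2).
    return rng.gauss(n * mu, sigma * math.sqrt(n))


def gsn_pdf(x, mu, sigma, p, max_terms=500):
    """Density as a geometric-weighted sum of normal densities.
    The infinite series is truncated at max_terms (an assumption)."""
    total = 0.0
    for k in range(1, max_terms + 1):
        weight = p * (1.0 - p) ** (k - 1)
        sd = sigma * math.sqrt(k)
        z = (x - k * mu) / sd
        total += weight * math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_gsn(1.0, 1.0, 0.5, rng) for _ in range(20000)]
    # Under the assumed construction the mean is mu / p.
    print("sample mean:", sum(draws) / len(draws))
    print("density at x = 1:", gsn_pdf(1.0, 1.0, 1.0, 0.5))
```

Under this construction, p = 1 recovers the ordinary normal distribution, while smaller p combined with a large |mu| relative to sigma spreads the weighted normal terms apart, which is what allows skewness and multiple modes within one mixture component.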


62-XX Statistics


sn; BayesDA; UCI-ml

