Model-based clustering and classification with non-normal mixture distributions. (English) Zbl 1332.62209

Summary: Non-normal mixture distributions have received increasing attention in recent years. Finite mixtures of multivariate skew-symmetric distributions, in particular the skew normal and skew \(t\)-mixture models, are emerging as promising extensions to the traditional normal and \(t\)-mixture models. Most of these parametric families of skew distributions are closely related and can be classified into four forms under a recently proposed scheme: the restricted, unrestricted, extended, and generalized forms. In this paper, we consider some of the existing proposals for multivariate non-normal mixture models and illustrate their practical use in several real applications. We first discuss the characterizations of these forms, with a brief account of some distributions belonging to the classification scheme, and then give references to software implementing EM-type algorithms for estimating the model parameters. We then compare the relative performance of restricted and unrestricted skew mixture models in clustering, discriminant analysis, and density estimation on six real datasets from flow cytometry, finance, and image analysis. We also compare the performance of mixtures of skew normal and \(t\)-component distributions with that of mixtures with other non-normal component distributions, including multivariate normal-inverse-Gaussian distributions, shifted asymmetric Laplace distributions, and generalized hyperbolic distributions.
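As a rough illustration of how a fitted skew mixture model is used for clustering, the sketch below evaluates a finite mixture density and assigns an observation to the component with the highest posterior probability. It is a simplification, not the paper's method or software: it uses the univariate skew normal density of Azzalini (1985) rather than the multivariate forms discussed in the summary, it assumes the parameters are already estimated (no EM step is shown), and the function names and the values in `WEIGHTS` and `PARAMS` are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def skew_normal_pdf(y, xi, omega, alpha):
    """Univariate skew normal density: (2 / omega) * phi(z) * Phi(alpha * z).

    xi is the location, omega > 0 the scale, alpha the skewness parameter;
    alpha = 0 recovers the normal density.
    """
    z = (y - xi) / omega
    return 2.0 / omega * norm_pdf(z) * norm_cdf(alpha * z)


def mixture_pdf(y, weights, params):
    """Finite mixture density: sum over components of weight * density."""
    return sum(w * skew_normal_pdf(y, *p) for w, p in zip(weights, params))


def posterior_probs(y, weights, params):
    """Posterior probability that y belongs to each component."""
    joint = [w * skew_normal_pdf(y, *p) for w, p in zip(weights, params)]
    total = sum(joint)
    return [j / total for j in joint]


def classify(y, weights, params):
    """Assign y to the component with the largest posterior probability."""
    post = posterior_probs(y, weights, params)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical two-component mixture: (xi, omega, alpha) per component.
WEIGHTS = [0.6, 0.4]
PARAMS = [(0.0, 1.0, 3.0), (4.0, 1.5, -2.0)]

if __name__ == "__main__":
    for y in (0.5, 3.5):
        print(y, mixture_pdf(y, WEIGHTS, PARAMS), classify(y, WEIGHTS, PARAMS))
```

In a full analysis, the weights and component parameters would come from an EM-type fit, and the same posterior-probability rule would be applied with the multivariate component densities.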

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62E10 Characterization and structure theory of statistical distributions
62G07 Density estimation
62-07 Data analysis (statistics) (MSC2010)
62P05 Applications of statistics to actuarial sciences and financial mathematics
62H35 Image analysis in multivariate analysis
