
Mixtures of skew-\(t\) factor analyzers. (English) Zbl 1506.62132

Summary: A mixture of skew-\(t\) factor analyzers is introduced as well as a family of mixture models based thereon. The particular formulation of the skew-\(t\) distribution used arises as a special case of the generalized hyperbolic distribution. Like their Gaussian and \(t\)-distribution analogues, mixtures of skew-\(t\) factor analyzers are very well-suited for model-based clustering of high-dimensional data. The alternating expectation-conditional maximization algorithm is used for model parameter estimation and the Bayesian information criterion is used for model selection. The models are applied to both real and simulated data, giving superior clustering results when compared to a well-established family of Gaussian mixture models.
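
As an illustration of the model-selection step mentioned in the summary, the following is a minimal sketch of choosing among fitted mixture models by the Bayesian information criterion. It assumes the "larger is better" convention, BIC = 2 log L - k log n, which may differ in sign from the convention used elsewhere. The function names, the candidate labels (number of components G, number of factors q), the log-likelihoods, and the parameter counts are all hypothetical and are not taken from the paper.

```python
import math


def bic(loglik: float, n_params: int, n_obs: int) -> float:
    """BIC in the 'larger is better' convention (an assumption of this sketch)."""
    return 2.0 * loglik - n_params * math.log(n_obs)


def select_by_bic(fits, n_obs):
    """Return the (label, loglik, n_params) tuple with the largest BIC."""
    return max(fits, key=lambda fit: bic(fit[1], fit[2], n_obs))


# Hypothetical fitted models: (label, maximized log-likelihood, free parameters).
# These numbers are made up for illustration only.
candidate_fits = [
    ("G=1, q=1", -1250.0, 20),
    ("G=2, q=1", -1100.0, 41),
    ("G=3, q=2", -1090.0, 75),
]

best = select_by_bic(candidate_fits, n_obs=200)
print(best[0])
```

In this made-up example the third model has the highest log-likelihood, but its extra parameters are penalized, so the two-component model is selected.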

MSC:

62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)

Software:

PGMM; R; mclust; clusfind
