Dimension reduction for model-based clustering via mixtures of multivariate \(t\)-distributions. (English) Zbl 1273.62141

Summary: We introduce a dimension reduction method for model-based clustering obtained from a finite mixture of \(t\)-distributions. This approach is based on existing work on reducing dimensionality in the case of finite Gaussian mixtures. The method relies on identifying a reduced subspace of the data by considering the extent to which group means and group covariances vary. This subspace contains linear combinations of the original data, which are ordered by importance via the associated eigenvalues. Observations can be projected onto the subspace and the resulting set of variables captures most of the clustering structure available in the data. The approach is illustrated using simulated and real data, where it outperforms its Gaussian analogue.
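The subspace construction described in the summary can be illustrated with a minimal sketch. The summary does not state the kernel matrix, so the form used here is an assumption borrowed from the Gaussian analogue the method builds on: a term for variation in group means plus a term for variation in group covariances, solved as a generalized eigenproblem against the marginal covariance. For brevity the sketch takes hard cluster labels as input rather than fitting a mixture of \(t\)-distributions; the function name `clustering_subspace` and all variable names are hypothetical.

```python
import numpy as np


def clustering_subspace(X, labels):
    """Return eigenvalues and directions ordered by clustering importance.

    Assumed kernel (not taken from the summary):
        M = M_means @ inv(Sigma) @ M_means + M_covs
    solved as the generalized eigenproblem M v = lambda * Sigma v.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)  # marginal covariance

    groups = np.unique(labels)
    props = [float((labels == g).mean()) for g in groups]
    means = [X[labels == g].mean(axis=0) for g in groups]
    covs = [np.cov(X[labels == g], rowvar=False, bias=True) for g in groups]
    pooled = sum(w * S for w, S in zip(props, covs))

    Sigma_inv = np.linalg.inv(Sigma)
    # Variation in group means around the overall mean.
    M_means = sum(w * np.outer(m - mu, m - mu) for w, m in zip(props, means))
    # Variation in group covariances around the pooled covariance.
    M_covs = sum(
        w * (S - pooled) @ Sigma_inv @ (S - pooled) for w, S in zip(props, covs)
    )
    M = M_means @ Sigma_inv @ M_means + M_covs

    # Reduce M v = lambda * Sigma v to a symmetric problem via Cholesky:
    # with Sigma = L L^T and v = L^{-T} u, we get (L^{-1} M L^{-T}) u = lambda u.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sigma))
    A = L_inv @ M @ L_inv.T
    evals, U = np.linalg.eigh((A + A.T) / 2)
    order = np.argsort(evals)[::-1]  # largest eigenvalue first
    return evals[order], L_inv.T @ U[:, order]


# Usage: two groups separated along the first of four coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
labels = np.repeat([0, 1], 200)
X[labels == 1, 0] += 6.0
evals, V = clustering_subspace(X, labels)
Z = (X - X.mean(axis=0)) @ V[:, :1]  # projection onto the leading direction
```

In this toy setting the leading direction loads mainly on the first coordinate, so the one-dimensional projection `Z` retains the separation between the two groups. In the method itself, the group proportions, means, and covariances would come from the fitted mixture rather than from hard labels.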

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
65C60 Computational problems in statistics (MSC2010)

Keywords:

mixture models
