Factor probabilistic distance clustering (FPDC): a new clustering method. (English) Zbl 1414.62279

Summary: Factor clustering methods have been developed in recent years thanks to improvements in computational power. These methods linearly transform the data and cluster the transformed data, optimizing a common criterion. Probabilistic distance (PD)-clustering is an iterative, distribution-free, probabilistic clustering method. Factor PD-clustering (FPDC) is based on PD-clustering and linearly transforms the original variables into a smaller number of orthogonal ones, using a criterion shared with PD-clustering. This paper demonstrates that Tucker3 decomposition can be used to accomplish this transformation. FPDC alternates between Tucker3 decomposition and PD-clustering of the transformed data until convergence is achieved. This method can significantly improve the performance of the PD-clustering algorithm; large data sets can thus be partitioned into clusters with greater stability and robustness of the results. Real and simulated data sets are used to compare FPDC with its main competitors: it performs equally well when clusters are elliptically shaped but outperforms its competitors on non-Gaussian-shaped clusters or noisy data.
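
The summary does not spell out the PD-clustering iteration, so the following is only a minimal sketch of the basic step as it is commonly described for probabilistic d-clustering (Ben-Israel and Iyigun): the product of membership probability and distance is assumed constant across clusters for each point, and centers are updated as weighted means with weights p^2/d. The function name `pd_clustering`, its parameters, and the small `eps` guard are illustrative assumptions, not taken from the paper, and the Tucker3 step that FPDC alternates with is deliberately left out.

```python
import math


def pd_clustering(points, centers, n_iter=100, tol=1e-9):
    """Sketch of basic PD-clustering (assumed formulation, not the paper's code).

    points  : list of equal-length coordinate tuples
    centers : list of initial center tuples, one per cluster
    Returns the final centers and the membership probabilities for each point.
    """
    eps = 1e-12  # assumption: guards against a point sitting exactly on a center
    dim = len(points[0])

    def memberships(current):
        # Distances d_k(x) from every point to every center.
        dists = [[max(math.dist(x, c), eps) for c in current] for x in points]
        # p_k(x) * d_k(x) constant over k implies p_k(x) proportional to 1 / d_k(x).
        probs = []
        for row in dists:
            inv = [1.0 / d for d in row]
            total = sum(inv)
            probs.append([v / total for v in inv])
        return dists, probs

    for _ in range(n_iter):
        dists, probs = memberships(centers)
        new_centers = []
        for k in range(len(centers)):
            # Weighted mean with weights u_k(x) = p_k(x)^2 / d_k(x).
            w = [p[k] ** 2 / d[k] for p, d in zip(probs, dists)]
            wsum = sum(w)
            new_centers.append(tuple(
                sum(wi * x[j] for wi, x in zip(w, points)) / wsum
                for j in range(dim)
            ))
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break

    # Recompute memberships so they match the returned centers.
    _, probs = memberships(centers)
    return centers, probs


# Usage on a toy data set with two well-separated groups (made-up values).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centers, final_probs = pd_clustering(data, [(1.0, 1.0), (9.0, 9.0)])
labels = [row.index(max(row)) for row in final_probs]
```

A hard partition is obtained here by assigning each point to its most probable cluster. In FPDC as summarized above, a step of this kind would be applied to the transformed data rather than to the original variables.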

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
