zbMATH — the first resource for mathematics

Iterative factor clustering of binary data. (English) Zbl 1305.65035
Summary: Binary data are a special case in which both distance measures and co-occurrence measures can be used. Euclidean distance-based non-hierarchical methods, such as the $$k$$-means algorithm or one of its variants, can be profitably applied. As the number of available attributes increases, overall clustering performance usually worsens. In such cases, enhancing group separability requires removing irrelevant, redundant, and noisy information from the data. The present approach is an attribute transformation strategy: it combines clustering and factorial techniques to identify attribute associations that characterize one or more homogeneous groups of statistical units. It also provides graphical representations that facilitate the interpretation of the results.
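
The summary does not spell out the algorithm, so the following is only a minimal sketch of one generic way to alternate a factorial step with a $$k$$-means-style step on 0/1 data. The function name, its parameters, and the choice of an SVD of size-weighted group means as the factorial step are all assumptions made for illustration; this should not be read as the procedure of the paper.

```python
import numpy as np


def iterative_factor_clustering(X, n_clusters, n_factors, n_iter=50, seed=0):
    """Alternate between (a) finding axes that separate the current groups
    and (b) reassigning units by Euclidean distance in that reduced space.

    Hypothetical helper for illustration only; not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # centre each binary attribute
    labels = rng.integers(n_clusters, size=n)    # random initial partition

    for _ in range(n_iter):
        # (a) Factorial step: SVD of the size-weighted group-mean matrix.
        means = np.zeros((n_clusters, p))
        sizes = np.zeros(n_clusters)
        for k in range(n_clusters):
            members = labels == k
            sizes[k] = members.sum()
            if sizes[k] > 0:
                means[k] = Xc[members].mean(axis=0)
        weighted = np.sqrt(sizes)[:, None] * means
        _, _, vt = np.linalg.svd(weighted, full_matrices=False)
        axes = vt[:n_factors].T                  # attribute loadings, shape (p, d)

        # (b) Clustering step: nearest group centroid in the factor space.
        scores = Xc @ axes
        centroids = means @ axes
        dist = ((scores[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        dist[:, sizes == 0] = np.inf             # never assign to an empty group
        new_labels = dist.argmin(axis=1)

        if np.array_equal(new_labels, labels):   # partition is stable: stop
            break
        labels = new_labels

    return labels, axes


# Usage on a small random 0/1 matrix (synthetic data, for illustration only).
demo = (np.random.default_rng(1).random((60, 8)) < 0.5).astype(int)
demo_labels, demo_axes = iterative_factor_clustering(demo, n_clusters=3, n_factors=2)
```

The returned loadings in `axes` indicate which attributes drive the separation between groups, and the projected scores `Xc @ axes` could be plotted to obtain the kind of graphical representation the summary mentions. A random start can settle in a poor partition, so in practice several seeds would normally be compared.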

MSC:
 65C60 Computational problems in statistics (MSC2010)
Software:
 clusfind; ROCK