×

Interpretable clustering using unsupervised binary trees. (English) Zbl 1267.62075

Summary: We introduce a new method of interpretable clustering that uses unsupervised binary trees. It is a three-stage procedure, the first stage of which entails a series of recursive binary splits to reduce the heterogeneity of the data within the new subsamples. During the second stage (pruning), consideration is given to whether adjacent nodes can be aggregated. Finally, during the third stage (joining), similar clusters are joined together, even if they do not share the same parent originally. Consistency results are obtained, and the procedure is used on simulated and real data sets.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
05C90 Applications of graph theory

Software:

mclust; robustbase
PDF BibTeX XML Cite
Full Text: DOI arXiv Link

References:

[1] Andreopoulos B, An A, Wang X (2007) Hierarchical density-based clustering of categorical data and a simplification. Adv Knowl Discov Data Min, pp 11–22
[2] Basak J, Krishnapuram R (2005) Interpretable hierarchical clustering by construction an unsupervised decision tree. IEEE Trans Knowl Data Eng 17:121–132 · Zbl 05108645
[3] Blockeel H, De Raedt L, Ramon J (1998) Top–down induction of clustering trees. In: Morgan K (ed) Proceedings of the 15th international conference on machine learning, pp 55–63
[4] Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC, Boca Raton · Zbl 0541.62042
[5] Chavent M, Guinot C, Lechevallier Y, Tenenhaus M (1999) Méthodes Divisives de Classification et Segmentation Non Supervisée: Recherche d’une Typologie de la Peau Humanine Saine. Rev. Statistique Appliquée, XLVII 87–99
[6] Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of 2nd international conference on knowledge discovery and data mining. Portland, OR, pp 226–231
[7] Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97:611–631 · Zbl 1073.62545
[8] Fraley C, Raftery AE (2009) MCLUST Version 3 for R: Normal Mixture Modeling and Model-based Clustering, Technical Report No. 504, Department of Statistics, University of Washington
[9] Gan G, Ma C, Wu J (2007) Clustering data: theory, algorithms and applications. Springer, Berlin · Zbl 1185.68274
[10] García-Escudero L, Gordaliza A, Matrán C, Mayo-Iscar A (2010) A review of robust clustering methods. Adv Data Anal Classif 4(2):89–109 · Zbl 1284.62375
[11] Gini CW (1912) Variability and mutability, contribution to the study of statistical distributions and relations. Studi Economico-Giuridici della R. Universita de Cagliari
[12] Hastie T, Tibshirani R, Friedman JH (2003) The elements of statistical learning. Springer, Berlin
[13] Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218 · Zbl 0587.62128
[14] Iyigun C, Ben-Israel A (2008) Probabilistic distance clustering adjusted for cluster size. Prob Eng Inf Sci 22:603–621 · Zbl 1152.68500
[15] Liu B, Xia Y, Yu PS (2000) Clustering through decision tree construction. CIKM ’00. In: Proceedings of the ninth international conference on information and knowledge management. ACM, New York, NY, USA, pp 20–29
[16] Maronna RA, Martin DR, Yohai VJ (2006) Robust statistics: theory and methods. Wiley Series in Probability and Statistics, Wiley, New York
[17] Melia M (2012) Local equivalences of distances between clusterings–a geometric perpective. Mach Learn 86(3):369–389 · Zbl 1267.68187
[18] Oh MS, Raftery A (2007) Model-based clustering with dissimilarities: a Bayesian approach. J Comput Gr Stat 16(3):559–585
[19] Papadimitriou C, Steiglitz K (1982) Combinatorial optimization: algorithms and complexity. Prentice Hall, Englewood Cliffs · Zbl 0503.90060
[20] Peña D, Prieto FJ (2001) Cluster identification using projections. J Am Stat Assoc 96(456):1433–1445 · Zbl 1051.62055
[21] Peña D, Prieto FJ (2006) A projection method for robust estimation and clustering in large data set. In: Zani S, Cerioli A, Riani M, Vichi M (eds) Data analysis, classification and the forward search. Springer, Berlin
[22] Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846–850
[23] Steinley D, Brusco M (2007) Initializing K-means batch clustering: a critical evaluation of several techniques. J Classif 24:99–121 · Zbl 1144.62331
[24] Tryon RC (1939) Cluster analysis. McGraw-Hill, Columbus · Zbl 0023.09506
[25] Walther G (2010) Optimal and fast detection of spatial clusters with scan statistics. Ann Stat 24(2): 1010–1033 · Zbl 1183.62076
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.