×

Generalization of clustering agreements and distances for overlapping clusters and network communities. (English) Zbl 1405.62084

Summary: A measure of distance between two clusterings has important applications, including clustering validation and ensemble clustering. Generally, such distance measure provides navigation through the space of possible clusterings. Mostly used in cluster validation, a normalized clustering distance, a.k.a. agreement measure, compares a given clustering result against the ground-truth clustering. The two widely-used clustering agreement measures are adjusted rand index and normalized mutual information. In this paper, we present a generalized clustering distance from which these two measures can be derived. We then use this generalization to construct new measures specific for comparing (dis)agreement of clusterings in networks, a.k.a. communities. Further, we discuss the difficulty of extending the current, contingency based, formulations to overlapping cases, and present an alternative algebraic formulation for these (dis)agreement measures. Unlike the original measures, the new co-membership based formulation is easily extendable for different cases, including overlapping clusters and clusters of inter-related data. These two extensions are, in particular, important in the context of finding communities in complex networks.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
05C82 Small world graphs, complex networks (graph-theoretic aspects)
68T05 Learning and adaptive systems in artificial intelligence
91D30 Social networks; opinion dynamics
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] Aggarwal CC, Reddy CK (2014) Data clustering: algorithms and applications. CRC Press, Boca Raton
[2] Albatineh AN, Niewiadomska-Bugaj M, Mihalko D (2006) On similarity indices and correction for chance agreement. J Classif 23:301-313 · Zbl 1336.62168 · doi:10.1007/s00357-006-0017-z
[3] Anderson DT, Bezdek JC, Popescu M, Keller JM (2010) Comparing fuzzy, probabilistic, and possibilistic partitions. IEEE Trans Fuzzy Syst 18(5):906-918 · doi:10.1109/TFUZZ.2010.2052258
[4] Banerjee A, Merugu S, Dhillon IS, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn Res 6:1705-1749 · Zbl 1190.62117
[5] Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008:P10008+ · Zbl 1459.91130 · doi:10.1088/1742-5468/2008/10/P10008
[6] Brouwer RK (2008) Extending the rand, adjusted rand and Jaccard indices to fuzzy partitions. J Intell Inf Syst 32(3):213-235 · doi:10.1007/s10844-008-0054-7
[7] Campello RR (2010) Generalized external indexes for comparing data partitions with overlapping categories. Pattern Recogn Lett 31(9):966-975 · doi:10.1016/j.patrec.2010.01.002
[8] Collins LM, Dent CW (1988) Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivar Behav Res 23(2):231-242 · doi:10.1207/s15327906mbr2302_6
[9] Cortina-Borja M (2012) Handbook of parametric and nonparametric statistical procedures, 5th edn. J R Stat Soc Ser A 175(3):829-829
[10] Cui Y, Fern X, Dy J (2007) Non-redundant multi-view clustering via orthogonalization. In: Seventh IEEE International Conference on Data Mining, 2007. ICDM 2007. pp 133-142
[11] Dhillon IS, Tropp JA (2007) Matrix nearness problems with bregman divergences. SIAM J Matrix Anal Appl 29(4):1120-1146 · Zbl 1153.65044 · doi:10.1137/060649021
[12] Fortunato S (2010) Community detection in graphs. Phys Rep 486(35):75-174 · doi:10.1016/j.physrep.2009.11.002
[13] Gregory S (2010) Finding overlapping communities in networks by label propagation. New J Phys 12(10):103,018 · doi:10.1088/1367-2630/12/10/103018
[14] Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193-218 · doi:10.1007/BF01908075
[15] Hullermeier E, Rifqi M, Henzgen S, Senge R (2012) Comparing fuzzy partitions: a generalization of the rand index and related measures. IEEE Trans Fuzzy Syst 20(3):546-556 · doi:10.1109/TFUZZ.2011.2179303
[16] Kulis B, Sustik MA, Dhillon IS (2009) Low-rank kernel learning with bregman matrix divergences. J Mach Learn Res 10:341-376 · Zbl 1235.68166
[17] Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80(5):056,117 · doi:10.1103/PhysRevE.80.056117
[18] Lancichinetti A, Fortunato S, Kertesz J (2008a) Detecting the overlapping and hierarchical community structure of complex networks. New J Phys 11(3):20
[19] Lancichinetti A, Fortunato S, Radicchi F (2008b) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046,110 · doi:10.1103/PhysRevE.78.046110
[20] Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant communities in networks. PLoS One 6(4):e18,961 · doi:10.1371/journal.pone.0018961
[21] Light RJ, Margolin BH (1971) An analysis of variance for categorical data. J Am Stat Assoc 66(335):534-544 · Zbl 0222.62035 · doi:10.1080/01621459.1971.10482297
[22] Manning CD, Raghavan P, Schtze H (2008) Introduction to information retrieval. Cambridge University Press, New York · Zbl 1160.68008 · doi:10.1017/CBO9780511809071
[23] Mcauley J, Leskovec J (2014) Discovering social circles in ego networks. ACM Trans Knowl Discov Data 8(1):4:1-4:28 · doi:10.1145/2556612
[24] McDaid A, Hurley N (2010) Detecting highly overlapping communities with model-based overlapping seed expansion. In: 2010 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, pp 112-119
[25] McDaid AF, Greene D, Hurley N (2011) Normalized mutual information to evaluate overlapping community finding algorithms. arXiv:1110.2515
[26] Meilă M (2007) Comparing clusteringsan information based distance. J Multivar Anal 98(5):873-895 · Zbl 1298.91124 · doi:10.1016/j.jmva.2006.11.013
[27] Newman MEJ (2004) Fast algorithm for detecting community structure in networks. Phys Rev E 69(066):133
[28] Nielsen F, Nock R (2014) On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process Lett 21(1):10-13 · doi:10.1109/LSP.2013.2288355
[29] Pons P, Latapy M (2005) Computing communities in large networks using random walks. In: Computer and Information Sciences-ISCIS 2005, Springer, pp 284-293 · Zbl 1161.68694
[30] Quere R, Le Capitaine H, Fraisseix N, Frelicot C (2010) On normalizing fuzzy coincidence matrices to compare fuzzy and/or possibilistic partitions with the rand index. In: 2010 IEEE International Conference on Data Mining, IEEE, pp 977-982
[31] Ronhovde P, Nussinov Z (2009) Multiresolution community detection for megascale networks by information-based replica correlations. Phys Rev E 80(1):016,109 · doi:10.1103/PhysRevE.80.016109
[32] Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci 105(4):1118-1123 · doi:10.1073/pnas.0706851105
[33] Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, New York, ICML ’09, pp 1073-1080
[34] Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837-2854 · Zbl 1242.62062
[35] Warrens MJ (2008a) On similarity coefficients for \[2\times 22\]×2 tables and correction for chance. Psychometrika 73:487-502 · Zbl 1301.62125 · doi:10.1007/s11336-008-9059-y
[36] Warrens MJ (2008b) On the equivalence of Cohen’s Kappa and the Hubert-Arabie adjusted rand index. J Classif 25:177-183 · Zbl 1276.62043 · doi:10.1007/s00357-008-9023-7
[37] Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, New York, KDD ’09, pp 877-886
[38] Yang J, Leskovec J (2013) Overlapping community detection at scale: a nonnegative matrix factorization approach. In: Proceedings of the sixth ACM international conference on Web search and data mining, ACM, New York, pp 587-596
[39] Zhou D, Li J, Zha H (2005) A new mallows distance based metric for comparing clusterings. In: Proceedings of the 22nd international conference on Machine learning, ACM, New York, pp 1028-1035
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.