Clustering of imbalanced high-dimensional media data. (English) Zbl 1414.62232

Summary: Media content in large repositories usually falls into multiple groups of strongly varying sizes. Media of potential interest often form notably smaller groups that differ so much from the remaining data that they are worth examining in more detail, whereas media with popular content appear in larger groups. Identifying groups of such varying sizes is the task of clustering imbalanced data; clustering highly imbalanced media groups is further complicated by the high dimensionality of the underlying features. In this paper, we present the imbalanced clustering (IClust) algorithm, designed to reveal group structures in high-dimensional media data. IClust employs an existing clustering method to find an initial, large set of potentially highly pure clusters, which are then successively merged. The main advantages of IClust are that the number of clusters does not have to be pre-specified and that no specific assumptions about the cluster or data characteristics need to be made. Experiments on real-world media data demonstrate that, in comparison to existing methods, IClust better identifies media groups, especially small ones.
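The two-stage idea described in the summary, over-clustering first and then successively merging, can be sketched as follows. This is a hedged illustration, not the paper's implementation: the base clusterer (a plain Lloyd-style k-means with deterministic spread-out initialization) and the merge-stopping rule (stopping when the centroid merge distance jumps sharply) are placeholder assumptions standing in for the choices made in the paper.

```python
import numpy as np

def overcluster(X, k, iters=100):
    # Stage 1: deliberately over-cluster with a plain Lloyd-style k-means.
    # Deterministic, spread-out initialization; the paper may use any
    # existing base clustering method here.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def merge_clusters(X, labels, stop_ratio=2.0):
    # Stage 2: repeatedly merge the two clusters with the closest centroids.
    # Stopping when the next merge distance exceeds stop_ratio times the
    # previous one is a placeholder for the paper's actual merge criterion;
    # note that no target number of clusters is specified in advance.
    clusters = {c: list(np.where(labels == c)[0]) for c in np.unique(labels)}
    last_d = None
    while len(clusters) > 1:
        keys = list(clusters)
        cents = {c: X[clusters[c]].mean(axis=0) for c in keys}
        (a, b), best_d = min(
            (((a, b), np.linalg.norm(cents[a] - cents[b]))
             for i, a in enumerate(keys) for b in keys[i + 1:]),
            key=lambda t: t[1])
        if last_d is not None and best_d > stop_ratio * last_d:
            break  # large jump in merge distance: remaining clusters stay separate
        clusters[a].extend(clusters.pop(b))
        last_d = best_d
    out = np.empty(len(X), dtype=int)
    for new_label, idx in enumerate(clusters.values()):
        out[idx] = new_label
    return out
```

Because the initial clustering uses a k well above the expected number of groups, a small group can occupy its own pure initial cluster and survive the merging phase, rather than being absorbed into a large cluster by a method forced to produce a fixed, small k.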


62H30 Classification and discrimination; cluster analysis (statistical aspects)
Full Text: DOI arXiv

