
Efficient mixture model for clustering of sparse high dimensional binary data. (English) Zbl 1464.62328

Summary: Clustering is one of the fundamental tools for preliminary analysis of data. While most clustering methods are designed for continuous data, sparse high-dimensional binary representations have become very popular in domains such as text mining and cheminformatics. Applying classical clustering tools to this type of data usually proves inefficient, both in terms of computational complexity and in terms of the utility of the results. In this paper we propose a mixture model, SparseMix, for clustering sparse high-dimensional binary data, which connects model-based and centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on the EM algorithm, SparseMix is specially designed for processing sparse data, can be efficiently realized by an on-line Hartigan optimization algorithm, and describes every cluster by its most representative vector. We performed extensive experimental studies on various types of data, which confirmed that SparseMix builds partitions that agree more closely with reference groupings than those of related methods. Moreover, the constructed representatives often better reveal the internal structure of the data.
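The summary describes the core computational idea: each cluster is scored by how well a simple probabilistic model describes its members, and points are reassigned one at a time whenever a move lowers the total cost (the on-line Hartigan scheme). The sketch below illustrates that scheme under deliberately simplified assumptions: a plain per-cluster Bernoulli cost, dense NumPy arrays instead of sparse structures, and helper names of our own (hartigan_bernoulli, cluster_cost). It is not the authors' SparseMix implementation, which additionally maintains an explicit representative per cluster and exploits sparsity for efficiency.

import numpy as np

def binary_entropy(p):
    # Elementwise binary entropy in nats; probabilities clipped to avoid log(0).
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def cluster_cost(count_sum, n):
    # Negative log-likelihood of a cluster of n binary points under its
    # maximum-likelihood Bernoulli parameters: n * sum_d H(count_sum_d / n).
    if n == 0:
        return 0.0
    return n * binary_entropy(count_sum / n).sum()

def hartigan_bernoulli(X, k, n_iter=20, seed=0):
    # On-line Hartigan-style clustering of binary rows of X into k groups:
    # move a point whenever the move decreases the total Bernoulli cost.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labels = rng.integers(k, size=n)
    sums = np.zeros((k, d))            # per-cluster feature counts
    sizes = np.zeros(k, dtype=int)
    for i in range(n):
        sums[labels[i]] += X[i]
        sizes[labels[i]] += 1
    for _ in range(n_iter):
        moved = False
        for i in range(n):
            old = labels[i]
            base = cluster_cost(sums[old], sizes[old])
            removed = cluster_cost(sums[old] - X[i], sizes[old] - 1)
            best_j, best_delta = old, 0.0
            for j in range(k):
                if j == old:
                    continue
                gain = (cluster_cost(sums[j] + X[i], sizes[j] + 1)
                        - cluster_cost(sums[j], sizes[j]))
                delta = (removed - base) + gain  # total cost change of the move
                if delta < best_delta:
                    best_j, best_delta = j, delta
            if best_j != old:
                sums[old] -= X[i]; sizes[old] -= 1
                sums[best_j] += X[i]; sizes[best_j] += 1
                labels[i] = best_j
                moved = True
        if not moved:
            break
    # A simple cluster "representative": the majority (most frequent) pattern.
    representatives = (sums / np.maximum(sizes, 1)[:, None] > 0.5).astype(int)
    return labels, representatives

# Example usage (synthetic data):
#   X = np.random.default_rng(1).integers(2, size=(200, 50))
#   labels, reps = hartigan_bernoulli(X, k=3)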

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62R07 Statistical aspects of big data and data science

References:

[1] Asuncion A, Newman DJ (2007) UCI machine learning repository
[2] Bai, L.; Liang, J.; Dang, C.; Cao, F., A novel attribute weighting algorithm for clustering high-dimensional categorical data, Pattern Recognit, 44, 12, 2843-2861 (2011) · Zbl 1218.68115
[3] Baker LD, McCallum AK (1998) Distributional clustering of words for text classification. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 96-103
[4] Barbará D, Li Y, Couto J (2002) Coolcat: an entropy-based algorithm for categorical clustering. In: Proceedings of the eleventh international conference on information and knowledge management. ACM, pp 582-589
[5] Baxter, RA; Oliver, JJ, Finding overlapping components with mml, Stat Comput, 10, 1, 5-16 (2000)
[6] Benaglia, T.; Chauveau, D.; Hunter, D.; Young, D., mixtools: an r package for analyzing finite mixture models, J Stat Softw, 32, 6, 1-29 (2009)
[7] Bontemps, D.; Toussile, W., Clustering and variable selection for categorical multivariate data, Electron J Stat, 7, 2344-2371 (2013) · Zbl 1349.62259
[8] Bouguila, N., On multivariate binary data clustering and feature weighting, Comput Stat Data Anal, 54, 1, 120-134 (2010) · Zbl 1284.62364
[9] Bouguila, N.; ElGuebaly, W., Discrete data clustering using finite mixture models, Pattern Recognit, 42, 1, 33-42 (2009) · Zbl 1162.68598
[10] Bouguila, N.; Ziou, D., High-dimensional unsupervised selection and estimation of a finite generalized dirichlet mixture model based on minimum message length, IEEE Trans Pattern Anal Mach Intell, 29, 10, 1716-1731 (2007)
[11] Bozdogan, H.; Sclove, SL, Multi-sample cluster analysis using akaike’s information criterion, Ann Inst Stat Math, 36, 1, 163-180 (1984) · Zbl 0552.62050
[12] Cagnone, S.; Viroli, C., A factor mixture analysis model for multivariate binary data, Stat Model, 12, 3, 257-277 (2012) · Zbl 07257879
[13] Celeux, G.; Govaert, G., Clustering criteria for discrete data and latent class models, J Classif, 8, 2, 157-176 (1991) · Zbl 0775.62150
[14] Chan, EY; Ching, WK; Ng, MK; Huang, JZ, An optimization algorithm for clustering using weighted dissimilarity measures, Pattern Recognit, 37, 5, 943-952 (2004) · Zbl 1072.68549
[15] Chawla NV (2009) Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook. Springer, pp 875-886
[16] Chen, L.; Wang, S.; Wang, K.; Zhu, J., Soft subspace clustering of categorical data with probabilistic distance, Pattern Recognit, 51, 322-332 (2016)
[17] Cover, TM; Thomas, JA, Elements of information theory (2012), Hoboken: Wiley, Hoboken
[18] Dhillon IS, Guan Y (2003) Information theoretic clustering of sparse cooccurrence data. In: Third IEEE international conference on data mining, 2003 (ICDM 2003). IEEE, pp 517-520
[19] dos Santos, TR; Zárate, LE, Categorical data clustering: what similarity measure to recommend?, Exp Syst Appl, 42, 3, 1247-1260 (2015)
[20] Elmore, RT; Hettmansperger, TP; Thomas, H., Estimating component cumulative distribution functions in finite mixture models, Commun Stat Theory Methods, 33, 9, 2075-2086 (2004) · Zbl 1215.62036
[21] Ewing, T.; Baber, JC; Feher, M., Novel 2d fingerprints for ligand-based virtual screening, J Chem Inf Model, 46, 6, 2423-2431 (2006)
[22] Fränti, P.; Xu, M.; Kärkkäinen, I., Classification of binary vectors by using \(\Delta\text{SC}\) distance to minimize stochastic complexity, Pattern Recognit Lett, 24, 1, 65-73 (2003) · Zbl 1054.68120
[23] Ghosh JK, Delampady M, Samanta T (2006) Hypothesis testing and model selection. In: An introduction to Bayesian analysis: theory and methods. Springer, pp 159-204 · Zbl 1135.62002
[24] Gollini, I.; Murphy, TB, Mixture of latent trait analyzers for model-based clustering of categorical data, Stat Comput, 24, 4, 569-588 (2014) · Zbl 1325.62122
[25] Graham, MW; Miller, DJ, Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection, IEEE Trans Signal Process, 54, 4, 1289-1303 (2006) · Zbl 1373.94601
[26] Grantham NS (2014) Clustering binary data with bernoulli mixture models. Unpublished written preliminary exam. NC State University
[27] Pang, G.; Jiang, S.; Jin, H., CenKNN: a scalable and effective text classifier, Data Min Knowl Discov, 29, 3, 593-625 (2015)
[28] Hartigan, JA; Wong, MA, Algorithm AS 136: a k-means clustering algorithm, Appl Stat, 28, 100-108 (1979) · Zbl 0447.62062
[29] He X, Cai D, Niyogi P (2006) Laplacian score for feature selection. In: Advances in neural information processing systems, pp 507-514
[30] Huang, Z., Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Min Knowl Discov, 2, 3, 283-304 (1998)
[31] Hubert, L.; Arabie, P., Comparing partitions, J Classif, 2, 1, 193-218 (1985)
[32] Indhumathi, R.; Sathiyabama, S., Reducing and clustering high dimensional data through principal component analysis, Int J Comput Appl, 11, 8, 1-4 (2010)
[33] Jang E, Gu S, Poole B (2017) Categorical reparameterization with Gumbel-Softmax. In: International conference on learning representations 2017. http://OpenReviews.net/
[34] Juan, A.; Vidal, E., On the use of bernoulli mixture models for text classification, Pattern Recognit, 35, 12, 2705-2710 (2002) · Zbl 1010.68154
[35] Karypis G (2002) CLUTO: a clustering toolkit. Technical report, DTIC document
[36] Klekota, J.; Roth, FP, Chemical substructures that enrich for biological activity, Bioinformatics, 24, 21, 2518-2525 (2008)
[37] Langseth, H.; Nielsen, TD, Latent classification models for binary data, Pattern Recognit, 42, 11, 2724-2736 (2009) · Zbl 1175.68368
[38] LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P., Gradient-based learning applied to document recognition, Proc IEEE, 86, 11, 2278-2324 (1998)
[39] Lewis, DD; Yang, Y.; Rose, TG; Li, F., Rcv1: a new benchmark collection for text categorization research, J Mach Learn Res, 5, Apr, 361-397 (2004)
[40] Li T (2005) A general model for clustering binary data. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, pp 188-197
[41] Li X, Roth D (2002) Learning question classifiers. In: Proceedings of the 19th international conference on Computational linguistics-volume 1. Association for Computational Linguistics, pp 1-7
[42] Li T, Ma S, Ogihara M (2004) Entropy-based criterion in categorical clustering. In: Proceedings of the twenty-first international conference on machine learning. ACM, p 68
[43] Li, CG; You, C.; Vidal, R., Structured sparse subspace clustering: a joint affinity learning and subspace clustering framework, IEEE Trans Image Process, 26, 6, 2988-3001 (2017) · Zbl 1409.94342
[44] Li CG, You C, Vidal R (2018) On geometric analysis of affine sparse subspace clustering. arXiv preprint arXiv:1808.05965
[45] Lu CY, Min H, Zhao ZQ, Zhu L, Huang DS, Yan S (2012) Robust and efficient subspace segmentation via least squares regression. In: European conference on computer vision. Springer, pp 347-360
[46] MacQueen J et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA, USA, vol 1, pp 281-297 · Zbl 0214.46201
[47] Mali, K.; Mitra, S., Clustering and its validation in a symbolic framework, Pattern Recognit Lett, 24, 14, 2367-2376 (2003) · Zbl 1047.68132
[48] McLachlan, GJ, On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture, Appl Stat, 36, 318-324 (1987)
[49] McLachlan, G.; Peel, D., Finite mixture models (2004), Hoboken: Wiley, Hoboken
[50] Nobile, A., On the posterior distribution of the number of components in a finite mixture, Ann Stat, 32, 5, 2044-2073 (2004) · Zbl 1056.62037
[51] Nobile, A.; Fearnside, AT, Bayesian finite mixtures with an unknown number of components: the allocation sampler, Stat Comput, 17, 2, 147-162 (2007)
[52] Papadopoulos, S.; Kompatsiaris, Y.; Vakali, A.; Spyridonos, P., Community detection in social media, Data Min Knowl Discov, 24, 3, 515-554 (2012)
[53] Plumbley MD (2002) Clustering of sparse binary data using a minimum description length approach
[54] Rahmani M, Atia G (2017) Innovation pursuit: a new approach to the subspace clustering problem. In: International conference on machine learning, pp 2874-2882 · Zbl 1414.62266
[55] Ribeiro MT, Singh S, Guestrin C (2016) Why should I trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1135-1144
[56] Richardson, S.; Green, PJ, On bayesian analysis of mixtures with an unknown number of components (with discussion), J R Stat Soc Ser B (Stat Methodol), 59, 4, 731-792 (1997) · Zbl 0891.62020
[57] Schwarz, G., Estimating the dimension of a model, Ann Stat, 6, 2, 461-464 (1978) · Zbl 0379.62005
[58] Serrà J, Karatzoglou A (2017) Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks. In: Proceedings of the eleventh ACM conference on recommender systems. ACM, pp 279-287
[59] Silvestre, C.; Cardoso, MG; Figueiredo, M., Feature selection for clustering categorical data with an embedded modelling approach, Exp Syst, 32, 3, 444-453 (2015)
[60] Silvestre C, Cardoso MG, Figueiredo MA (2014) Identifying the number of clusters in discrete mixture models. arXiv preprint arXiv:1409.7419
[61] Śmieja, M.; Tabor, J., Entropy of the mixture of sources and entropy dimension, IEEE Trans Inf Theory, 58, 5, 2719-2728 (2012) · Zbl 1365.94164
[62] Śmieja M, Nakoneczny S, Tabor J (2016) Fast entropy clustering of sparse high dimensional binary data. In: 2016 International joint conference on neural networks (IJCNN). IEEE, pp 2397-2404
[63] Śmieja, M.; Geiger, BC, Semi-supervised cross-entropy clustering with information bottleneck constraint, Inf Sci, 421, 254-271 (2017) · Zbl 1436.62286
[64] Spurek, P., General split gaussian cross-entropy clustering, Exp Syst Appl, 68, 58-68 (2017)
[65] Spurek, P.; Tabor, J.; Byrski, K., Active function cross-entropy clustering, Exp Syst Appl, 72, 49-66 (2017)
[66] Steinbach M, Ertöz L, Kumar V (2004) The challenges of clustering high dimensional data. In: New directions in statistical physics. Springer, pp 273-309 · Zbl 1078.62066
[67] Strouse D, Schwab DJ (2016) The deterministic information bottleneck. In: Proceedings conference on uncertainty in artificial intelligence (UAI), New York City, NY, pp 696-705 · Zbl 1478.68082
[68] Struski, Ł.; Tabor, J.; Spurek, P., Lossy compression approach to subspace clustering, Inf Sci, 435, 161-183 (2017) · Zbl 1440.68074
[69] Tabor, J.; Spurek, P., Cross-entropy clustering, Pattern Recognit, 47, 9, 3046-3059 (2014) · Zbl 1342.68279
[70] Tang, Y.; Browne, RP; McNicholas, PD, Model based clustering of high-dimensional binary data, Comput Stat Data Anal, 87, 84-101 (2015) · Zbl 1468.62191
[71] Tishby N, Pereira FC, Bialek W (1999) The information bottleneck method. In: Proceedings Allerton conference on communication, control, and computing, Monticello, IL, pp 368-377
[72] Tsakiris MC, Vidal R (2017) Hyperplane clustering via dual principal component pursuit. In: International conference on machine learning, pp 3472-3481
[73] Vermunt JK (2007) Multilevel mixture item response theory models: an application in education testing. In: Proceedings of the 56th session of the International Statistical Institute Lisbon, Portugal 2228
[74] Vinh, NX; Epps, J.; Bailey, J., Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance, J Mach Learn Res, 11, Oct, 2837-2854 (2010) · Zbl 1242.62062
[75] Warszycki, D.; Mordalski, S.; Kristiansen, K.; Kafel, R.; Sylte, I.; Chilmonczyk, Z.; Bojarski, AJ, A linear combination of pharmacophore hypotheses as a new tool in search of new active compounds – an application for \(5\text{-HT}_{1A}\) receptor ligands, PLoS ONE, 8, 12, e84510 (2013) · doi:10.1371/journal.pone.0084510
[76] Wen, JR; Nie, JY; Zhang, HJ, Query clustering using user logs, ACM Trans Inf Syst, 20, 1, 59-81 (2002)
[77] Yamamoto, M.; Hayashi, K., Clustering of multivariate binary data with dimension reduction via l 1-regularized likelihood maximization, Pattern Recognit, 48, 12, 3959-3968 (2015) · Zbl 1394.68322
[78] Yap, CW, Padel-descriptor: an open source software to calculate molecular descriptors and fingerprints, J Comput Chem, 32, 7, 1466-1474 (2011)
[79] Zhao, Y.; Karypis, G., Hierarchical clustering algorithms for document datasets, Data Min Knowl Discov, 10, 2, 141-168 (2005)
[80] You C, Li CG, Robinson DP, Vidal R (2016) Oracle based active set algorithm for scalable elastic net subspace clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3928-3937
[81] Zhao Y, Karypis G (2002) Evaluation of hierarchical clustering algorithms for document datasets. In: Proceedings of the eleventh international conference on information and knowledge management. ACM, pp 515-524