Noise-free latent block model for high dimensional data. (English) Zbl 1458.68171

Summary: Co-clustering is known to be a very powerful and efficient approach in unsupervised learning because of its ability to partition data based on both the observations and the variables of a given dataset. However, in a high-dimensional context, co-clustering methods may fail to provide a meaningful result due to the presence of noisy and/or irrelevant features. In this paper, we tackle this issue by proposing a novel co-clustering model which assumes the existence of a noise cluster that contains all irrelevant features. A variational expectation-maximization-based algorithm is derived for this task, where the automatic variable selection as well as the joint clustering of objects and variables are achieved via a Bayesian framework. Experimental results on synthetic datasets show the efficiency of our model in the context of high-dimensional noisy data. Finally, we highlight the interest of the approach on two real datasets whose goal is to study genetic diversity across the world.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)

References:

[1] Baudry JP, Celeux G, Marin JM (2008) Selecting models focussing on the modeller purpose. In: COMPSTAT 2008, Springer, pp 337-348 · Zbl 1147.62347
[2] Ben-David S, Haghtalab N (2014) Clustering in the presence of background noise. In: Proceedings of ICML, pp 280-288
[3] Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719-725 · doi:10.1109/34.865189
[4] Bouveyron C, Brunet-Saumard C (2014) Model-based clustering of high-dimensional data: a review. Comput Stat Data Anal 71:52-78 · Zbl 1471.62032 · doi:10.1016/j.csda.2012.12.008
[5] Brault V, Keribin C, Mariadassou M (2017) Consistency and asymptotic normality of latent blocks model estimators. arXiv preprint arXiv:1704.06629
[6] Celeux G, Martin-Magniette ML, Maugis C, Raftery AE (2011) Letter to the editor: “a framework for feature selection in clustering”. J Am Stat Assoc 106:383 · Zbl 1430.62126 · doi:10.1198/jasa.2011.tm10681
[7] Cuesta-Albertos JA, Gordaliza A, Matrán C (1997) Trimmed \(k\)-means: an attempt to robustify quantizers. Ann Stat 25(2):553-576 · Zbl 0878.62045 · doi:10.1214/aos/1031833664
[8] Dave RN (1991) Characterization and detection of noise in clustering. Pattern Recognit Lett 12(11):657-664 · doi:10.1016/0167-8655(91)90002-4
[9] Dave RN (1993) Robust fuzzy clustering algorithms. In: [Proceedings 1993] Second IEEE international conference on fuzzy systems, vol 2, pp 1281-1286
[10] Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, AAAI Press, pp 226-231
[11] Frühwirth-Schnatter S (2011) Dealing with label switching under model uncertainty. In: Mengersen KL, Robert CP, Titterington DM (eds) Mixtures: estimation and applications. Wiley, Hoboken, pp 213-239 · doi:10.1002/9781119995678.ch10
[12] García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2008) A general trimming approach to robust cluster analysis. Ann Stat 36(3):1324-1345 · Zbl 1360.62328 · doi:10.1214/07-AOS515
[13] García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2010) A review of robust clustering methods. Adv Data Anal Classif 4(2):89-109 · Zbl 1284.62375 · doi:10.1007/s11634-010-0064-5
[14] Govaert G, Nadif M (2003) Clustering with block mixture models. Pattern Recognit 36:463-473 · Zbl 1452.62444 · doi:10.1016/S0031-3203(02)00074-2
[15] Govaert G, Nadif M (2008) Block clustering with Bernoulli mixture models: comparison of different approaches. Comput Stat Data Anal 52(6):3233-3245 · Zbl 1452.62444 · doi:10.1016/j.csda.2007.09.007
[16] Govaert G, Nadif M (2013) Co-clustering. Wiley, Hoboken · Zbl 1416.62309 · doi:10.1002/9781118649480
[17] Hartigan JA (1972) Direct clustering of a data matrix. J Am Stat Assoc 67(337):123-129 · doi:10.1080/01621459.1972.10481214
[18] Hoffman MD, Blei DM, Wang C, Paisley J (2013) Stochastic variational inference. J Mach Learn Res 14(1):1303-1347 · Zbl 1317.68163
[19] Keribin C, Brault V, Celeux G, Govaert G (2015) Estimation and selection for the latent block model on categorical data. Stat Comput 25(6):1201-1216 · Zbl 1331.62149 · doi:10.1007/s11222-014-9472-2
[20] Law MHC, Figueiredo MAT, Jain AK (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell 26:1154-1166 · doi:10.1109/TPAMI.2004.71
[21] Li M, Zhang L (2008) Multinomial mixture model with feature selection for text clustering. Knowl Based Syst 21(7):704-708 · doi:10.1016/j.knosys.2008.03.025
[22] Maugis C, Celeux G, Martin-Magniette ML (2009) Variable selection for clustering with Gaussian mixture models. Biometrics 65(3):701-709 · Zbl 1172.62021 · doi:10.1111/j.1541-0420.2008.01160.x
[23] Mirkin BG (1996) Mathematical classification and clustering. Nonconvex optimization and its applications. Kluwer Academic Publishers, Dordrecht · Zbl 0874.90198 · doi:10.1007/978-1-4613-0457-9
[24] Pan W, Shen X (2007) Penalized model-based clustering with application to variable selection. J Mach Learn Res 8:1145-1164 · Zbl 1222.68279
[25] Patrikainen A, Meila M (2006) Comparing subspace clusterings. IEEE Trans Knowl Data Eng 18(7):902-916 · doi:10.1109/TKDE.2006.106
[26] Raftery AE, Dean N (2006) Variable selection for model-based clustering. J Am Stat Assoc 101:168-178 · Zbl 1118.62339 · doi:10.1198/016214506000000113
[27] Robert V, Vasseur Y (2017) Comparing high dimensional partitions, with the co-clustering adjusted Rand index. arXiv preprint arXiv:1705.06760
[28] Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW (2002) Genetic structure of human populations. Science 298(5602):2381-2385 · doi:10.1126/science.1078311
[29] Wang S, Zhu J (2008) Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 64(2):440-448 · Zbl 1137.62041 · doi:10.1111/j.1541-0420.2007.00922.x
[30] Wang S, Lewis CM, Jakobsson M, Ramachandran S, Ray N, Bedoya G, Rojas W, Parra MV, Molina JA, Gallo C, Mazzotti G, Poletti G, Hill K, Hurtado AM, Labuda D, Klitz W, Barrantes R, Bortolini MC, Salzano FM, Petzl-Erler ML, Tsuneto LT, Llop E, Rothhammer F, Excoffier L, Feldman MW, Rosenberg NA, Ruiz-Linares A (2007) Genetic variation and population structure in Native Americans. PLoS Genet 3(11):e185 · doi:10.1371/journal.pgen.0030185
[31] Wang X, Kabán A (2005) Finding uninformative features in binary data. Intell Data Eng Autom Learn IDEAL 2005:40-47
[32] Wyse J, Friel N (2012) Block clustering with collapsed latent block models. Stat Comput 22(2):415-428 · Zbl 1322.62046 · doi:10.1007/s11222-011-9233-4
[33] Wyse J, Friel N, Latouche P (2017) Inferring structure in bipartite networks using the latent blockmodel and exact ICL. Netw Sci 5(1):45-69 · doi:10.1017/nws.2016.25
[34] Zhou H, Pan W, Shen X (2009) Penalized model-based clustering with unconstrained covariance matrices. Electron J Stat 3:1473-1496 · Zbl 1326.62143 · doi:10.1214/09-EJS487