×

Is-ClusterMPP: clustering algorithm through point processes and influence space towards high-dimensional data. (English) Zbl 1474.62230

Summary: Clustering via marked point processes and influence space, Is-ClusterMPP, is a new unsupervised clustering algorithm through adaptive MCMC sampling of a marked point processes of interacting balls. The designed Gibbs energy cost function makes use of \(k\)-influence space information. It detects clusters of different shapes, sizes and unbalanced local densities. It aims at dealing also with high-dimensional datasets. By using the \(k\)-influence space, Is-ClusterMPP solves the problem of local heterogeneity in densities and prevents the impact of the global density in the detection of unbalanced classes. This concept reduces also the input values amount. The curse of dimensionality is handled by using a local subspace clustering principal embedded in a weighted similarity metric. Balls covering data points are constituting a configuration sampled from a marked point process (MPP). Due to the choice of the energy function, they tends to cover neighboring data, which share the same cluster. The statistical model of random balls is sampled through a Monte Carlo Markovian dynamical approach. The energy is balancing different goals. (1) The data driven objective function is provided according to \(k\)-influence space. Data in a high-dense region are favored to be covered by a ball. (2) An interaction part in the energy prevents the balls full overlap phenomenon and favors connected groups of balls. The algorithm through Markov dynamics, does converge towards configurations sampled from the MPP model. This algorithm has been applied in real benchmarks through gene expression data set of various sizes. Different experiments have been done to compare Is-ClusterMPP against the most well-known clustering algorithms and its efficiency is claimed.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H11 Directional data; spatial statistics
60G55 Point processes (e.g., Poisson, Cox, Hawkes processes)
65C05 Monte Carlo methods
PDF BibTeX XML Cite
Full Text: DOI

References:

[1] Adovanovic, M.; Nanopoulos, A.; Ivanovic, M., Hubs in space: popular nearest neighbors in high-dimensional data, J Mach Learn Res, 11, 2487-2531 (2010) · Zbl 1242.62056
[2] Alata, O.; Burg, S.; Dupas, A., Grouping/degrouping point process, a point process driven by geometrical and topological properties of a partition in regions, Comput Vis Image Underst, 115, 9, 1324-1339 (2011)
[3] Baddeley A, Rubak E, Turner R (2015) Spatial point patterns: methodology and applications with R. Chapman and Hall/CRC. ISBN 9781482210200
[4] Bar-Hen, A.; Emily, M.; Picard, N., Spatial cluster detection using nearest neighbor distance, Spat Stat, 14, Part C, 400-411 (2015)
[5] Bhagat, PM; Halgaonkar, PS; Wadhai, VM, Review of clustering algorithm for categorical data, Int J Eng Adv Technol, 32, 2249-8958 (2013)
[6] Bhole, P.; Agrawal, AJ, Extractive based single document text summarization using clustering approach, IAES Int J Artif Intell, 3, 2, 73-78 (2014)
[7] Bisheng, Y.; Wenxue, X.; Zhen, D., Automated extraction of building outlines from airborne laser scanning point clouds, Geosci Remote Sens Lett IEEE, 10, 6, 1399-1403 (2013)
[8] Bohm C, Railing K, Kriegel H-P, Kroger P (2004) Density connected clustering with local subspace preferences. In: Fourth IEEE international conference on data mining (ICDM’04), Icdm 04, pP 27-34. ISBN 0-7695-2142-8
[9] Bouveyron, C.; Girard, S.; Schmid, C., High-dimensional data clustering, Comput Stat Data Anal, 52, 1, 502-519 (2007) · Zbl 1452.62433
[10] Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 7819 LNAI, pp 160-172. ISBN 9783642374555
[11] Cassisi, C.; Ferro, A.; Giugno, R.; Pigola, G.; Pulvirenti, A., Enhancing density-based clustering: parameter reduction and outlier detection, Inf Syst, 38, 3, 317-330 (2013)
[12] Chin, YC; Baddeley, AJ, On connected component Markov point processes, Adv Appl Probab, 31, 2, 279-282 (1999) · Zbl 0953.60033
[13] Chiu SN, Stoyan D, Kendall WS, Mecke J (2013) Stochastic geometry and its applications. Wiley, Lodon. ISBN 978-0-470-66481-0 · Zbl 1291.60005
[14] Datar M, Immorlica N, Indyk P (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: SCG’04, June 9-11, 2004, Brooklyn, New York, USA., pp 253-262. ISBN 1581138857 · Zbl 1373.68193
[15] Descombes, X.; Zerubia, J., Marked point processes in image analysis, IEEE Signal Process Mag, 19, 5, 77-84 (2002)
[16] Descombes, X.; Zhizhina, E., Applications of Gibbs fields methods to image processing problems, Probl Inf Transm, 40, 3, 279-295 (2004) · Zbl 1088.94503
[17] Descombes, X.; Stoica, R.; Garcin, L.; Zerubia, J., A RJMCMC algorithm for object processes in image processing, Monte Carlo Methods Appl, 7, 1-2, 149-156 (2001) · Zbl 0977.68093
[18] Descombes, X.; Minlos, R.; Zhizhina, E., Object extraction using a Stochastic birth-and-death dynamics in continuum, J Math Imaging Vis, 33, 3, 347-359 (2009)
[19] El Sonbaty Y, Ismail MA, Farouk M (2004) An efficient density based clustering algorithm for large databases. In: 16th IEEE international conference on tools with artificial intelligence, Ictai, pp 637-677. ISBN 0-7695-2236-X
[20] Elavarasi, SA; Akilandeswari, J.; Sathiyabhama, B., A survey on partition clustering algorithms, Learning, 1, 1, 1-14 (2011)
[21] Elbatta, MTH; Ashour, WM, A dynamic method for discovering density varied clusters, Int J Signal Process Image Process Pattern Recognit, 6, 1, 123-134 (2013)
[22] Elgendy, N.; Elragal, A., Big data analytics: a literature review paper, Adv Data Min Appl Theor Asp, 8557, 214-227 (2014)
[23] Everitt, BS; Landau, S.; Leese, M.; Stahl, D., Cluster Analysis (2011), London: Wiley, London · Zbl 1274.62003
[24] Fahad, A.; Alshatri, N.; Tari, Z.; Alamri, A.; Khalil, I.; Zomaya, A.; Foufou, S.; Bouras, A., A survey of clustering algorithms for big data: taxonomy and empirical analysis, IEEE Trans Emerg Top Comput, 2, 3, 267-279 (2014)
[25] Fahim, A.; Salem, A., Density clustering based on radius of data (DCBRD), Informatika, 16, 344-349 (2006)
[26] Flexer, A.; Schnitzer, D., Choosing \(L^p\) norms in high-dimensional spaces based on hub analysis, Neurocomputing, 169, 281-287 (2015)
[27] Fraley, C.; Raftery, AE, Model-based clustering, discriminant analysis, and density estimation, J Am Stat Assoc, 97, 458, 611-631 (2002) · Zbl 1073.62545
[28] Gaonkar, MN; Sawant, K., Auto Eps DBSCAN: DBSCAN with Eps automatic for large dataset, Int J Adv Comput Theory Eng, 2, 2, 2319-2526 (2013)
[29] Geyer CJ (2003) The Metropolis-Hastings-Green algorithm. http://www.stat.umn.edu/geyer/f05/8931/n1998.pdf
[30] Green, PJ, Reversible jump Markov chain monte carlo computation and Bayesian model determination, Biometrika, 82, 4, 711-732 (1995) · Zbl 0861.62023
[31] Guibas, L.; Morozov, D.; Mérigot, Q., Witnessed k-distance, Discrete Comput Geom, 49, 1, 22-45 (2013) · Zbl 1272.68414
[32] Gul, A.; Perperoglou, A.; Khan, Z.; Mahmoud, O.; Miftahuddin, M.; Adler, W.; Lausen, B., Ensemble of a subset of knn classifiers, Adv Data Anal Classif, 12, 4, 827-840 (2018) · Zbl 1416.62338
[33] Henni K, Alata O, Alidrissi A, Vannier B, Zaoui L, Moussa A (2017a) Marked point processes for MicroArray data clustering. In: Palumbo F, Montanari A, Vichi M (eds) Data science—innovative developments in data analysis and clustering: studies in classification, data analysis and knowledge organization, pp 125-137. Springer, Berlin. ISBN 978-3-319-55723-6
[34] Henni, K.; Alata, O.; Zaoui, L.; Vannier, B.; Alidrissi, A.; Moussa, A., ClusterMPP: an unsupervised density-based clustering algorithm via marked point process, Intell Data Anal, 21, 4, 827-847 (2017)
[35] Ilango, M.; Mohan, V., A survey of grid based clustering algorithms, Int J Eng Sci Technol, 2, 8, 3441-3446 (2010)
[36] Kelly, FP; Ripley, BD, A note on Strauss’s model for clustering, Biometrika, 63, 63, 357-360 (1976) · Zbl 0332.60034
[37] Lv, Y.; Ma, T.; Tang, M.; Cao, J.; Tian, Y., An efficient and scalable density-based clustering algorithm for datasets with complex structures, Neurocomputing, 171, 9-22 (2016)
[38] Martin E, Hans-Peter K, Jörg S, Xiaowei X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Second international conference on knowledge discovery and data mining, pp 226-231. AAAI Press
[39] Møller J, Waagepetersen RP (2004) Statistical inference and simulation for spatial point processes. Chapman and Hall/CRC, London. ISBN 9781584882657 · Zbl 1044.62101
[40] Montanari, A.; Calò, DG, Model-based clustering of probability density functions, Adv Data Anal Classif, 7, 3, 301-319 (2013) · Zbl 1273.62140
[41] Moussa, A.; Sbihi, A.; Postaire, JG, A Markov random field model for mode detection in cluster analysis, Pattern Recognit Lett, 29, 9, 1197-1207 (2008)
[42] Murtagh, F.; Contreras, P., Algorithms for hierarchical clustering: an overview, Wiley Interdiscip Rev Data Min Knowl Discov, 2, 1, 86-97 (2012)
[43] Ng, RT; Han, J., CLARANS: a method for clustering objects for spatial data mining, IEEE Trans Knowl Data Eng, 14, 5, 1003-1016 (2002)
[44] Ortner, M.; Descombes, X.; Zerubia, J., Building outline extraction from Digital Elevation Models using marked point processes, Int J Comput Vis, 72, 107-132 (2007)
[45] Peng L, Dong Z, Naijun W (2007) VDBSCAN: varied density based spatial clustering of applications with noise. In: Proceedings—ICSSSM’07: 2007 international conference on service systems and service management. ISBN 1424408857
[46] Ram, A.; Jalal, S.; Jalal, AS; Kumar, M., DVBSCAN: a density based algorithm for discovering density varied clusters in large spatial databases, Int J Comput Appl, 3, 6, 1-4 (2010)
[47] Rehman SU, Asghar S, Fong S, Sarasvady S (2014) DBSCAN: past, present and future. In: The fifth international conference on the applications of digital information and web technologies (ICADIWT 2014), pp 232-238
[48] Rodriguez, A.; Laio, A., Clustering by fast search and find of density peaks—PPT, Science, 344, 6191, 1492-1496 (2014)
[49] Samé, A.; Ambroise, C.; Govaert, G., An online classification EM algorithm based on the mixture model, Stat Comput, 17, 3, 209-218 (2007)
[50] Stoica RS (2014) Modélisation probabiliste et inférence statistique pour l’analyse des données spatialisées. Research Habilitation Thesis, Université Lille 1
[51] Stoica, RS; Descombes, X.; Zerubia, J., A Gibbs point process for road extraction from remotely sensed images, Int J Comput Vis, 57, 2, 121-136 (2004)
[52] Stoica, RS; Gay, E.; Kretzschmar, A., Cluster pattern detection in spatial data based on Monte Carlo inference, Biom J, 49, 4, 505-519 (2007) · Zbl 1442.62637
[53] Stoica, RS; Martinez, VJ; Saar, E., Filaments in observed and mock galaxy catalogues, Astron Astrophys, 510, 1-12 (2010)
[54] Sun S, Huang R (2010) An adaptive k-nearest neighbor algorithm. In: Proceedings—2010 7th international conference on fuzzy systems and knowledge discovery, FSKD 2010, vol 1, pp 91-94. ISBN 9781424459346
[55] Uncu O, Gruver WA, Kotak DB, Sabaz D, Alibhai Z, Ng C (2007) GRIDBSCAN: GRId density-based spatial clustering of applications with noise. In: Conference Proceedings—IEEE International conference on systems, man and cybernetics, vol 4, pp 2976-2981. ISBN 1424401003
[56] van Lieshout, MNM, Markov point processes and their applications (2000), London: Imperial College Press, London
[57] van Lieshout, MNM; Stoica, RS, Perfect simulation for marked point processes, Comput Stat Data Anal, 51, 679-698 (2006) · Zbl 1157.65311
[58] Xiaopeng Y, Deyi Z, Yan Z (2005) A new clustering algorithm based on distance and density. In: International conference on services systems and services management, vol 2
[59] Zhang Q, Yang LT, Chen Z, Xia F (2015) A high-order possibilistic-means algorithm for clustering incomplete multimedia data. IEEE Syst J PP(99):1-10
[60] Zhu, X.; Melnykov, V., Probabilistic assessment of model-based clustering, Adv Data Anal Classif, 9, 4, 395-422 (2015) · Zbl 1414.62290
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.