×

An advancement in clustering via nonparametric density estimation. (English) Zbl 1322.62175

Summary: Density-based clustering methods hinge on the idea of associating groups to the connected components of the level sets of the density underlying the data, to be estimated by a nonparametric method. These methods claim some desirable properties and generally good performance, but they involve a non-trivial computational effort, required for the identification of the connected regions. In a previous work, the use of spatial tessellation such as the Delaunay triangulation has been proposed, because it suitably generalizes the univariate procedure for detecting the connected components. However, its computational complexity grows exponentially with the dimensionality of data, thus making the triangulation unfeasible for high dimensions. Our aim is to overcome the limitations of Delaunay triangulation. We discuss the use of an alternative procedure for identifying the connected regions associated to the level sets of the density. By measuring the extent of possible valleys of the density along the segment connecting pairs of observations, the proposed procedure shifts the formulation from a space with arbitrary dimension to a univariate one, thus leading benefits both in computation and visualization.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G07 Density estimation
60D05 Geometric probability and stochastic geometry
52B55 Computational aspects related to convexity

Software:

mclust; pdfCluster; R; mixsmsn
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Abramson, I.S.: On bandwidth variation in kernel estimates—a square root law. Ann. Stat. 10, 1217-1223 (1982) · Zbl 0507.62040 · doi:10.1214/aos/1176345986
[2] Azzalini, A., Torelli, N.: Clustering via nonparametric density estimation. Stat. Comput. 17, 71-80 (2007) · doi:10.1007/s11222-006-9010-y
[3] Azzalini, A., Menardi, G., Rosolin, T.: R package pdfCluster: cluster analysis via nonparametric density estimation (version 1.0-0) (2012). http://cran.r-project.org/package=pdfCluster · Zbl 1055.62075
[4] Burman, P., Polonik, W.: Multivariate mode hunting: data analytic tools with measures of significance. J. Multivar. Anal. 100, 1198-1218 (2009) · Zbl 1159.62032 · doi:10.1016/j.jmva.2008.10.015
[5] Cuevas, A., Febrero, M., Fraiman, R.: Estimating the number of clusters. Can. J. Stat. 28, 367-382 (2000) · Zbl 0981.62054 · doi:10.2307/3315985
[6] Cuevas, A., Febrero, M., Fraiman, R.: Cluster analysis: a further approach based on density estimation. Comput. Stat. Data Anal. 36, 441-459 (2001) · Zbl 1053.62537 · doi:10.1016/S0167-9473(00)00052-9
[7] Dazard, J.E., Rao, J.S.: Local sparse bump hunting. J. Comput. Graph. Stat. 19, 900-929 (2010) · doi:10.1198/jcgs.2010.09029
[8] De la Cruz, R.: Bayesian non-linear regression models with skew-elliptical errors: applications to the classification of longitudinal profiles. Comput. Stat. Data Anal. 53, 436-449 (2008) · Zbl 1231.62125 · doi:10.1016/j.csda.2008.08.027
[9] Du, Q., Faber, V., Gunzburger, M.: Centroidal Voronoi tessellations: applications and algorithms. SIAM Rev. 41, 637-676 (1999) · Zbl 0983.65021 · doi:10.1137/S0036144599352836
[10] Even, S.: Graph Algorithms, 2nd edn. Cambridge University Press, New York (2011) · Zbl 1237.05199 · doi:10.1017/CBO9781139015165
[11] Forina, M.; Armanino, C.; Lanteri, S.; Tiscornia, E., Classication of olive oils from their fatty acid composition, 189-214 (1983), London
[12] Forina, M., Armanino, C., Castino, M., Ubigli, M.: Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189-201 (1986)
[13] Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97, 611-631 (2002) · Zbl 1073.62545 · doi:10.1198/016214502760047131
[14] Fraley, C., Raftery, A.E.: MCLUST vers. 3 for R: normal mixture modeling and model-based clustering. Tech. Rep. 504, Univ. Washington, Dep. Stat. (2006), rev. 2009 · Zbl 1073.62545
[15] Friedman, J.H.: On bias, variance, 0-1 loss, and the curse of dimensionality. Data Min. Knowl. Discov. 1, 55-77 (1997) · doi:10.1023/A:1009778005914
[16] Gower, J.C., Ross, G.J.S.: Minimum spanning trees and single linkage cluster analysis. J. R. Stat. Soc., Ser. C, Appl. Stat. 18, 54-64 (1969)
[17] Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975) · Zbl 0372.62040
[18] Kriegel, H., Kröger, P., Sander, J., Zimek, A.: Density-based clustering. Data Min. Knowl. Discov. 1, 231-240 (2011) · doi:10.1002/widm.30
[19] Krznaric, D., Levcopoulos, C.: Fast algorithms for complete linkage clustering. Discrete Comput. Geom. 19, 131-145 (1998) · Zbl 0892.68103 · doi:10.1007/PL00009332
[20] Lubischew, A.A.: On the use of discriminant analysis in taxonomy. Biometrics 18, 455-477 (1962) · Zbl 0112.11602 · doi:10.2307/2527894
[21] Menardi, G., Torelli, N.: Reducing data dimension for cluster detection. J. Stat. Comput. Simul. (2012). doi:10.1080/00949655.2012.679032 · doi:10.1080/00949655.2012.679032
[22] Minnotte, M.C.: Nonparametric testing for the existence of modes. Ann. Stat. 25, 1646-1660 (1997) · Zbl 0936.62056 · doi:10.1214/aos/1031594735
[23] Müller, D.W., Sawitzki, G.: Excess mass estimates and tests for multimodality. J. Am. Stat. Assoc. 86, 738-746 (1991) · Zbl 0733.62040
[24] Prates, M., Lachos, V., Cabral, C.: R package mixsmsn: fitting finite mixture of scale mixture of skew-normal distributions (version 1.0-3) (2012). http://cran.r-project.org/web/packages/mixsmsn/index.html · Zbl 0979.62046
[25] Ooi, H.: Density visualization and mode hunting using trees. J. Comput. Graph. Stat. 11, 328-347 (2002) · doi:10.1198/106186002760180545
[26] R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2013). http://www.R-project.org/. ISBN 3-900051-07-3-900051-0 · Zbl 0507.62040
[27] Rinaldo, A., Wasserman, L.: Generalized density clustering. Ann. Stat. 38, 2678-2722 (2010) · Zbl 1200.62066 · doi:10.1214/10-AOS797
[28] Sahu, S.K., Dey, D.K., Branco, M.D.: A new class of multivariate skew distributions with applications to Bayesian regression models. Can. J. Stat. 31, 129-150 (2003) · Zbl 1039.62047 · doi:10.2307/3316064
[29] Scott, D.W., Sain, S.: Multidimensional Density Estimation. Handbook of Statistics, vol. 24, pp. 229-261 (2005)
[30] Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York (1986) · Zbl 0617.62042 · doi:10.1007/978-1-4899-3324-9
[31] Stuetzle, W.: Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classif. 20, 25-47 (2003) · Zbl 1055.62075 · doi:10.1007/s00357-003-0004-6
[32] Stuetzle, W., Nugent, R.: A generalized single linkage method for estimating the cluster tree of a density. J. Comput. Graph. Stat. 19, 397-418 (2010) · doi:10.1198/jcgs.2009.07049
[33] Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a dataset via the GAP statistic. J. R. Stat. Soc., Ser. B, Stat. Methodol. 63, 411-423 (2000) · Zbl 0979.62046 · doi:10.1111/1467-9868.00293
[34] Wishart, D.; Cole, A. J. (ed.), Mode analysis: a generalization of nearest neighbor which reduces chaining effects, 282-308 (1969), London
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.