
Modal clustering asymptotics with applications to bandwidth selection. (English) Zbl 1477.62117

Summary: Density-based clustering relies on the idea of linking groups to specific features of the probability distribution underlying the data. The reference to a true, yet unknown, population structure allows the clustering problem to be framed in a standard inferential setting, where the ideal population clustering is defined as the partition induced by the true density function. The nonparametric formulation of this approach, known as modal clustering, draws a correspondence between the groups and the domains of attraction of the density modes. Operationally, a nonparametric density estimate is required, and a proper selection of the amount of smoothing, which governs the shape of the density and hence possibly its modal structure, is crucial to identify the final partition. In this work, we address the issue of density estimation for modal clustering from an asymptotic perspective. A natural and easy-to-interpret metric for measuring the distance between density-based partitions is discussed, its asymptotic approximation is explored, and that approximation is employed to study the problem of bandwidth selection for nonparametric modal clustering.
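The dependence of the partition on the amount of smoothing can be illustrated with a minimal sketch. It is not taken from the paper: it assumes a one-dimensional sample, a Gaussian kernel, and the mean shift iteration as the way of assigning each point to a density mode. The names `mean_shift_1d` and `modal_clusters`, the tolerances, and the simulated sample are all hypothetical choices made for illustration only.

```python
import math
import random


def mean_shift_1d(data, h, tol=1e-6, max_iter=500):
    """Move every point uphill on a Gaussian kernel density estimate.

    h is the bandwidth (amount of smoothing). Returns the limit point
    reached from each observation.
    """
    limits = []
    for x in data:
        for _ in range(max_iter):
            # Gaussian kernel weights of all observations seen from x
            weights = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data]
            new_x = sum(w * xi for w, xi in zip(weights, data)) / sum(weights)
            moved = abs(new_x - x)
            x = new_x
            if moved < tol:
                break
        limits.append(x)
    return limits


def modal_clusters(data, h, merge_tol=1e-2):
    """Group observations whose mean shift limits coincide (up to merge_tol).

    Returns (labels, modes): one label per observation, and the list of
    distinct limit points found, which estimate the density modes.
    """
    modes, labels = [], []
    for m in mean_shift_1d(data, h):
        for k, centre in enumerate(modes):
            if abs(m - centre) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return labels, modes


# Simulated sample (an assumption of this sketch): two well separated groups.
random.seed(0)
sample = ([random.gauss(-2.0, 0.3) for _ in range(40)]
          + [random.gauss(2.0, 0.3) for _ in range(40)])

for h in (0.15, 0.6, 3.0):
    labels, modes = modal_clusters(sample, h)
    print(f"bandwidth {h}: {len(modes)} cluster(s)")
```

In this sketch a moderate bandwidth separates the two simulated groups, while a very large one oversmooths the estimate into a single mode and hence a single cluster. This is the sense in which the choice of bandwidth determines the final partition.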

MSC:

62G20 Asymptotic properties of nonparametric inference
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G07 Density estimation
