
An initialization method based on hybrid distance for \(k\)-means algorithm. (English) Zbl 1418.62260

Summary: The traditional \(k\)-means algorithm has been widely used as a simple and efficient clustering method. However, its performance depends strongly on the selection of the initial cluster centers, so the method used to choose them is extremely important. In this letter, we redefine the density of a point in terms of the number of its neighbors and the distances between the point and those neighbors. In addition, we define a new distance measure that considers both the Euclidean distance and the density. Based on these, we propose an algorithm for selecting initial cluster centers that can dynamically adjust the weighting parameter. Furthermore, we propose a new internal clustering validation measure, the clustering validation index based on neighbors (CVN), which can be exploited to select the optimal result among multiple clustering results. Experimental results show that the proposed algorithm outperforms existing initialization methods on real-world data sets and demonstrate its adaptability to data sets with various characteristics.
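This summary does not reproduce the paper's formulas, so the following Python sketch is only an illustrative assumption of the general idea described above: a neighbor-based density estimate, a hybrid score that mixes Euclidean separation from already chosen centers with density through a weighting parameter, and a greedy selection of the \(k\) initial centers. Every name and parameter in it (neighbor_density, choose_initial_centers, n_neighbors, lam) is hypothetical rather than the authors' definition, and both the dynamic adjustment of the weight and the CVN validity index are omitted.

# Illustrative sketch of density-aware seeding for k-means (assumed forms,
# not the paper's exact definitions).
import numpy as np

def neighbor_density(X, n_neighbors=10):
    """Assumed density of each point: the inverse of the mean distance to its
    n_neighbors nearest neighbors, so tightly packed points score high."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # pairwise Euclidean distances
    nn = np.sort(d, axis=1)[:, 1:n_neighbors + 1]                # drop the self-distance 0
    return 1.0 / (nn.mean(axis=1) + 1e-12)

def choose_initial_centers(X, k, lam=0.5, n_neighbors=10):
    """Greedily pick k seeds: start at the densest point, then repeatedly take the
    point maximizing a hybrid score of distance-to-chosen-centers and density,
    weighted by lam (the paper adjusts this weight dynamically; fixed here)."""
    rho = neighbor_density(X, n_neighbors)
    rho = (rho - rho.min()) / (np.ptp(rho) + 1e-12)              # normalize density to [0, 1]
    centers = [int(np.argmax(rho))]                              # seed 1: the densest point
    for _ in range(1, k):
        d = np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=-1).min(axis=1)
        score = lam * d / (d.max() + 1e-12) + (1.0 - lam) * rho  # hybrid distance/density score
        score[centers] = -np.inf                                 # never reselect a chosen seed
        centers.append(int(np.argmax(score)))
    return X[centers]

# Example: three well-separated Gaussian blobs; the seeds land near the blob cores.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 3.0, 6.0)])
print(choose_initial_centers(X, k=3))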

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
54E35 Metric spaces, metrizability
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

k-means++
