Clustering with missing features: a penalized dissimilarity measure based approach. (English) Zbl 1480.62121

Summary: Many real-world clustering problems are plagued by incomplete data, characterized by missing or absent features for some or all of the data instances. Traditional clustering methods cannot be directly applied to such data without preprocessing by imputation or marginalization techniques. In this article, we overcome this drawback by utilizing a penalized dissimilarity measure, which we refer to as the feature weighted penalty based dissimilarity (FWPD). Using the FWPD measure, we modify the traditional \(k\)-means clustering algorithm and the standard hierarchical agglomerative clustering algorithms so as to make them directly applicable to datasets with missing features. We present time complexity analyses for these new techniques and also undertake a detailed theoretical analysis showing that the new FWPD-based \(k\)-means algorithm converges to a local optimum within a finite number of iterations. We also present a detailed method for simulating random as well as feature-dependent missingness. We report extensive experiments on various benchmark datasets for different types of missingness, showing that the proposed clustering techniques generally outperform some of the most well-known imputation methods commonly used to handle such incomplete data. We append a possible extension of the proposed dissimilarity measure to the case of absent features (where the unobserved features are known to be undefined).
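
The following Python sketch illustrates the general idea described in the summary; it is an approximation assembled from that description, not the authors' exact formulation. Missing entries are encoded as NaN, each feature is weighted by how often it is observed, the dissimilarity mixes a distance over commonly observed features with a penalty on features observed in only one of the two points, and a Lloyd-style \(k\)-means loop uses this measure for assignment. The names fwpd and fwpd_kmeans, the mixing parameter alpha, and the normalization by the largest observed pairwise distance are illustrative assumptions.

```python
import numpy as np

def fwpd(x, y, weights, alpha, d_max):
    # Observed-feature masks; missing entries are encoded as NaN.
    obs_x, obs_y = ~np.isnan(x), ~np.isnan(y)
    common = obs_x & obs_y
    # Distance term: Euclidean distance over the commonly observed features.
    d = np.linalg.norm(x[common] - y[common]) if common.any() else 0.0
    # Penalty term: total weight of features observed in exactly one of the
    # two vectors, normalized by the total feature weight.
    penalized = (obs_x | obs_y) & ~common
    p = weights[penalized].sum() / weights.sum()
    return (1.0 - alpha) * d / d_max + alpha * p

def fwpd_kmeans(X, k, alpha=0.25, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Weight each feature by how often it is observed across the dataset.
    weights = (~np.isnan(X)).sum(axis=0).astype(float)
    # Scale for the distance term: largest pairwise distance over shared
    # features (differences involving a missing value are treated as zero).
    d_max = max(
        np.linalg.norm(np.nan_to_num(X[i] - X[j]))
        for i in range(n) for j in range(i + 1, n)
    ) or 1.0
    # Initial centroids are drawn from the (possibly incomplete) data points.
    centers = X[rng.choice(n, size=k, replace=False)]
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centroid under the FWPD-style measure.
        for i in range(n):
            labels[i] = int(np.argmin(
                [fwpd(X[i], c, weights, alpha, d_max) for c in centers]
            ))
        # Update step: feature-wise means over the observed values only; a
        # column that is entirely missing within a cluster stays NaN and is
        # simply treated as unobserved by the masks above.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centers[j] = np.nanmean(members, axis=0)
        if np.allclose(np.nan_to_num(new_centers), np.nan_to_num(centers)):
            break
        centers = new_centers
    return labels, centers
```

A hierarchical agglomerative variant could be obtained in the same spirit by precomputing the pairwise FWPD matrix once and feeding it to a standard linkage routine.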

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence

Software:

UCI-ml; impute
Full Text: DOI arXiv

References:

[1] Acuña, E., & Rodriguez, C. (2004). The treatment of missing values and its effect on classifier accuracy. In D. Banks, F. R. McMorris, P. Arabie, & W. Gaul (Eds.), pp. 639-647. Berlin, Heidelberg. · doi:10.1007/978-3-642-17103-1_60
[2] Ahmad, S., & Tresp, V. (1993). Some solutions to the missing feature problem in vision. In S. Hanson, J. Cowan, & C. Giles (Eds.), pp. 393-400. Los Altos, CA.
[3] Barceló, C. (2008). The impact of alternative imputation methods on the measurement of income and wealth: Evidence from the Spanish Survey of Household Finances. Working paper series, Banco de España.
[4] Bo, T. H., Dysvik, B., & Jonassen, I. (2004). LSimpute: Accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Research, 32(3). · doi:10.1093/nar/gnh026
[5] Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8-13), 1157-1166. · doi:10.1016/S0169-7552(97)00031-7
[6] Chan, L. S., & Dunn, O. J. (1972). The treatment of missing values in discriminant analysis: I. The sampling experiment. Journal of the American Statistical Association, 67(338), 473-477. · Zbl 0238.62066
[7] Chaturvedi, A., Carroll, J. D., Green, P. E., & Rotondo, J. A. (1997). A feature-based approach to market segmentation via overlapping k-centroids clustering. Journal of Marketing Research, pp. 370-377. · doi:10.2307/3151899
[8] Chechik, G., Heitz, G., Elidan, G., Abbeel, P., & Koller, D. (2008). Max-margin classification of data with absent features. Journal of Machine Learning Research, 9, 1-21. · Zbl 1225.68160
[9] Chen, F. (2013). Missing no more: Using the MCMC procedure to model missing data. In Proceedings of the SAS global forum 2013 conference, pp. 1-23. SAS Institute Inc.
[10] Datta, S., Bhattacharjee, S., & Das, S. (2016a). Clustering with missing features: A penalized dissimilarity measure based approach. CoRR, arXiv:1604.06602. · Zbl 1480.62121
[11] Datta, S., Misra, D., & Das, S. (2016b). A feature weighted penalty based dissimilarity measure for k-nearest neighbor classification with missing features. Pattern Recognition Letters, 80, 231-237. · doi:10.1016/j.patrec.2016.06.023
[12] Dempster, A. P., & Rubin, D. B. (1983). Part I: Introduction. In Incomplete data in sample surveys (Vol. 2, pp. 3-10). New York: Academic Press.
[13] Dheeru, D., & Taniskidou, E. K. (2017). UCI machine learning repository. Online repository at http://archive.ics.uci.edu/ml.
[14] Dixon, J. K. (1979). Pattern recognition with partly missing data. IEEE Transactions on Systems, Man and Cybernetics, 9(10), 617-621. · doi:10.1109/TSMC.1979.4310090
[15] Donders, A. R. T., van der Heijden, G. J. M. G., Stijnen, T., & Moons, K. G. M. (2006). Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 59(10), 1087-1091. · doi:10.1016/j.jclinepi.2006.01.014
[16] Forgy, E. W. (1965). Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics, 21, 768-769.
[17] Grzymala-Busse, J. W., & Hu, M. (2001). A comparison of several approaches to missing attribute values in data mining. pp. 378-385. Berlin, Heidelberg. · Zbl 1014.68558 · doi:10.1007/3-540-45554-X_46
[18] Hathaway, R. J., & Bezdek, J. C. (2001). Fuzzy c-means clustering of incomplete data. IEEE Transactions on Systems, Man, and Cybernetics: Part B: Cybernetics, 31(5), 735-744. · doi:10.1109/3477.956035
[19] Haveliwala, T., Gionis, A., & Indyk, P. (2000). Scalable techniques for clustering the web. Technical report, Stanford University.
[20] Heitjan, D. F., & Basu, S. (1996). Distinguishing “missing at random” and “missing completely at random”. The American Statistician, 50(3), 207-213.
[21] Himmelspach, L., & Conrad, S. (2010). Clustering approaches for data with missing values: Comparison and evaluation. In 2010 fifth international conference on digital information management (ICDIM), pp. 19-28.
[22] Horton, N. J., & Lipsitz, S. R. (2001). Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician, 55(3), 244-254. · doi:10.1198/000313001317098266
[23] Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193-218. · doi:10.1007/BF01908075
[24] Jin, J. (2017). Genomics dataset repository. Online repository at http://www.stat.cmu.edu/~jiashun/Research/software/GenomicsData/.
[25] Juszczak, P., & Duin, R. P. W. (2004). Combining one-class classifiers to classify missing data. pp. 92-101. Berlin, Heidelberg. · doi:10.1007/978-3-540-25966-4_9
[26] Krause, S., & Polikar, R. (2003). An ensemble of classifiers approach for the missing feature problem. In Proceedings of the international joint conference on neural networks, vol. 1, pp. 553-558. IEEE.
[27] Lasdon, L. S. (2013). Optimization theory for large systems. Courier Corporation. · Zbl 0224.90038
[28] Lei, L. (2010). Identify earthquake hot spots with 3-dimensional density-based clustering analysis. In 2010 IEEE international geoscience and remote sensing symposium (IGARSS), pp. 530-533. IEEE.
[29] Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley. · Zbl 0665.62004
[30] Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137. · Zbl 0504.94015 · doi:10.1109/TIT.1982.1056489
[31] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1, pp. 281-297. University of California Press. · Zbl 0214.46201
[32] Marlin, B. M. (2008). Missing data problems in machine learning. Ph.D. thesis, University of Toronto.
[33] Millán-Giraldo, M., Duin, R. P., & Sánchez, J. S. (2010). Dissimilarity-based classification of data with missing attributes. In 2010 2nd international workshop on cognitive information processing (CIP), pp. 293-298. IEEE.
[34] Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1), 86-97.
[35] Myrtveit, I., Stensrud, E., & Olsson, U. H. (2001). Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods. IEEE Transactions on Software Engineering, 27(11), 999-1013. · doi:10.1109/32.965340
[36] Nanni, L., Lumini, A., & Brahnam, S. (2012). A classifier ensemble approach for the missing feature problem. Artificial Intelligence in Medicine, 55(1), 37-50. · doi:10.1016/j.artmed.2011.11.006
[37] Porro-Muñoz, D., Duin, R. P. W., & Talavera, I. (2013). Missing values in dissimilarity-based classification of multi-way data. pp. 214-221. Berlin, Heidelberg. · doi:10.1007/978-3-642-41822-8_27
[38] Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581-592. · Zbl 0344.62034 · doi:10.1093/biomet/63.3.581
[39] Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. London: Wiley. · doi:10.1002/9780470316696
[40] Sabau, A. S. (2012). Survey of clustering based financial fraud detection research. Informatica Economica, 16(1), 110.
[41] Schafer, J. L. (1997). Analysis of incomplete multivariate data. Boca Raton, FL: CRC Press. · doi:10.1201/9781439821862
[42] Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147-177. · doi:10.1037/1082-989X.7.2.147
[43] Sehgal, M. S. B., Gondal, I., & Dooley, L. S. (2005). Collateral missing value imputation: A new robust missing value estimation algorithm for microarray data. Bioinformatics, 21(10), 2417-2423. · doi:10.1093/bioinformatics/bti345
[44] Selim, S. Z., & Ismail, M. A. (1984). K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1), 81-87. · Zbl 0546.62037 · doi:10.1109/TPAMI.1984.4767478
[45] Shelly, D. R., Ellsworth, W. L., Ryberg, T., Haberland, C., Fuis, G. S., Murphy, J., et al. (2009). Precise location of San Andreas Fault tremors near Cholame, California using seismometer clusters: Slip on the deep extension of the fault? Geophysical Research Letters, 36(1).
[46] Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520-525. · doi:10.1093/bioinformatics/17.6.520
[47] Wagstaff, K. (2004). Clustering with missing values: No imputation required. pp. 649-658. Berlin, Heidelberg. · doi:10.1007/978-3-642-17103-1_61
[48] Wagstaff, K. L., & Laidler, V. G. (2005). Making the most of missing values: Object clustering with partial data in astronomy. In Astronomical data analysis software and systems XIV, ASP Conference Series, pp. 172-176. Astronomical Society of the Pacific.
[49] Wang, Q., & Rao, J. N. K. (2002a). Empirical likelihood-based inference in linear models with missing data. Scandinavian Journal of Statistics, 29(3), 563-576. · Zbl 1035.62067 · doi:10.1111/1467-9469.00306
[50] Wang, Q., & Rao, J. N. K. (2002b). Empirical likelihood-based inference under imputation for missing response data. The Annals of Statistics, 30(3), 896-924. · Zbl 1029.62040 · doi:10.1214/aos/1028674845
[51] Weatherill, G., & Burton, P. W. (2009). Delineation of shallow seismic source zones using k-means cluster analysis, with application to the Aegean region. Geophysical Journal International, 176(2), 565-588. · doi:10.1111/j.1365-246X.2008.03997.x
[52] Wendel, R. E., & Hurter, A. P, Jr. (1976). Minimization of a non-separable objective function subject to disjoint constraints. Operations Research, 24, 643-657. · Zbl 0347.90044 · doi:10.1287/opre.24.4.643
[53] Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80-83. · doi:10.2307/3001968
[54] Zhang, W., Yang, Y., & Wang, Q. (2012). A comparative study of absent features and unobserved values in software effort data. International Journal of Software Engineering and Knowledge Engineering, 22(02), 185-202. · doi:10.1142/S0218194012400025
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases, these data have been complemented or enhanced with data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or perfect matching.