A review of robust clustering methods. (English) Zbl 1284.62375

Summary: Deviations from theoretical assumptions, together with the presence of a certain amount of outlying observations, are common in many practical statistical applications. This is also the case when applying cluster analysis methods, where such problems can lead to unsatisfactory clustering results. Robust clustering methods aim to avoid these unsatisfactory results. Moreover, there exist certain connections between robust procedures and cluster analysis that make robust clustering an appealing unifying framework. A review of different robust clustering approaches in the literature is presented. Special attention is paid to methods based on trimming, which try to discard the most outlying data when carrying out the clustering process.
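
To make the trimming idea concrete, the following is a minimal sketch of a trimmed k-means style procedure, written for illustration only. It is not taken from the reviewed paper: the function name `trimmed_kmeans`, its parameters (`alpha` for the trimming proportion, `n_starts`, `n_iter`, `seed`) and the use of several random starts are assumptions. At each iteration the `ceil(alpha * n)` observations farthest from their nearest centre are set aside, and the centres are recomputed from the remaining points.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha, n_starts=10, n_iter=100, seed=0):
    """Illustrative trimmed k-means (hypothetical helper, not the paper's code).

    Returns (centers, labels); trimmed observations get label -1.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    n_keep = n - int(np.ceil(alpha * n))   # observations that are not trimmed
    rng = np.random.default_rng(seed)
    best_cost, best_centers = np.inf, None

    for _ in range(n_starts):
        # Random start: k distinct observations serve as initial centres.
        centers = X[rng.choice(n, size=k, replace=False)].copy()
        for _step in range(n_iter):
            # Squared distances from every point to every centre.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Keep the n_keep points closest to their nearest centre.
            keep = np.argsort(d2.min(axis=1))[:n_keep]
            new_centers = centers.copy()
            for j in range(k):
                members = keep[labels[keep] == j]
                if members.size:
                    new_centers[j] = X[members].mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # Trimmed objective: sum of the n_keep smallest squared distances.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        cost = np.sort(d2.min(axis=1))[:n_keep].sum()
        if cost < best_cost:
            best_cost, best_centers = cost, centers

    # Final assignment using the best centres; mark trimmed points with -1.
    d2 = ((X[:, None, :] - best_centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    labels[np.argsort(d2.min(axis=1))[n_keep:]] = -1
    return best_centers, labels
```

Several random starts are used because a single start that happens to pick an outlier as an initial centre can get stuck in a poor local solution; the start with the smallest trimmed objective is retained.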

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G35 Nonparametric robustness
68T05 Learning and adaptive systems in artificial intelligence

Software:

clusfind; Flury

References:

[1] Atkinson AC, Riani M (2007) Exploratory tools for clustering multivariate data. Comput Stat Data Anal 52: 272–285 · Zbl 1452.62028
[2] Atkinson AC, Riani M, Cerioli A (2004) Exploring multivariate data with the forward search. Springer Series in Statistics, Springer, New York · Zbl 1049.62057
[3] Atkinson AC, Riani M, Cerioli A (2006) Random start forward searches with envelopes for detecting clusters in multivariate data. In: Zani S, Cerioli A, Riani M, Vichi M (eds) Data analysis, classification and the forward search, pp 163–172
[4] Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49: 803–821 · Zbl 0794.62034
[5] Bock H-H (1996a) Probability models and hypotheses testing in partitioning cluster analysis. In: Arabie P, Hubert LJ, De Soete G (eds) Clustering and classification. World Scientific, River Edge, pp 377–453 · Zbl 1031.62504
[6] Bock H-H (1996b) Probabilistic models in cluster analysis. Comput Stat Data Anal 23: 5–28 · Zbl 0900.62324
[7] Bryant PG (1991) Large-sample results for optimization-based clustering methods. J Classif 8: 31–44
[8] Byers SD, Raftery AE (1998) Nearest neighbor clutter removal for estimating features in spatial point processes. J Am Stat Assoc 93: 577–584 · Zbl 0926.62089
[9] Celeux G, Govaert A (1992a) Classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 13: 315–332 · Zbl 0937.62605
[10] Celeux G, Govaert A (1992b) Gaussian parsimonious clustering models. Pattern Recognit 28: 781–793 · Zbl 05480211
[11] Cerioli A, Riani M, Atkinson AC (2006) Robust classification with categorical variables. In: Rizzi A, Vichi M (eds) Proceedings in computational statistics, pp 507–519
[12] Croux C, Gallopoulos E, Van Aelst S, Zha H (2007) Machine learning and robust data mining. Comput Stat Data Anal 52: 151–154 · Zbl 1452.00020
[13] Cuesta-Albertos JA, Fraiman R (2007) Impartial trimmed k-means for functional data. Comput Stat Data Anal 51: 4864–4877 · Zbl 1162.62377
[14] Cuesta-Albertos JA, Gordaliza A, Matrán C (1997) Trimmed k-means: an attempt to robustify quantizers. Ann Stat 25: 553–576 · Zbl 0878.62045
[15] Cuesta-Albertos JA, Gordaliza A, Matrán C (1998) Trimmed best k-nets. A robustified version of an \(L_\infty\)-based clustering method. Stat Probab Lett 36: 401–413 · Zbl 0894.62078
[16] Cuesta-Albertos JA, García-Escudero LA, Gordaliza A (2002) On the asymptotics of trimmed best k-nets. J Multivar Anal 82: 482–516 · Zbl 1098.62540
[17] Cuesta-Albertos JA, Matrán C, Mayo-Iscar A (2008) Robust estimation in the normal mixture model based on robust clustering. J R Stat Soc Ser B 70: 779–802 · Zbl 05563369
[18] Cuevas A, Febrero M, Fraiman R (2001) Cluster analysis: a further approach based on density estimation. Comput Stat Data Anal 36: 441–459 · Zbl 1053.62537
[19] Dasgupta A, Raftery AE (1998) Detecting features in spatial point processes with clutter via model-based clustering. J Am Stat Assoc 93: 294–302 · Zbl 0906.62105
[20] Davé RN, Krishnapuram R (1997) Robust clustering methods: a unified view. IEEE Trans Fuzzy Syst 5: 270–293
[21] Davies PL, Gather U (1993) The identification of multiple outliers. J Am Stat Assoc 88: 782–801 · Zbl 0797.62025
[22] Ding Y, Dang X, Peng H, Wilkins D (2007) Robust clustering in high dimensional data using statistical depths. BMC Bioinformatics 8(Suppl 7): S8
[23] Donoho DL, Huber PJ (1983) The notion of breakdown point. In: Bickel PJ, Doksum K, Hodges JL Jr (eds) A Festschrift for Erich L. Lehmann. Wadsworth, Belmont, pp 157–184
[24] Estivill-Castro V, Yang J (2004) Fast and robust general purpose clustering algorithms. Data Min Knowl Discov 8: 127–150 · Zbl 02184558
[25] Everitt BS (1977) Cluster analysis. Heinemann Education Books, London
[26] Flury B (1997) A first course in multivariate statistics. Springer-Verlag, New York · Zbl 0879.62052
[27] Forgy E (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21: 768
[28] Fraley C, Raftery AE (1998) How many clusters? Which clustering methods? Answers via model-based cluster analysis. Comput J 41: 578–588 · Zbl 0920.68038
[29] Friedman HP, Rubin J (1967) On some invariant criteria for grouping data. J Am Stat Assoc 62: 1159–1178
[30] Gallegos MT (2002) Maximum likelihood clustering with outliers. In: Jajuga K, Sokolowski A, Bock HH (eds) Classification, clustering and data analysis: recent advances and applications. Springer-Verlag, Berlin, pp 247–255 · Zbl 1032.62059
[31] Gallegos MT, Ritter G (2005) A robust method for cluster analysis. Ann Stat 33: 347–380 · Zbl 1064.62074
[32] Gallegos MT, Ritter G (2009) Trimming algorithms for clustering contaminated grouped data and their robustness. Adv Data Anal Classif 3: 135–167 · Zbl 1284.62372
[33] García-Escudero LA, Gordaliza A (1999) Robustness properties of k-means and trimmed k-means. J Am Stat Assoc 94: 956–969 · Zbl 1072.62547
[34] García-Escudero LA, Gordaliza A (2005a) Generalized radius processes for elliptically contoured distributions. J Am Stat Assoc 100: 1036–1045 · Zbl 1117.62339
[35] García-Escudero LA, Gordaliza A (2005b) A proposal for robust curve clustering. J Classif 22: 185–201 · Zbl 1336.62179
[36] García-Escudero LA, Gordaliza A (2007) The importance of the scales in heterogeneous robust clustering. Comput Stat Data Anal 51: 4403–4412 · Zbl 1162.62379
[37] García-Escudero LA, Gordaliza A, Matrán C (1999) A central limit theorem for multivariate generalized trimmed k-means. Ann Stat 27: 1061–1079 · Zbl 0984.62042
[38] García-Escudero LA, Gordaliza A, Matrán C (2003) Trimming tools in exploratory data analysis. J Comput Graph Stat 12: 434–449
[39] García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2008) A general trimming approach to robust cluster analysis. Ann Stat 36: 1324–1345 · Zbl 1360.62328
[40] García-Escudero LA, Gordaliza A, San Martín R, Van Aelst S, Zamar R (2009) Robust linear clustering. J R Stat Soc Ser B 71: 301–318 · Zbl 1231.62112
[41] García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2010) Exploring the number of groups in robust model-based clustering. (submitted.) Preprint http://www.eio.uva.es/infor/personas/langel.html · Zbl 1284.62375
[42] Gordaliza A (1991) Best approximations to random variables based on trimming procedures. J Approx Theory 64: 162–180 · Zbl 0745.41030
[43] Gordon AD (1981) Classification. Chapman and Hall, London
[44] Hampel FR, Rousseeuw PJ, Ronchetti E, Stahel WA (1986) Robust statistics, the approach based on the influence function. Wiley, New York
[45] Hardin J, Rocke D (2004) Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Comput Stat Data Anal 44: 625–638 · Zbl 1430.62133
[46] Hathaway RJ (1985) A constrained formulation of maximum likelihood estimation for normal mixture distributions. Ann Stat 13: 795–800 · Zbl 0576.62039
[47] Hennig C (2003) Clusters, outliers, and regression: fixed point clusters. J Multivar Anal 86: 183–212 · Zbl 1020.62051
[48] Hennig C (2004) Breakdown points for maximum likelihood-estimators of location-scale mixtures. Ann Stat 32: 1313–1340 · Zbl 1047.62063
[49] Hennig C (2008) Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. J Multivar Anal 99: 1154–1176 · Zbl 1141.62052
[50] Huber PJ (1981) Robust statistics. Wiley, New York · Zbl 0536.62025
[51] Jiang MF, Tseng SS, Su CM (2001) Two-phase clustering process for outliers detection. Pattern Recognit Lett 22: 691–700 · Zbl 1010.68908
[52] Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York · Zbl 1345.62009
[53] Kumar M, Orlin JB (2008) Scale-invariant clustering with minimum volume ellipsoids. Comput Oper Res 35: 1017–1029 · Zbl 1142.62042
[54] Markatou M (2000) Mixture models, robustness, and the weighted likelihood methodology. Biometrics 56: 483–486 · Zbl 1060.62511
[55] Maronna R (2005) Principal components and orthogonal regression based on robust scales. Technometrics 47: 264–273
[56] Maronna R, Jacovkis PM (1974) Multivariate clustering procedures with variable metrics. Biometrics 30: 499–505 · Zbl 0285.62036
[57] Massart DL, Plastria E, Kaufman L (1983) Non-hierarchical clustering with MASLOC. Pattern Recognit 16: 507–516
[58] McLachlan G, Peel D (2000) Finite mixture models. Wiley, New York · Zbl 0963.62061
[59] McLachlan GJ, Ng S-K, Bean R (2006) Robust cluster analysis via mixture models. Austrian J Stat 35: 157–174
[60] Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50: 159–179
[61] Müller DW, Sawitzki G (1991) Excess mass estimates and tests for multimodality. J Am Stat Assoc 86: 738–746 · Zbl 0733.62040
[62] Neykov N, Filzmoser P, Dimova R, Neytchev P (2007) Robust fitting of mixtures using the trimmed likelihood estimator. Comput Stat Data Anal 52: 299–308 · Zbl 1328.62033
[63] Perrotta D, Riani M, Torti F (2009) New robust dynamic plots for regression mixture detection. Adv Data Anal Classif 3: 263–279 · Zbl 1306.62079
[64] Polonik W (1995) Measuring mass concentrations and estimating density contour clusters: an excess mass approach. Ann Stat 23: 855–881 · Zbl 0841.62045
[65] Rocke DM, Woodruff DM (1996) Identification of outliers in multivariate data. J Am Stat Assoc 91: 1047–1061 · Zbl 0882.62049
[66] Rocke DM, Woodruff DM (2002) Computational connections between robust multivariate analysis and clustering. In: Härdle W, Rönz B (eds) COMPSTAT 2002 proceedings in computational statistics. Physica-Verlag, Heidelberg, pp 255–260
[67] Rousseeuw PJ (1985) Multivariate estimation with high breakdown point. In: Grossmann W, Pflug G, Vincze I, Wertz W (eds) Mathematical statistics and applications. Reidel, Dordrecht, pp 283–297
[68] Rousseeuw PJ, Leroy AM (1987) Robust regression and outlier detection. Wiley-Interscience, New York · Zbl 0711.62030
[69] Rousseeuw PJ, Van Driessen K (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41: 212–223
[70] Rousseeuw PJ, Van Driessen K (2000) An algorithm for positive-breakdown regression based on concentration steps. In: Gaul W, Opitz O, Schader M (eds) Data analysis: scientific modeling and practical application. Springer Verlag, New York, pp 335–446
[71] Santos-Pereira CM, Pires AM (2002) Detection of outliers in multivariate data, a method based on clustering and robust estimators. In: Proceedings in computational statistics, pp 291–296
[72] Schyns M, Haesbroeck G, Critchley F (2010) RelaxMCD: smooth optimisation for the minimum covariance determinant estimator. Comput Stat Data Anal 54: 843–857 · Zbl 1464.62156
[73] Späth H (1975) Cluster-Analyse-Algorithmen zur Objektklassifizierung und Datenreduktion. Oldenbourg Verlag, München, Wien · Zbl 0308.62044
[74] Van Aelst S, Wang X, Zamar RH, Zhu R (2006) Linear grouping using orthogonal regression. Comput Stat Data Anal 50: 1287–1312 · Zbl 1431.62273
[75] Vinod HD (1969) Integer programming and the theory of grouping. J Am Stat Assoc 64: 506–519 · Zbl 0272.90050
[76] Willems G, Joe H, Zamar R (2009) Diagnosing multivariate outliers detected by robust estimators. J Comput Graph Stat 18: 73–91
[77] Woodruff DL, Reiners T (2004) Experiments with, and on, algorithms for maximum likelihood clustering. Comput Stat Data Anal 47: 237–253 · Zbl 1429.62269
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.