Trimming algorithms for clustering contaminated grouped data and their robustness. (English) Zbl 1284.62372

Summary: We establish an affine equivariant, constrained heteroscedastic model and criterion with trimming for clustering contaminated, grouped data. We show existence of the maximum likelihood estimator, propose a method for determining an appropriate constraint, and design a strategy for finding reasonable clusterings. Finally, we compute breakdown points of the estimated parameters, thereby showing asymptotic robustness of the method.
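
The idea of trimming in clustering can be illustrated with a minimal sketch. The code below is not the constrained heteroscedastic criterion of the paper; it assumes the much simpler spherical trimmed k-means setting, in which a fixed fraction of the points farthest from their nearest center is discarded at every step. The names `trimmed_kmeans`, `_assign`, `alpha`, and `n_init` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def _assign(X, centers, n_keep):
    """Assign points to nearest centers and keep the n_keep closest ones."""
    # Squared Euclidean distances, shape (n_points, n_centers).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    nearest = d2[np.arange(X.shape[0]), labels]
    # Trimming step: indices of the n_keep points closest to their center.
    keep = np.argsort(nearest)[:n_keep]
    return labels, keep, nearest[keep].sum()


def trimmed_kmeans(X, k, alpha, n_init=10, n_iter=100, seed=0):
    """Sketch of trimmed k-means (assumed simplification, spherical clusters).

    Returns (centers, labels); discarded points receive the label -1.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_keep = n - int(np.floor(alpha * n))  # number of retained points
    best = None
    for _restart in range(n_init):
        # Random initial centers drawn from the data.
        centers = X[rng.choice(n, size=k, replace=False)].copy()
        for _step in range(n_iter):
            labels, keep, _cost = _assign(X, centers, n_keep)
            new_centers = centers.copy()
            for j in range(k):
                members = keep[labels[keep] == j]
                if members.size:  # leave an empty cluster's center unchanged
                    new_centers[j] = X[members].mean(axis=0)
            converged = np.allclose(new_centers, centers)
            centers = new_centers
            if converged:
                break
        # Score the restart by the trimmed within-cluster sum of squares.
        labels, keep, cost = _assign(X, centers, n_keep)
        if best is None or cost < best[0]:
            out = np.full(n, -1)
            out[keep] = labels[keep]
            best = (cost, centers, out)
    return best[1], best[2]
```

Several random restarts are used because a single start can place a center on an outlier and remain there; the restart with the smallest trimmed sum of squares is returned. A call such as `trimmed_kmeans(X, k=2, alpha=0.05)` discards 5% of the observations.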


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F35 Robustness and adaptive procedures (parametric inference)

