A robust method for cluster analysis. (English) Zbl 1064.62074

Summary: Consider a contaminated list of \(n\) observations in \(\mathbb{R}^d\) drawn from \(g\) different normally distributed populations with a common covariance matrix. We compute the maximum likelihood (ML) estimator of the parameters of the \(g\) populations under a statistical model that allows for \(n-r\) outliers; the estimator detects the outliers and simultaneously partitions the remaining \(r\) observations into \(g\) clusters. It turns out that the estimator unites the minimum covariance determinant rejection method and the well-known pooled determinant criterion of cluster analysis. We also propose an efficient algorithm for approximating this estimator and study its breakdown points for the mean values and for the pooled within-groups sum of squares and products matrix.
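
The criterion described in the summary can be illustrated with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' algorithm: it assumes the estimator amounts to choosing \(r\) retained observations and a partition of them into \(g\) clusters so that the determinant of the pooled within-groups sum of squares and products matrix \(W\) is small, and it uses a simple concentration-style iteration (re-estimate, reassign by Mahalanobis distance, keep the \(r\) closest points). The names `pooled_det`, `concentration_step` and `fit_trimmed` are hypothetical, and the label `-1` for outliers is a convention chosen here.

```python
import numpy as np

def pooled_scatter(X, labels, g):
    """Pooled within-groups SSP matrix W of the retained points (labels >= 0)."""
    d = X.shape[1]
    W = np.zeros((d, d))
    for j in range(g):
        C = X[labels == j]
        if len(C):
            D = C - C.mean(axis=0)
            W += D.T @ D
    return W

def pooled_det(X, labels, g):
    """Value of the assumed criterion: det W for the given labelling."""
    return float(np.linalg.det(pooled_scatter(X, labels, g)))

def concentration_step(X, labels, g, r):
    """Re-estimate means and pooled covariance, then keep the r closest points."""
    n, d = X.shape
    S_inv = np.linalg.inv(pooled_scatter(X, labels, g) / r)  # assumes W is nonsingular
    d2 = np.full((n, g), np.inf)  # an empty cluster stays at infinite distance
    for j in range(g):
        C = X[labels == j]
        if len(C):
            diff = X - C.mean(axis=0)
            d2[:, j] = np.einsum("nd,de,ne->n", diff, S_inv, diff)
    nearest = d2.argmin(axis=1)      # closest cluster for every observation
    keep = np.argsort(d2.min(axis=1))[:r]  # r observations with smallest distance
    new = np.full(n, -1)             # -1 marks a discarded observation (outlier)
    new[keep] = nearest[keep]
    return new

def fit_trimmed(X, g, r, n_iter=100, seed=0):
    """Iterate concentration steps from a random start until labels stop changing."""
    rng = np.random.default_rng(seed)
    labels = np.full(len(X), -1)
    keep = rng.choice(len(X), size=r, replace=False)
    labels[keep] = rng.integers(0, g, size=r)
    for _ in range(n_iter):
        new = concentration_step(X, labels, g, r)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels, pooled_det(X, labels, g)
```

A single run only reaches a local optimum that depends on the random start, so in practice one would call `fit_trimmed` with several seeds and keep the labelling with the smallest returned determinant.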


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F35 Robustness and adaptive procedures (parametric inference)
65C60 Computational problems in statistics (MSC2010)

