×

Using combinatorial optimization in model-based trimmed clustering with cardinality constraints. (English) Zbl 1464.62075

Summary: Statistical clustering criteria with free scale parameters and unknown cluster sizes are inclined to create small, spurious clusters. To mitigate this tendency a statistical model for cardinality-constrained clustering of data with gross outliers is established, its maximum likelihood and maximum a posteriori clustering criteria are derived, and their consistency and robustness are analyzed. The criteria lead to constrained optimization problems that can be solved by using iterative, alternating trimming algorithms of \(k\)-means type. Each step in the algorithms requires the solution of a \(\lambda\)-assignment problem known from combinatorial optimization. The method allows one to estimate the numbers of clusters and outliers. It is illustrated with a synthetic data set and a real one.

MSC:

62-08 Computational methods for problems pertaining to statistics
62F35 Robustness and adaptive procedures (parametric inference)
62H30 Classification and discrimination; cluster analysis (statistical aspects)
90C27 Combinatorial optimization
PDF BibTeX XML Cite
Full Text: DOI

References:

[1] Achatz, H.; Kleinschmidt, P.; Paparrizos, K., A dual forest algorithm for the assignment problem, (), 1-10 · Zbl 0743.90077
[2] Ahuja, R.K.; Orlin, J.B.; Stein, C.; Tarjan, R.E., Improved algorithms for bipartite network flows, SIAM J. comput., 23, 906-933, (1994) · Zbl 0840.90063
[3] Anderson, E., The irises of the gaspé peninsula, Bull. amer. iris soc., 59, 2-5, (1935)
[4] Aurenhammer, F.; Hoffmann, F.; Aronov, B., Minkowski-type theorems and least-squares clustering, Algorithmica, 20, 61-76, (1998) · Zbl 0895.68135
[5] Barnett, V.; Lewis, T., Outliers in statistical data, (1994), Wiley Chichester, UK · Zbl 0801.62001
[6] Becker, C.; Gather, U., The masking breakdown point of multivariate outlier identification rules, Jasa, 94, 947-955, (1999) · Zbl 1072.62600
[7] Bernholt, T.; Fischer, P., The complexity of computing the MCD-estimator, Theoret. comput. sci., 326, 383-398, (2004) · Zbl 1054.62028
[8] Bezdek, J.C.; Keller, J.; Krisnapuram, R.; Pal, N.R., ()
[9] Bock, H.-H., On some significance tests in cluster analysis, J. classification, 2, 77-108, (1985) · Zbl 0587.62048
[10] Carpaneto, G.; Martello, S.; Toth, P., Algorithms and codes for the assignment problem, Ann. oper. res., 13, 193-223, (1988)
[11] Chen, H.; Chen, J.; Kalbfleisch, J.D., Testing for a finite mixture model with two components, J. roy. stat. soc. ser. B, 66, 95-115, (2004) · Zbl 1061.62025
[12] Coleman, D.A.; Dong, X.; Hardin, J.; Rocke, D.M.; Woodruff, D.L., Some computational issues in cluster analysis with no a priori metric, Csda, 31, 1-11, (1999) · Zbl 0942.62068
[13] Cook, W.J.; Cunningham, W.H.; Pulleyblank, W.R.; Schrijver, A., Combinatorial optimization, (1998), Wiley New York etc · Zbl 0909.90227
[14] Cuesta-Albertos, J.A.; Gordaliza, A.; Matrán, C., Trimmed k-means: an attempt to robustify quantizers, Ann. statist., 25, 553-576, (1997) · Zbl 0878.62045
[15] Davies, L.; Gather, U., The identification of multiple outliers, Jasa, 88, 782-792, (1993) · Zbl 0797.62025
[16] Edmonds, J.; Karp, R., Theoretical improvements in algorithmic efficiency for network flow problems, J. ACM, 19, 248-264, (1972) · Zbl 0318.90024
[17] Fisher, R.A., The use of multiple measurements in taxonomic problems, Ann. eugenics, 7, 179-188, (1936)
[18] Fraley, C.; Raftery, A.E., Model-based clustering, discriminant analysis, and density estimation, Jasa, 97, 611-631, (2002) · Zbl 1073.62545
[19] Friedman, H.; Rubin, J., On some invariant criteria for grouping data, Jasa, 62, 1159-1178, (1967)
[20] Fulkerson, D., An out-of-kilter method for minimal cost flow problems, J. SIAM, 9, 18-27, (1961) · Zbl 0112.12401
[21] Gabow, H.N.; Tarjan, R.E., Faster scaling algorithms for network problems, SIAM J. comput., 18, 1013-1036, (1989) · Zbl 0679.68079
[22] Gallegos, M.T.; Ritter, G., A robust method for cluster analysis, Ann. statist., 33, 347-380, (2005) · Zbl 1064.62074
[23] Gallegos, M.T.; Ritter, G., Trimming algorithms for clustering contaminated grouped data and their robustness, Adv. data anal. classif., 3, 135-167, (2009) · Zbl 1284.62372
[24] García-Escudero, L.A.; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A., A general trimming approach to robust cluster analysis, Ann. stat., 36, 1324-1345, (2008) · Zbl 1360.62328
[25] Garey, M.R.; Johnson, D.S., Computers and intractability, (1979), Freeman San Francisco · Zbl 0411.68039
[26] Goldberg, A.V.; Tarjan, R.E., Finding minimum-cost circulations by successive approximation, Math. oper. res., 15, 430-466, (1990) · Zbl 0727.90024
[27] Gordon, A.D., ()
[28] Gusfield, D.; Martel, C.; Fernandez-Baca, D., Fast algorithms for bipartite network flow, SIAM J. comput., 16, 237-251, (1987) · Zbl 0617.90082
[29] Hartigan, J.A., Asymptotic distributions for clustering criteria, Ann. stat., 6, 117-131, (1978) · Zbl 0377.62033
[30] Hathaway, R.J., A constrained formulation of maximum-likelihood estimation for normal mixture distributions, Ann. stat., 13, 795-800, (1985) · Zbl 0576.62039
[31] Hung, M.S.; Rom, W.O., Solving the assignment problem by relaxation, Oper. res., 28, 969-982, (1980) · Zbl 0441.90062
[32] Jancey, R., Multidimensional group analysis, Austral. J. botany, 14, 127-130, (1966)
[33] Kéribin, C., Consistent estimation of the order of mixture models, Sankhyā 62 ser. A, 49-66, (2000) · Zbl 1081.62516
[34] Kleinschmidt, P.; Lee, C.W.; Schannath, H., Transportation problems which can be solved by use of Hirsch paths for the dual problem, Math. program., 37, 153-168, (1987) · Zbl 0642.90070
[35] Knuth, D.E., The art of computer programming, vol. 2, (1981), Addison-Wesley Reading, Menlo Park, London, Amsterdam, Don Mills, Sydney
[36] Lawler, E., Combinatorial optimization: networks and matroids, (1976), Holt, Rinehart and Winston New York · Zbl 0413.90040
[37] Li, B., A new approach to cluster analysis: the clustering-function-based method, J. R. stat. soc., series B, 68, 457-476, (2006) · Zbl 1100.62068
[38] MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. Math. Statist. Probab. pp. 281-297 · Zbl 0214.46201
[39] Mardia, K.; Kent, T.; Bibby, J., Multivariate analysis, (1979), Academic Press London, New York, Toronto, Sydney, San Francisco · Zbl 0432.62029
[40] McLachlan, G.J.; Peel, D., Finite mixture models, (2000), Wiley New York etc · Zbl 0963.62061
[41] Mecklin, C.J.; Mundfrom, D.J., An appraisal and bibliography of tests for multivariate normality, Internat. statist. rev., 72, 1, 123-138, (2004) · Zbl 1211.62095
[42] Milligan, G.; Cooper, M., An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 159-179, (1985)
[43] Neykov, N.; Filzmoser, P.; Dimova, R.; Neytchev, P., Robust Fitting of mixtures using the trimmed likelihood estimator, Csda, 52, 299-308, (2007) · Zbl 1328.62033
[44] Orlin, J.B., 1988. A faster strongly polynomial minimum cost flow algorithm. In: Proc. 20th ACM Symp. Theory of Computing, pp. 377-387
[45] Papadimitriou, C.H.; Steiglitz, K., Combinatorial optimization, (1982), Prentice-Hall Englewood Cliffs, New Jersey · Zbl 0503.90060
[46] Pollard, D., Strong consistency of \(k\)-means clustering, Ann. stat., 9, 135-140, (1981) · Zbl 0451.62048
[47] Pollard, D., A central limit theorem for \(k\)-means clustering, Ann. probab., 10, 919-926, (1982) · Zbl 0502.62055
[48] Rocke, D.M.; Woodruff, D.L., Identification of outliers in multivariate data, Jasa, 91, 1047-1061, (1996) · Zbl 0882.62049
[49] Rocke, D.M., Woodruff, D.L., 1999. A synthesis of outlier detection and cluster identification. Tech. rep., University of California, Davis http://handel.cipic.ucdavis.edu/ dmrocke/Synth5.pdf
[50] Rousseeuw, P.J., Multivariate estimation with high breakdown point, (), 283-297
[51] Rousseeuw, P.J.; Van Driessen, K., A fast algorithm for the minimum covariance determinant estimator, Technometrics, 41, 212-223, (1999)
[52] Schroeder, A., Analyse d’un mélange de distributions de probabilités de même type, Rev. statist. appl., 24, 39-62, (1976)
[53] Schwarz, G., Estimating the dimension of a model, Ann. stat., 6, 461-464, (1978) · Zbl 0379.62005
[54] Scott, A.; Symons, M.J., Clustering methods based on likelihood ratio criteria, Biometrics, 27, 387-397, (1971)
[55] Sedgewick, R., ()
[56] Steinhaus, H., Sur la division des corps matériels en parties, Bull. acad. polon. sci., 4, 801-804, (1956) · Zbl 0079.16403
[57] Symons, M.J., Clustering criteria and multivariate normal mixtures, Biometrics, 37, 35-43, (1981) · Zbl 0473.62048
[58] Tibshirani, R.; Walther, G.; Hastie, T., Estimating the number of clusters in a data set via the gap statistic, J. R. stat. soc. ser. B, 63, 411-423, (2001) · Zbl 0979.62046
[59] Tokuyama, T., Nakano, J., 1991. Geometric algorithms for a minimum cost assignment problem. In: Proc. 7th ACM Symp. on Computational Geometry, pp. 262-271
[60] Tokuyama, T.; Nakano, J., Efficient algorithms for the hitchcock transportation problem, SIAM J. comput., 24, 563-578, (1995) · Zbl 0831.68109
[61] Tokuyama, T.; Nakano, J., Geometric algorithms for the minimum cost assignment problem, Random structures algorithms, 6, 393-406, (1995) · Zbl 0839.68073
[62] Tso, M.; Kleinschmidt, P.; Mitterreiter, I.; Graham, J., An efficient transportation algorithm for automatic chromosome karyotyping, Pattern. recog. lett., 12, 117-126, (1991)
[63] Ward, J.H., Hierarchical grouping to optimize an objective function, Jasa, 58, 236-244, (1963)
[64] Wolfe, J., Pattern clustering by multivariate mixture analysis, Multivariate behavioral res., 5, 329-350, (1970)
[65] Woodruff, D.L.; Reiners, T., Experiments with, and on, algorithms for maximum likelihood clustering, Csda, 47, 237-252, (2004) · Zbl 1429.62269
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.