
Simulating mixtures of multivariate data with fixed cluster overlap in FSDA library. (English) Zbl 1414.62267

Summary: We extend the capabilities of MixSim, a framework for evaluating the performance of clustering algorithms on the basis of measures of agreement between data partitions and flexible methods for generating data, outliers and noise. The distinctive feature of the method is that data are simulated from normal mixture distributions according to pre-specified summary statistics of an overlap measure, defined as a sum of pairwise misclassification probabilities. We provide new tools that make it possible to control additional overlap statistics and departures from homogeneity and sphericity among groups, together with new outlier contamination schemes. The result of this extension is a more flexible framework for data generation that better addresses modern robust clustering scenarios in the presence of possible contamination. We also study the properties and implications of this new way of simulating clustering data in terms of coverage of the space, goodness of fit to theoretical distributions, and degree of convergence to nominal values. We demonstrate the new features using our MATLAB implementation, which we have integrated into the Flexible Statistics for Data Analysis (FSDA) toolbox for MATLAB. With MixSim, FSDA now integrates in the same environment state-of-the-art robust clustering algorithms and principled routines for their evaluation and calibration. A spin-off of our work is a general routine, translated from C to MATLAB, for computing the distribution function of a linear combination of non-central \(\chi^2\) random variables, which is at the core of MixSim and is of independent interest for many test statistics.
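
To make the overlap measure concrete, the following minimal Python sketch computes the pairwise overlap of two normal components in the special case where both share the same covariance matrix. This is an illustration under stated assumptions, not the FSDA or MixSim implementation: the function names are hypothetical, and the input `delta` is assumed to be the Mahalanobis distance between the two component means. In this homoscedastic case the misclassification probability has the closed form \(\omega_{j|i} = \Phi(-\delta/2 + \ln(\pi_j/\pi_i)/\delta)\); the general case with unequal covariances requires the distribution of a linear combination of non-central \(\chi^2\) variables mentioned in the summary and is not covered here.

```python
import math


def norm_cdf(x):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def misclassification_prob(delta, pi_i, pi_j):
    """Probability that a point drawn from component i is assigned to component j.

    Assumes two normal components with a common covariance matrix;
    delta > 0 is the Mahalanobis distance between their means and
    pi_i, pi_j are the mixing proportions (hypothetical helper).
    """
    return norm_cdf(-delta / 2.0 + math.log(pi_j / pi_i) / delta)


def pairwise_overlap(delta, pi_i=0.5, pi_j=0.5):
    """Pairwise overlap: sum of the two misclassification probabilities."""
    return (misclassification_prob(delta, pi_i, pi_j)
            + misclassification_prob(delta, pi_j, pi_i))


if __name__ == "__main__":
    # Overlap shrinks as the components move apart.
    for d in (1.0, 2.0, 4.0):
        print(d, pairwise_overlap(d))
```

With equal mixing proportions the overlap reduces to \(2\Phi(-\delta/2)\), so fixing a target overlap is equivalent, in this special case only, to fixing the distance between the component means.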

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F35 Robustness and adaptive procedures (parametric inference)
