
Assessing trimming methodologies for clustering linear regression data. (English) Zbl 1459.62010

Summary: We assess the performance of state-of-the-art robust clustering tools for regression structures under a variety of different data configurations. We focus on two methodologies that use trimming and restrictions on group scatters as their main ingredients. We also give particular care to the data generation process through the development of a flexible simulation tool for mixtures of regressions, where the user can control the degree of overlap between the groups. Level of trimming and restriction factors are input parameters for which appropriate tuning is required. Since we find that incorrect specification of the second-level trimming in the Trimmed CLUSTering REGression model (TCLUST-REG) can deteriorate the performance of the method, we propose an improvement where the second-level trimming is not fixed in advance but is data dependent. We then compare our adaptive version of TCLUST-REG with the Trimmed Cluster Weighted Restricted Model (TCWRM) which provides a powerful extension of the robust clusterwise regression methodology. Our overall conclusion is that the two methods perform comparably, but with notable differences due to the inherent degree of modeling implied by them.
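The two ingredients named in the summary, trimming and a restriction on group scatters, can be illustrated with a minimal sketch. It is not the TCLUST-REG or TCWRM implementation from FSDA or tclust: the function name `trimmed_clusterwise_fit` and all its parameters are hypothetical, only first-level trimming is shown (no second-level trimming on the covariates), and the variance restriction is a crude clamp rather than the optimal truncation used in the literature.

```python
import numpy as np

def trimmed_clusterwise_fit(x, y, k=2, alpha=0.1, c=10.0, n_iter=30, seed=0):
    """Illustrative trimmed clusterwise regression for one covariate.

    alpha : fraction of observations trimmed (first-level trimming only)
    c     : upper bound on the ratio of largest to smallest error variance
    Returns labels (-1 marks trimmed points), coefficients, and variances.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    X = np.column_stack([np.ones(n), x])      # design matrix with intercept
    p = X.shape[1]
    n_trim = int(np.floor(alpha * n))

    # Initialise each group's line from a random subset of p observations.
    betas = np.empty((k, p))
    for g in range(k):
        idx = rng.choice(n, size=p, replace=False)
        betas[g] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    s2 = np.ones(k)                           # error variances
    pi = np.full(k, 1.0 / k)                  # group weights

    for _ in range(n_iter):
        # Weighted Gaussian log-density of every residual under every group.
        resid = y[:, None] - X @ betas.T
        logd = (np.log(np.maximum(pi, 1e-300))
                - 0.5 * np.log(2.0 * np.pi * s2)
                - 0.5 * resid ** 2 / s2)
        labels = logd.argmax(axis=1)
        best = logd.max(axis=1)

        # Trimming: discard the n_trim observations that fit worst.
        labels[np.argsort(best)[:n_trim]] = -1

        # Refit each group on its retained observations.
        for g in range(k):
            idx = labels == g
            if idx.sum() > p:
                betas[g] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
                s2[g] = np.mean((y[idx] - X[idx] @ betas[g]) ** 2)
            pi[g] = idx.sum() / (n - n_trim)

        # Simplified scatter restriction: enforce max(s2) / min(s2) <= c.
        s2 = np.maximum(s2, 1e-12)
        s2 = np.maximum(s2, s2.max() / c)

    return labels, betas, s2
```

In this sketch a larger `alpha` discards more observations as potential outliers, and a smaller `c` forces the groups toward equal error variances. Both are tuning inputs, which is the role the summary assigns to the trimming level and restriction factor.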

MSC:

62-08 Computational methods for problems pertaining to statistics
62J05 Linear regression; mixed models
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F35 Robustness and adaptive procedures (parametric inference)

Software:

FSDA; MixSim; TCLUST; AS 155
