
A robust approach to model-based classification based on trimming and constraints. Semi-supervised learning in presence of outliers and label noise. (English) Zbl 1474.62215

Summary: In a standard classification framework, a set of trustworthy learning data is employed to build a decision rule, with the final aim of classifying unlabelled units belonging to the test set. Unreliable labelled observations, namely outliers and data with incorrect labels, can therefore strongly undermine the classifier performance, especially when the training size is small. The present work introduces a robust modification to the model-based classification framework, employing impartial trimming and constraints on the ratio between the maximum and the minimum eigenvalue of the group scatter matrices. The proposed method effectively handles noise in both the response and the explanatory variables, providing reliable classification even on contaminated datasets. A robust information criterion is proposed for model selection. Experiments on real and simulated data, artificially adulterated, illustrate the benefits of the proposed method.
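The two robustification devices mentioned in the summary can be illustrated in isolation. The sketch below is not the authors' algorithm, only a simplified illustration under assumptions: the eigenvalue-ratio constraint is enforced by naively clipping each scatter matrix's eigenvalues into [λ_max/c, λ_max] (the actual constrained estimator chooses the truncation level optimally, as in García-Escudero et al. 2008), and impartial trimming is shown as discarding the proportion α of observations with the lowest likelihood contributions at a given iteration.

```python
import numpy as np

def constrain_eigenvalues(S, c):
    """Enforce max(eig)/min(eig) <= c on a symmetric scatter matrix S.

    Simplified sketch: small eigenvalues are raised to lambda_max / c;
    the method in the paper determines the truncation level optimally.
    """
    vals, vecs = np.linalg.eigh(S)          # eigendecomposition of S
    clipped = np.clip(vals, vals.max() / c, None)
    return vecs @ np.diag(clipped) @ vecs.T  # reconstruct constrained matrix

def impartial_trim(loglik_contrib, alpha):
    """Return indices of the ceil((1 - alpha) * n) observations with the
    highest log-likelihood contributions; the rest are trimmed as
    potential outliers or label-noise cases."""
    n = len(loglik_contrib)
    keep = int(np.ceil((1.0 - alpha) * n))
    return np.argsort(loglik_contrib)[::-1][:keep]
```

For example, with c = 5 a diagonal scatter matrix diag(10, 1, 0.1) has its two smallest eigenvalues raised to 2, so the constrained ratio is exactly 5; with α = 0.25 and four observations, the single lowest-likelihood unit is trimmed before the parameter-update step.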

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F35 Robustness and adaptive procedures (parametric inference)
62C25 Compound decision problems in statistical decision theory
68T05 Learning and adaptive systems in artificial intelligence

References:

[1] Aitken, AC, A series formula for the roots of algebraic and transcendental equations, Proc R Soc Edinb, 45, 1, 14-22 (1926)
[2] Codex Alimentarius, Revised codex standard for honey, Codex Stan, 12, 1982 (2001)
[3] Banfield, JD; Raftery, AE, Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 3, 803 (1993) · Zbl 0794.62034
[4] Bensmail, H.; Celeux, G., Regularized Gaussian discriminant analysis through eigenvalue decomposition, J Am Stat Assoc, 91, 436, 1743-1748 (1996) · Zbl 0885.62068
[5] Bohning, D.; Dietz, E.; Schaub, R.; Schlattmann, P.; Lindsay, BG, The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family, Ann Inst Stat Math, 46, 2, 373-388 (1994) · Zbl 0802.62017
[6] Bouveyron, C.; Girard, S., Robust supervised classification with mixture models: learning from data with uncertain labels, Pattern Recognit, 42, 11, 2649-2658 (2009) · Zbl 1175.68313
[7] Browne, RP; McNicholas, PD, Estimating common principal components in high dimensions, Adv Data Anal Classif, 8, 217-226 (2014) · Zbl 1474.62183
[8] Cattell, RB, The scree test for the number of factors, Multivar Behav Res, 1, 2, 245-276 (1966)
[9] Celeux, G.; Govaert, G., Gaussian parsimonious clustering models, Pattern Recognit, 28, 5, 781-793 (1995)
[10] Cerioli, A.; García-Escudero, LA; Mayo-Iscar, A.; Riani, M., Finding the number of normal groups in model-based clustering via constrained likelihoods, J Comput Gr Stat, 27, 2, 404-416 (2018)
[11] Cortes, C.; Vapnik, V., Support-vector networks, Mach Learn, 20, 3, 273-297 (1995) · Zbl 0831.68098
[12] Cuesta-Albertos, JA; Gordaliza, A.; Matrán, C., Trimmed k-means: an attempt to robustify quantizers, Ann Stat, 25, 2, 553-576 (1997) · Zbl 0878.62045
[13] Dean, N.; Murphy, TB; Downey, G., Using unlabelled data to update classification rules with applications in food authenticity studies, J R Stat Soc Ser C Appl Stat, 55, 1, 1-14 (2006) · Zbl 05188723
[14] Dempster, A.; Laird, N.; Rubin, D., Maximum likelihood from incomplete data via the EM algorithm, J R Stat Soc, 39, 1, 1-38 (1977) · Zbl 0364.62022
[15] Dotto, F.; Farcomeni, A., Robust inference for parsimonious model-based clustering, J Stat Comput Simul, 89, 3, 414-442 (2019) · Zbl 07193731
[16] Dotto, F.; Farcomeni, A.; García-Escudero, LA; Mayo-Iscar, A., A reweighting approach to robust clustering, Stat Comput, 28, 2, 477-493 (2018) · Zbl 1384.62193
[17] Downey, G., Authentication of food and food ingredients by near infrared spectroscopy, J Near Infrared Spectrosc, 4, 1, 47 (1996)
[18] Scrucca, L.; Fop, M.; Murphy, TB; Raftery, AE, mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, R J, 8, 1, 289-317 (2016)
[19] Fraley, C.; Raftery, AE, Model-based clustering, discriminant analysis, and density estimation, J Am Stat Assoc, 97, 458, 611-631 (2002) · Zbl 1073.62545
[20] Freund, Y.; Schapire, RE, A decision-theoretic generalization of on-line learning and an application to boosting, J Comput Syst Sci, 55, 1, 119-139 (1997) · Zbl 0880.68103
[21] Fritz, H.; García-Escudero, LA; Mayo-Iscar, A., tclust : an R package for a trimming approach to cluster analysis, J Stat Softw, 47, 12, 1-26 (2012)
[22] Fritz, H.; García-Escudero, LA; Mayo-Iscar, A., A fast algorithm for robust constrained clustering, Comput Stat Data Anal, 61, 124-136 (2013) · Zbl 1349.62264
[23] Gallegos MT (2002) Maximum likelihood clustering with outliers. In: Classification, clustering, and data analysis, Springer, pp 247-255 · Zbl 1032.62059
[24] García-Escudero, LA; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A., A general trimming approach to robust cluster analysis, Ann Stat, 36, 3, 1324-1345 (2008) · Zbl 1360.62328
[25] García-Escudero, LA; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A., A review of robust clustering methods, Adv Data Anal Classif, 4, 2-3, 89-109 (2010) · Zbl 1284.62375
[26] García-Escudero, LA; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A., Exploring the number of groups in robust model-based clustering, Stat Comput, 21, 4, 585-599 (2011) · Zbl 1221.62093
[27] García-Escudero, LA; Gordaliza, A.; Mayo-Iscar, A., A constrained robust proposal for mixture modeling avoiding spurious solutions, Adv Data Anal Classif, 8, 1, 27-43 (2014) · Zbl 1459.62110
[28] García-Escudero, LA; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A., Avoiding spurious local maximizers in mixture modeling, Stat Comput, 25, 3, 619-633 (2015) · Zbl 1331.62100
[29] García-Escudero, LA; Gordaliza, A.; Greselin, F.; Ingrassia, S.; Mayo-Iscar, A., The joint role of trimming and constraints in robust estimation for mixtures of Gaussian factor analyzers, Comput Stat Data Anal, 99, 131-147 (2016) · Zbl 1468.62060
[30] García-Escudero, LA; Gordaliza, A.; Greselin, F.; Ingrassia, S.; Mayo-Iscar, A., Eigenvalues and constraints in mixture modeling: geometric and computational issues, Adv Data Anal Classif, 12, 1-31 (2017)
[31] Gordaliza, A., Best approximations to random variables based on trimming procedures, J Approx Theory, 64, 2, 162-180 (1991) · Zbl 0745.41030
[32] Gordaliza, A., On the breakdown point of multivariate location estimators based on trimming procedures, Stat Probab Lett, 11, 5, 387-394 (1991) · Zbl 0732.62051
[33] Hastie, T.; Tibshirani, R., Discriminant analysis by Gaussian mixtures, J R Stat Soc Ser B (Methodol), 58, 1, 155-176 (1996) · Zbl 0850.62476
[34] Hawkins, DM; McLachlan, GJ, High-breakdown linear discriminant analysis, J Am Stat Assoc, 92, 437, 136 (1997) · Zbl 0889.62052
[35] Hickey, RJ, Noise modelling and evaluating learning from examples, Artif Intell, 82, 1-2, 157-179 (1996)
[36] Hubert, M.; Debruyne, M.; Rousseeuw, PJ, Minimum covariance determinant and extensions, Wiley Interdiscip Rev Comput Stat, 10, 3, 1-11 (2018)
[37] Ingrassia, S., A likelihood-based constrained algorithm for multivariate normal mixture models, Stat Methods Appl, 13, 2, 151-166 (2004) · Zbl 1205.62066
[38] Kelly, JD; Petisco, C.; Downey, G., Application of Fourier transform midinfrared spectroscopy to the discrimination between Irish artisanal honey and such honey adulterated with various sugar syrups, J Agric Food Chem, 54, 17, 6166-6171 (2006)
[39] Mardia, KV; Kent, JT; Bibby, JM, Multivariate analysis (1979), New York: Academic Press, New York
[40] Maronna, R.; Jacovkis, PM, Multivariate clustering procedures with variable metrics, Biometrics, 30, 3, 499 (1974) · Zbl 0285.62036
[41] McLachlan, GJ, Discriminant analysis and statistical pattern recognition (1992), Hoboken: Wiley, Hoboken
[42] McLachlan, GJ; Krishnan, T., The EM algorithm and extensions (2008), Hoboken: Wiley, Hoboken
[43] McLachlan GJ, Peel D (1998) Robust cluster analysis via mixtures of multivariate t-distributions. In: Joint IAPR international workshops on statistical techniques in pattern recognition and structural and syntactic pattern recognition. Springer, Berlin, pp 658-666
[44] McNicholas, PD, Mixture model-based classification (2016), Boca Raton: CRC Press, Boca Raton
[45] Menardi, G., Density-based Silhouette diagnostics for clustering methods, Stat Comput, 21, 3, 295-308 (2011) · Zbl 1255.62179
[46] Neykov, N.; Filzmoser, P.; Dimova, R.; Neytchev, P., Robust fitting of mixtures using the trimmed likelihood estimator, Comput Stat Data Anal, 52, 1, 299-308 (2007) · Zbl 1328.62033
[47] Peel, D.; McLachlan, GJ, Robust mixture modelling using the t distribution, Stat Comput, 10, 4, 339-348 (2000)
[48] Prati, RC; Luengo, J.; Herrera, F., Emerging topics and challenges of learning from noisy data in nonstandard classification: a survey beyond binary class noise, Knowl Inf Syst, 60, 1, 63-97 (2019)
[49] R Core Team (2018) R: a language and environment for statistical computing
[50] Rousseeuw, PJ; Driessen, KV, A fast algorithm for the minimum covariance determinant estimator, Technometrics, 41, 3, 212-223 (1999)
[51] Russell N, Cribbin L, Murphy TB (2014) upclass: an R package for updating model-based classification rules. CRAN
[52] Schwarz, G., Estimating the dimension of a model, Ann Stat, 6, 2, 461-464 (1978) · Zbl 0379.62005
[53] Thomson, G., The factorial analysis of human ability, Br J Educ Psychol, 9, 2, 188-195 (1939)
[54] Vanden Branden, K.; Hubert, M., Robust classification in high dimensions based on the SIMCA Method, Chemom Intell Lab Syst, 79, 1-2, 10-21 (2005)
[55] Wu, X., Knowledge acquisition from databases (1995), Westport: Intellect books, Westport
[56] Zhu, X.; Wu, X., Class noise vs. attribute noise: a quantitative study, Artif Intell Rev, 22, 3, 177-210 (2004) · Zbl 1069.68587
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.