
Robust variable selection for model-based learning in presence of adulteration. (English) Zbl 07345450

Summary: The problem of identifying the most discriminating features when performing supervised learning has been extensively investigated. In particular, several methods for variable selection have been proposed in model-based classification. The impact of outliers and wrongly labeled units on the determination of relevant predictors has instead received far less attention, with almost no dedicated methodologies available. Two robust variable selection approaches are introduced: one embeds a robust classifier within a greedy-forward selection procedure, while the other builds on the theory of maximum likelihood estimation and irrelevance. The former recasts feature identification as a model selection problem; the latter regards the relevant subset as a model parameter to be estimated. The benefits of the proposed methods over non-robust solutions are assessed via an experiment on synthetic data, and an application to a high-dimensional classification problem on contaminated spectroscopic data is presented.
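As a rough illustration of the first, wrapper-style idea, the sketch below implements a generic greedy-forward variable selection loop around a trimmed Gaussian classification model. This is not the authors' exact procedure: the trimmed log-likelihood, the pooled-versus-grouped comparison, the BIC-style penalty, and all function names are simplifying assumptions chosen only to make the wrapper structure concrete.

```python
import numpy as np

def trimmed_loglik(X, y, alpha=0.1):
    """Fit one Gaussian per class, then sum the log-densities of the
    (1 - alpha) fraction of best-fitting points, discarding the rest.
    Trimming is a crude guard against outliers and mislabelled units."""
    n, d = X.shape
    ll = np.empty(n)
    for g in np.unique(y):
        idx = (y == g)
        mu = X[idx].mean(axis=0)
        cov = np.cov(X[idx], rowvar=False).reshape(d, d) + 1e-6 * np.eye(d)
        diff = X[idx] - mu
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
        ll[idx] = -0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    keep = int(np.ceil((1 - alpha) * n))
    return np.sort(ll)[-keep:].sum()

def greedy_forward_select(X, y, alpha=0.1):
    """Greedy-forward wrapper: at each step add the variable whose
    inclusion most improves a BIC-style criterion comparing the
    class-conditional model against a pooled (no-group) model."""
    n, p = X.shape
    G = len(np.unique(y))
    pooled = np.zeros_like(y)
    penalty = 0.5 * np.log(n)

    def crit(feats):
        d = len(feats)
        extra = (G - 1) * (d + d * (d + 1) / 2)  # extra mean/cov parameters
        Xs = X[:, feats]
        return (trimmed_loglik(Xs, y, alpha)
                - trimmed_loglik(Xs, pooled, alpha)
                - penalty * extra)

    selected, remaining, best = [], list(range(p)), -np.inf
    while remaining:
        scores = {j: crit(selected + [j]) for j in remaining}
        j_star = max(scores, key=scores.get)
        if scores[j_star] <= best:
            break  # no remaining candidate improves the criterion
        best = scores[j_star]
        selected.append(j_star)
        remaining.remove(j_star)
    return selected
```

The feature search is thus recast as model selection: each candidate subset defines a model, and the penalized trimmed likelihood plays the role of the selection criterion.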

MSC:

62-XX Statistics

References:

[1] Andrews, J. L.; McNicholas, P. D., Variable selection for clustering and classification, J. Classification, 31, 2, 136-153 (2014) · Zbl 1360.62310
[2] Banfield, J. D.; Raftery, A. E., Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 3, 803 (1993) · Zbl 0794.62034
[3] Bellman, R., Dynamic Programming (1957), Rand Corporation research study. Princeton University Press · Zbl 0077.13605
[4] Bensmail, H.; Celeux, G., Regularized Gaussian Discriminant Analysis through Eigenvalue Decomposition, J. Amer. Statist. Assoc., 91, 436, 1743-1748 (1996) · Zbl 0885.62068
[5] Blum, A. L.; Langley, P., Selection of relevant features and examples in machine learning, Artificial Intelligence, 97, 1-2, 245-271 (1997) · Zbl 0904.68142
[6] Bommert, A.; Sun, X.; Bischl, B.; Rahnenführer, J.; Lang, M., Benchmark for filter methods for feature selection in high-dimensional classification data, Comput. Statist. Data Anal., 143, Article 106839 pp. (2020) · Zbl 07135552
[7] Bouveyron, C.; Brunet-Saumard, C., Model-based clustering of high-dimensional data: A review, Comput. Statist. Data Anal., 71, 52-78 (2014) · Zbl 1471.62032
[8] Bouveyron, C.; Celeux, G.; Murphy, T. B.; Raftery, A. E., Model-Based Clustering and Classification for Data Science, Vol. 50 (2019), Cambridge University Press
[9] Brenchley, J. M.; Hörchner, U.; Kalivas, J. H., Wavelength selection characterization for NIR spectra, Appl. Spectrosc., 51, 5, 689-699 (1997)
[10] Brown, P. J., Wavelength selection in multicomponent near-infrared calibration, J. Chemometr., 6, 3, 151-161 (1992)
[11] Cappozzo, A.; Greselin, F.; Murphy, T. B., A robust approach to model-based classification based on trimming and constraints, Adv. Data Anal. Classif., 14, 2, 327-354 (2020) · Zbl 1474.62215
[12] Celeux, G.; Govaert, G., Gaussian parsimonious clustering models, Pattern Recognit., 28, 5, 781-793 (1995)
[13] Celeux, G.; Maugis-Rabusseau, C.; Sedki, M., Variable selection in model-based clustering and discriminant analysis with a regularization approach, Adv. Data Anal. Classif., 13, 1, 259-278 (2019) · Zbl 1474.62216
[14] Cerioli, A.; Farcomeni, A.; Riani, M., Wild adaptive trimming for robust estimation and cluster analysis, Scand. J. Stat., 46, 1, 235-256 (2019) · Zbl 1417.62169
[15] Cerioli, A.; Riani, M.; Atkinson, A. C.; Corbellini, A., The power of monitoring: how to make the most of a contaminated multivariate sample, Stat. Methods Appl., 27, 4, 661-666 (2018) · Zbl 1428.62217
[16] Chang, W.-C., On using principal components before separating a mixture of two multivariate normal distributions, Appl. Stat., 32, 3, 267 (1983) · Zbl 0538.62050
[17] Chiang, L. H.; Pell, R. J., Genetic algorithms combined with discriminant analysis for key variable identification, J. Process Control, 14, 2, 143-155 (2004)
[18] Gusfield, D., Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (1997), Press Syndicate of the University of Cambridge: Cambridge, UK · Zbl 0934.68103
[19] Dash, M.; Liu, H., Feature selection for classification, Intell. Data Anal., 1, 1-4, 131-156 (1997)
[20] Dean, N.; Murphy, T. B.; Downey, G., Using unlabelled data to update classification rules with applications in food authenticity studies, J. R. Stat. Soc. Ser. C. Appl. Stat., 55, 1, 1-14 (2006) · Zbl 05188723
[21] Dotto, F.; Farcomeni, A.; García-Escudero, L. A.; Mayo-Iscar, A., A reweighting approach to robust clustering, Stat. Comput., 28, 2, 477-493 (2018) · Zbl 1384.62193
[22] Emerson, J. W.; Green, W. A.; Schloerke, B.; Crowley, J.; Cook, D.; Hofmann, H.; Wickham, H., The generalized pairs plot, J. Comput. Graph. Statist., 22, 1, 79-91 (2013)
[23] Farcomeni, A., Robust constrained clustering in presence of entry-wise outliers, Technometrics, 56, 1, 102-111 (2014)
[24] Fernández Pierna, J. A.; Dardenne, P., Chemometric contest at ‘Chimiométrie 2005’: A discrimination study, Chemometr. Intell. Lab. Syst., 86, 2, 219-223 (2007)
[25] Fernández Pierna, J. A.; Volery, P.; Besson, R.; Baeten, V.; Dardenne, P., Classification of modified starches by Fourier Transform Infrared spectroscopy using Support Vector Machines, J. Agricult. Food Chem., 53, 17, 6581-6585 (2005)
[26] Fop, M.; Murphy, T. B., Variable selection methods for model-based clustering, Stat. Surv., 12, 18-65 (2018) · Zbl 06875306
[27] Fraley, C.; Raftery, A. E., Model-based clustering, discriminant analysis, and density estimation, J. Amer. Statist. Assoc., 97, 458, 611-631 (2002) · Zbl 1073.62545
[28] Gallegos, M. T.; Ritter, G., A robust method for cluster analysis, Ann. Statist., 33, 1, 347-380 (2005) · Zbl 1064.62074
[29] García-Escudero, L. A.; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A., A general trimming approach to robust cluster Analysis, Ann. Statist., 36, 3, 1324-1345 (2008) · Zbl 1360.62328
[30] García-Escudero, L. A.; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A., Exploring the number of groups in robust model-based clustering, Stat. Comput., 21, 4, 585-599 (2011) · Zbl 1221.62093
[31] Gordaliza, A., Best approximations to random variables based on trimming procedures, J. Approx. Theory, 64, 2, 162-180 (1991) · Zbl 0745.41030
[32] Guyon, I.; Aliferis, C., Causal feature selection, (Computational Methods of Feature Selection (2007), Chapman and Hall/CRC), 79-102
[33] Hamming, R. W., Error detecting and error correcting codes, Bell Syst. Tech. J., 29, 2, 147-160 (1950) · Zbl 1402.94084
[34] Indahl, U.; Næs, T., A variable selection strategy for supervised classification with continuous spectroscopic data, J. Chemometr., 18, 2, 53-61 (2004)
[35] John, G. H.; Kohavi, R.; Pfleger, K., Irrelevant features and the subset selection problem, (Machine Learning Proceedings 1994 (1994), Elsevier), 121-129
[36] Kass, R. E., Bayes factors in practice, Statistician, 42, 5, 551 (1993)
[37] Kass, R. E.; Raftery, A. E., Bayes factors, J. Amer. Statist. Assoc., 90, 430, 773 (1995) · Zbl 0846.62028
[38] Kohavi, R.; John, G. H., Wrappers for feature subset selection, Artificial Intelligence, 97, 1-2, 273-324 (1997) · Zbl 0904.68143
[39] Krusińska, E.; Liebhart, J., Robust selection of the most discriminative variables in the dichotomous problem with application to some respiratory disease data, Biom. J., 30, 3, 295-303 (1988)
[40] Liu, H.; Motoda, H., Computational Methods of Feature Selection (2007), CRC Press · Zbl 1130.62118
[41] Mardia, K. V.; Kent, J. T.; Bibby, J. M., Multivariate Analysis (1979), Academic Press London: Academic Press London New York · Zbl 0432.62029
[42] Maugis, C.; Celeux, G.; Martin-Magniette, M.-L., Variable selection for clustering with Gaussian mixture models, Biometrics, 65, 3, 701-709 (2009) · Zbl 1172.62021
[43] Maugis, C.; Celeux, G.; Martin-Magniette, M. L., Variable selection in model-based clustering: A general variable role modeling, Comput. Statist. Data Anal., 53, 11, 3872-3882 (2009) · Zbl 1453.62154
[44] Maugis, C.; Celeux, G.; Martin-Magniette, M. L., Variable selection in model-based discriminant analysis, J. Multivariate Anal., 102, 10, 1374-1387 (2011) · Zbl 1219.62103
[45] McLachlan, G. J., Discriminant Analysis and Statistical Pattern Recognition, Wiley Series in Probability and Statistics, vol. 544 (1992), John Wiley & Sons: Hoboken, NJ, USA · Zbl 0850.62481
[46] Murphy, T. B.; Dean, N.; Raftery, A. E., Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications, Ann. Appl. Stat., 4, 1, 396-421 (2010) · Zbl 1189.62105
[47] Neykov, N.; Filzmoser, P.; Dimova, R.; Neytchev, P., Robust fitting of mixtures using the trimmed likelihood estimator, Comput. Statist. Data Anal., 52, 1, 299-308 (2007) · Zbl 1328.62033
[48] Pacheco, J.; Casado, S.; Núñez, L.; Gómez, O., Analysis of new variable selection methods for discriminant analysis, Comput. Statist. Data Anal., 51, 3, 1463-1478 (2006) · Zbl 1157.62442
[49] Raftery, A. E.; Dean, N., Variable selection for model-based clustering, J. Amer. Statist. Assoc., 101, 473, 168-178 (2006) · Zbl 1118.62339
[50] Raftery, A.; Hoeting, J.; Volinsky, C.; Painter, I.; Yeung, K. Y., BMA: Bayesian Model Averaging (2018)
[51] Rand, W. M., Objective criteria for the evaluation of clustering methods, J. Amer. Statist. Assoc., 66, 336, 846 (1971)
[52] Reid, L. M.; O’Donnell, C. P.; Downey, G., Recent technological advances for the determination of food authenticity, Trends Food Sci. Technol., 17, 7, 344-353 (2006)
[53] Riani, M.; Atkinson, A. C.; Cerioli, A.; Corbellini, A., Efficient robust methods via monitoring for clustering and multivariate data analysis, Pattern Recognit., 88, 246-260 (2019)
[54] Ritter, G., Robust Cluster Analysis and Variable Selection (2014), Chapman and Hall/CRC
[55] Rousseeuw, P. J., Least median of squares regression, J. Amer. Statist. Assoc., 79, 388, 871-880 (1984) · Zbl 0547.62046
[56] Rousseeuw, P. J.; Van den Bossche, W., Detecting deviating data cells, Technometrics, 60, 2, 135-145 (2018)
[57] Rousseeuw, P. J.; Van Driessen, K., A fast algorithm for the minimum covariance determinant estimator, Technometrics, 41, 3, 212-223 (1999)
[58] Saeys, Y.; Inza, I.; Larranaga, P., A review of feature selection techniques in bioinformatics, Bioinformatics, 23, 19, 2507-2517 (2007)
[59] Schwarz, G., Estimating the dimension of a model, Ann. Statist., 6, 2, 461-464 (1978) · Zbl 0379.62005
[60] Scrucca, L.; Fop, M.; Murphy, T. B.; Raftery, A. E., mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models, R J., 8, 1, 289-317 (2016)
[61] Scrucca, L.; Raftery, A. E., clustvarsel: A package implementing variable selection for Gaussian model-based clustering in R, J. Stat. Softw., 84, 1 (2018)
[62] Strehl, A.; Ghosh, J., Cluster ensembles—a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res., 3, Dec, 583-617 (2002) · Zbl 1084.68759
[63] Todorov, V., Robust selection of variables in linear discriminant analysis, Stat. Methods Appl., 15, 3, 395-407 (2007) · Zbl 1181.62096
[64] Wolters, M. A., A genetic algorithm for selection of fixed-size subsets with application to design problems, J. Stat. Softw., 68, Code Snippet 1 (2015)
[65] Yu, L., Feature selection for genomic data analysis, (Computational Methods of Feature Selection (2008)), 337-353
[66] Yu, L.; Liu, H., Efficient feature selection via analysis of relevance and redundancy, J. Mach. Learn. Res., 5, Oct, 1205-1224 (2004) · Zbl 1222.68340
[67] Zhu, X.; Wu, X., Class noise vs. attribute noise: A quantitative study, Artif. Intell. Rev., 22, 3, 177-210 (2004) · Zbl 1069.68587