Anomaly and novelty detection for robust semi-supervised learning. (English) Zbl 1461.62015

Summary: Three important issues are often encountered in Supervised and Semi-Supervised Classification: class memberships are unreliable for some training units (label noise), a proportion of observations might depart from the main structure of the data (outliers), and new groups in the test set may not have been encountered earlier in the learning phase (unobserved classes). The present work introduces a robust and adaptive Discriminant Analysis rule capable of handling situations in which one or more of the aforementioned problems occur. Two EM-based classifiers are proposed: the first jointly exploits the training and test sets (transductive approach), and the second expands the parameter estimation using the test set to complete the group structure learned from the training set (inductive approach). Experiments on synthetic and real data, artificially adulterated, are provided to underline the benefits of the proposed method.


62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)


[1] Aitken, AC, A series formula for the roots of algebraic and transcendental equations, Proc. R. Soc. Edinb., 45, 1, 14-22 (1926) · JFM 51.0096.03
[2] Akaike, H., A new look at the statistical model identification, IEEE Trans. Autom. Control, 19, 6, 716-723 (1974) · Zbl 0314.62039
[3] Banfield, JD; Raftery, AE, Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 3, 803 (1993) · Zbl 0794.62034
[4] Bensmail, H.; Celeux, G., Regularized Gaussian discriminant analysis through eigenvalue decomposition, J. Am. Stat. Assoc., 91, 436, 1743-1748 (1996) · Zbl 0885.62068
[5] Biernacki, C., Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures for grouped data and behaviour of the EM algorithm, Scand. J. Stat., 34, 3, 569-586 (2007) · Zbl 1150.62010
[6] Böhning, D.; Dietz, E.; Schaub, R.; Schlattmann, P.; Lindsay, BG, The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family, Ann. Inst. Stat. Math., 46, 2, 373-388 (1994) · Zbl 0802.62017
[7] Bokulich, NA; Thorngate, JH; Richardson, PM; Mills, DA, Microbial biogeography of wine grapes is conditioned by cultivar, vintage, and climate, Proc. National Acad. Sci., 111, 1, E139-E148 (2014)
[8] Bokulich, NA; Collins, T.; Masarweh, C.; Allen, G.; Heymann, H.; Ebeler, SE; Mills, DA, Fermentation behavior suggest microbial contribution to regional, MBio, 7, 3, 1-12 (2016)
[9] Bolyen, E.; Rideout, JR; Dillon, MR; Bokulich, NA; Abnet, CC; Al-Ghalith, GA; Alexander, H.; Alm, EJ; Arumugam, M.; Asnicar, F.; Bai, Y.; Bisanz, JE; Bittinger, K.; Brejnrod, A.; Brislawn, CJ; Brown, CT; Callahan, BJ; Caraballo-Rodríguez, AM; Chase, J.; Cope, EK; Da Silva, R.; Diener, C.; Dorrestein, PC; Douglas, GM; Durall, DM; Duvallet, C.; Edwardson, CF; Ernst, M.; Estaki, M.; Fouquier, J.; Gauglitz, JM; Gibbons, SM; Gibson, DL; Gonzalez, A.; Gorlick, K.; Guo, J.; Hillmann, B.; Holmes, S.; Holste, H.; Huttenhower, C.; Huttley, GA; Janssen, S.; Jarmusch, AK; Jiang, L.; Kaehler, BD; Kang, KB; Keefe, CR; Keim, P.; Kelley, ST; Knights, D.; Koester, I.; Kosciolek, T.; Kreps, J.; Langille, MG; Lee, J.; Ley, R.; Liu, YX; Loftfield, E.; Lozupone, C.; Maher, M.; Marotz, C.; Martin, BD; McDonald, D.; McIver, LJ; Melnik, AV; Metcalf, JL; Morgan, SC; Morton, JT; Naimey, AT; Navas-Molina, JA; Nothias, LF; Orchanian, SB; Pearson, T.; Peoples, SL; Petras, D.; Preuss, ML; Pruesse, E.; Rasmussen, LB; Rivers, A.; Robeson, MS; Rosenthal, P.; Segata, N.; Shaffer, M.; Shiffer, A.; Sinha, R.; Song, SJ; Spear, JR; Swafford, AD; Thompson, LR; Torres, PJ; Trinh, P.; Tripathi, A.; Turnbaugh, PJ; Ul-Hasan, S.; van der Hooft, JJ; Vargas, F.; Vázquez-Baeza, Y.; Vogtmann, E.; von Hippel, M.; Walters, W.; Wan, Y.; Wang, M.; Warren, J.; Weber, KC; Williamson, CH; Willis, AD; Xu, ZZ; Zaneveld, JR; Zhang, Y.; Zhu, Q.; Knight, R.; Caporaso, JG, Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2, Nat. Biotechnol., 37, 8, 852-857 (2019)
[10] Bouveyron, C., Adaptive mixture discriminant analysis for supervised learning with unobserved classes, J. Classif., 31, 1, 49-84 (2014) · Zbl 1360.62315
[11] Bouveyron, C.; Girard, S., Robust supervised classification with mixture models: learning from data with uncertain labels, Pattern Recognit., 42, 11, 2649-2658 (2009) · Zbl 1175.68313
[12] Calle, ML, Statistical Analysis of Metagenomics Data, Genom. Inform., 17, 1, e6 (2019)
[13] Cappozzo, A.; Greselin, F.; Murphy, TB, A robust approach to model-based classification based on trimming and constraints, Adv. Data Anal. Classif. (2019) · Zbl 1436.62245
[14] Celeux, G.; Govaert, G., Gaussian parsimonious clustering models, Pattern Recognit., 28, 5, 781-793 (1995)
[15] Cerioli, A.; García-Escudero, LA; Mayo-Iscar, A.; Riani, M., Finding the number of normal groups in model-based clustering via constrained likelihoods, J. Comput. Graph. Stat., 27, 2, 404-416 (2018)
[16] Cerioli, A.; Farcomeni, A.; Riani, M., Wild adaptive trimming for robust estimation and cluster analysis, Scand. J. Stat., 46, 1, 235-256 (2019) · Zbl 1417.62169
[17] Chandola, V.; Banerjee, A.; Kumar, V., Anomaly detection, ACM Comput. Surv., 41, 3, 1-58 (2009)
[18] Chiquet, J.; Mariadassou, M.; Robin, S., Variational inference for probabilistic Poisson PCA, Ann. Appl. Stat., 12, 4, 2674-2698 (2018) · Zbl 1412.62194
[19] Coretto, P.; Hennig, C., Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for Robust Gaussian clustering, J. Am. Stat. Assoc., 111, 516, 1648-1659 (2016)
[20] Day, NE, Estimating the components of a mixture of normal distributions, Biometrika, 56, 3, 463-474 (1969) · Zbl 0183.48106
[21] Dean, N.; Murphy, TB; Downey, G., Using unlabelled data to update classification rules with applications in food authenticity studies, J. R. Stat. Soc. Ser. C Appl. Stat., 55, 1, 1-14 (2006) · Zbl 05188723
[22] Dempster, A.; Laird, N.; Rubin, D., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B, 39, 1, 1-38 (1977) · Zbl 0364.62022
[23] Evangelista, PF; Embrechts, MJ; Szymanski, BK, Taming the curse of dimensionality in kernels and novelty detection, Adv. Soft Comput., 34, 425-438 (2006)
[24] Fop, M.; Mattei, PA; Murphy, TB; Bouveyron, C., Unobserved classes and extra variables in high-dimensional discriminant analysis, in: CASI 2018 Conference proceeding, 70-72 (2018)
[25] Fraley, C.; Raftery, AE, Model-based clustering, discriminant analysis, and density estimation, J. Am. Stat. Assoc., 97, 458, 611-631 (2002) · Zbl 1073.62545
[26] Gallegos, MT; Ritter, G., Using combinatorial optimization in model-based trimmed clustering with cardinality constraints, Comput. Stat. Data Anal., 54, 3, 637-654 (2010) · Zbl 1464.62075
[27] García-Escudero, L.; Gordaliza, A.; Mayo-Iscar, A.; San Martín, R., Robust clusterwise linear regression through trimming, Comput. Stat. Data Anal., 54, 12, 3057-3069 (2010) · Zbl 1284.62198
[28] García-Escudero, LA; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A., A general trimming approach to robust cluster analysis, Ann. Stat., 36, 3, 1324-1345 (2008) · Zbl 1360.62328
[29] García-Escudero, LA; Gordaliza, A.; Mayo-Iscar, A., A constrained robust proposal for mixture modeling avoiding spurious solutions, Adv. Data Anal. Classif., 8, 1, 27-43 (2014) · Zbl 1459.62110
[30] García-Escudero, LA; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A., Avoiding spurious local maximizers in mixture modeling, Stat. Comput., 25, 3, 619-633 (2015) · Zbl 1331.62100
[31] García-Escudero, LA; Gordaliza, A.; Greselin, F.; Ingrassia, S.; Mayo-Iscar, A., The joint role of trimming and constraints in robust estimation for mixtures of Gaussian factor analyzers, Comput. Stat. Data Anal., 99, 131-147 (2016) · Zbl 1468.62060
[32] García-Escudero, LA; Gordaliza, A.; Greselin, F.; Ingrassia, S.; Mayo-Iscar, A., Robust estimation of mixtures of regressions with random covariates, via trimming and constraints, Stat. Comput., 27, 2, 377-402 (2017) · Zbl 06697663
[33] García-Escudero, LA; Gordaliza, A.; Greselin, F.; Ingrassia, S.; Mayo-Iscar, A., Eigenvalues and constraints in mixture modeling: geometric and computational issues, Adv. Data Anal. Classif., 12, 2, 203-233 (2018) · Zbl 1414.62071
[34] García-Escudero, LA; Gordaliza, A.; Matrán, C.; Mayo-Iscar, A., Comments on “The power of monitoring: how to make the most of a contaminated multivariate sample”, Stat. Methods Appl., 27, 4, 661-666 (2018) · Zbl 1428.62226
[35] Gordaliza, A., Best approximations to random variables based on trimming procedures, J. Approx. Theory, 64, 2, 162-180 (1991) · Zbl 0745.41030
[36] Greco, L.; Agostinelli, C., Weighted likelihood mixture modeling and model-based clustering, Stat. Comput. (2019) · Zbl 1436.62255
[37] Greselin, F.; Punzo, A., Closed likelihood ratio testing procedures to assess similarity of covariance matrices, Am. Stat., 67, 3, 117-128 (2013)
[38] Hawkins, DM; McLachlan, GJ, High-breakdown linear discriminant analysis, J. Am. Stat. Assoc., 92, 437, 136 (1997) · Zbl 0889.62052
[39] Hawkins, DM; Liu, L.; Young, SS, Robust singular value decomposition, National Institute of Statistical Science Technical Report 122 (2001)
[40] Hickey, RJ, Noise modelling and evaluating learning from examples, Artif. Intell., 82, 1-2, 157-179 (1996)
[41] Hubert, M.; Rousseeuw, PJ; Vanden Branden, K., ROBPCA: a new approach to robust principal component analysis, Technometrics, 47, 1, 64-79 (2005)
[42] Ingrassia, S., A likelihood-based constrained algorithm for multivariate normal mixture models, Stat. Methods Appl., 13, 2, 151-166 (2004) · Zbl 1205.62066
[43] Ingrassia, S.; Rocci, R., Degeneracy of the EM algorithm for the MLE of multivariate Gaussian mixtures and dynamic constraints, Comput. Stat. Data Anal., 55, 4, 1715-1725 (2011) · Zbl 1328.65030
[44] Kasabov, N.; Pang, S., Transductive support vector machines and applications in bioinformatics for promoter recognition, in: International Conference on Neural Networks and Signal Processing, 2003. Proceedings of the 2003, IEEE, vol. 1, 1-6 (2003). doi:10.1109/ICNNSP.2003.1279199, http://ieeexplore.ieee.org/document/1279199/
[45] Li, M.; Xiang, S.; Yao, W., Robust estimation of the number of components for mixtures of linear regression models, Comput. Stat., 31, 4, 1539-1555 (2016) · Zbl 1348.65032
[46] Markou, M.; Singh, S., Novelty detection: a review-part 1: statistical approaches, Signal Process., 83, 12, 2481-2497 (2003) · Zbl 1145.94402
[47] McLachlan, GJ; Rathnayake, S., On the number of components in a Gaussian mixture model, Wiley Interdiscip. Rev. Data Min. Knowl. Discov., 4, 5, 341-355 (2014)
[48] McNicholas, P.; Murphy, T.; McDaid, A.; Frost, D., Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models, Comput. Stat. Data Anal., 54, 3, 711-723 (2010) · Zbl 1464.62131
[49] Mezzasalma, V.; Sandionigi, A.; Bruni, I.; Bruno, A.; Lovicu, G.; Casiraghi, M.; Labra, M., Grape microbiome as a reliable and persistent signature of field origin and environmental conditions in Cannonau wine production, PLOS ONE, 12, 9, e0184615 (2017)
[50] Mezzasalma, V.; Sandionigi, A.; Guzzetti, L.; Galimberti, A.; Grando, MS; Tardaguila, J.; Labra, M., Geographical and cultivar features differentiate grape microbiota in northern Italy and Spain Vineyards, Front. Microbiol., 9, MAY, 1-13 (2018)
[51] Mitchell, TM, Machine Learning (1997), New York: McGraw-Hill Inc · Zbl 0913.68167
[52] Neykov, NM; Filzmoser, P.; Dimova, RI; Neytchev, PN, Robust fitting of mixtures using the trimmed likelihood estimator, Comput. Stat. Data Anal., 52, 1, 299-308 (2007) · Zbl 1328.62033
[53] Nguyen, MH; de la Torre, F., Optimal feature selection for support vector machines, Pattern Recognit., 43, 3, 584-591 (2010) · Zbl 1187.68411
[54] Peel, D.; McLachlan, GJ, Robust mixture modelling using the t distribution, Stat. Comput., 10, 4, 339-348 (2000)
[55] Quiñonero-Candela, J.; Sugiyama, M.; Schwaighofer, A.; Lawrence, ND, Dataset Shift in Machine Learning (2009), Cambridge: The MIT Press
[56] R Core Team, R: A Language and Environment for Statistical Computing (2018). https://www.r-project.org/
[57] Rand, WM, Objective criteria for the evaluation of clustering methods, J. Am. Stat. Assoc., 66, 336, 846 (1971)
[58] Rousseeuw, PJ; Driessen, KV, A fast algorithm for the minimum covariance determinant estimator, Technometrics, 41, 3, 212-223 (1999)
[59] Schölkopf, B.; Williamson, R.; Smola, A.; Shawe-Taylor, J.; Platt, J., Support vector method for novelty detection, Adv. Neural Inf. Process. Syst., 12, 582-588 (2000)
[60] Schwarz, G., Estimating the dimension of a model, Ann. Stat., 6, 2, 461-464 (1978) · Zbl 0379.62005
[61] Pang, S.; Kasabov, N., Inductive vs transductive inference, global vs local models: SVM, TSVM, and SVMT for gene expression classification problems, in: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541), IEEE, vol. 2, 1197-1202 (2004). doi:10.1109/IJCNN.2004.1380112, http://ieeexplore.ieee.org/document/1380112/
[62] Tax, DMJ; Duin, RPW, Outlier detection using classifier instability, in: Amin, A.; Dori, D.; Pudil, P.; Freeman, H. (eds.), Advances in Pattern Recognition, 593-601 (1998), Berlin: Springer
[63] Todorov, V.; Filzmoser, P., An object-oriented framework for robust multivariate analysis, J. Stat. Softw., 32, 3, 1-47 (2009). doi:10.18637/jss.v032.i03
[64] Vanden Branden, K.; Hubert, M., Robust classification in high dimensions based on the SIMCA Method, Chemom. Intell. Lab. Syst., 79, 1-2, 10-21 (2005)
[65] Vapnik, VN, The Nature of Statistical Learning Theory (2000), New York: Springer · Zbl 0934.62009
[66] Waldron, L., Data and statistical methods to analyze the human microbiome, mSystems, 3, 2, 1-4 (2018)
[67] Zhu, X.; Wu, X., Class noise vs. attribute noise: a quantitative study, Artif. Intell. Rev., 22, 3, 177-210 (2004) · Zbl 1069.68587
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.