Robust archetypoids for anomaly detection in big functional data. (English) Zbl 07363880

Summary: Archetypoid analysis (ADA) has proven to be a successful unsupervised statistical technique for identifying extreme observations on the periphery of the data cloud, both in classical multivariate data and in functional data. However, two questions remain open in this field: the use of ADA for outlier detection and its scalability. We propose to use robust functional archetypoids and the adjusted boxplot to pinpoint functional outliers. Furthermore, we present a new archetypoid algorithm for obtaining results from large data sets in reasonable time. Functional time series occur in many practical problems, so this paper focuses on functional data settings. The new algorithm for detecting functional anomalies, called CRO-FADALARA, can be used with both univariate and multivariate curves. Our proposal for outlier detection is compared with all the state-of-the-art methods in a controlled study, showing good performance. Furthermore, CRO-FADALARA is applied to two large time series data sets, where the outlier curves are discussed and the reduction in computational time is clearly stated. A third case study with a small ECG data set is discussed, given its importance in functional data scenarios. All data, R code and a new R package are freely available.

MSC:

62P30 Applications of statistics in engineering and industry; control charts

References:

[1] Alcacer, A.; Epifanio, I.; Ibáñez, M.; Simó, A.; Ballester, A., A data-driven classification of 3D foot types by archetypal shapes based on landmarks, PLoS ONE, 15, 1, e0228016 (2020) · doi:10.1371/journal.pone.0228016
[2] Arribas-Gil, A.; Romo, J., Shape outlier detection and visualization for functional data: the outliergram, Biostatistics, 15, 4, 603-619 (2014) · doi:10.1093/biostatistics/kxu006
[3] Azcorra, A.; Chiroque, L.; Cuevas, R.; Fernández Anta, A.; Laniado, H.; Lillo, R.; Romo, J.; Sguera, C., Unsupervised scalable statistical method for identifying influential users in online social networks, Sci Rep, 8, 1-7 (2018) · doi:10.1038/s41598-018-24874-2
[4] Bagnall A, Lines J, Vickers W, Keogh E (2018) The UEA & UCR time series classification repository. www.timeseriesclassification.com
[5] Beaton, A.; Tukey, J., The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data, Technometrics, 16, 2, 147-185 (1974) · Zbl 0282.62057 · doi:10.1080/00401706.1974.10489171
[6] Cabero, I.; Epifanio, I., Archetypal analysis: an alternative to clustering for unsupervised texture segmentation, Image Anal Stereol, 38, 151-160 (2019) · Zbl 1419.94003 · doi:10.5566/ias.2052
[7] Cabero I, Epifanio I (2020) Finding archetypal patterns for binary questionnaires. SORT 44(1) (in press). arXiv:2003.00043 · Zbl 1442.62142
[8] Chang W, Cheng J, Allaire JJ, Xie Y, McPherson J (2017) Shiny: web application framework for R. https://CRAN.R-project.org/package=shiny. R package version 1.0.5
[9] Chen Y, Mairal J, Harchaoui Z (2014) Fast and robust archetypal analysis for representation learning. In: CVPR 2014—IEEE conference on computer vision and pattern recognition, pp 1478-1485. doi:10.1109/CVPR.2014.192
[10] Cutler, A.; Breiman, L., Archetypal analysis, Technometrics, 36, 4, 338-347 (1994) · Zbl 0804.62002 · doi:10.2307/1269949
[11] D’Orazio M (2018) univOutl: detection of univariate outliers. https://CRAN.R-project.org/package=univOutl. R package version 0.1-4
[12] Dua D, Karra-Taniskidou E (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml
[13] Epifanio, I., Functional archetype and archetypoid analysis, Comput Stat Data Anal, 104, 24-34 (2016) · Zbl 1466.62062 · doi:10.1016/j.csda.2016.06.007
[14] Epifanio, I.; Ibáñez, M.; Simó, A., Archetypal shapes based on landmarks and extension to handle missing data, Adv Data Anal Classif, 12, 705-735 (2018) · Zbl 1416.62326 · doi:10.1007/s11634-017-0297-7
[15] Epifanio, I.; Ibáñez, M.; Simó, A., Archetypal analysis with missing data: see all samples by looking at a few based on extreme profiles, Am Stat, 72, 169-183 (2020) · Zbl 07593671 · doi:10.1080/00031305.2018.1545700
[16] Eugster, M.; Leisch, F., Weighted and robust archetypal analysis, Comput Stat Data Anal, 55, 1215-1225 (2011) · Zbl 1328.65027 · doi:10.1016/j.csda.2010.10.017
[17] Febrero, M.; Galeano, P.; González-Manteiga, W., A functional analysis of \(NO_x\) levels: location and scale estimation and outlier detection, Comput Stat, 22, 3, 411-427 (2007) · Zbl 1197.62154 · doi:10.1007/s00180-007-0048-x
[18] Febrero, M.; Galeano, P.; González-Manteiga, W., Outlier detection in functional data by depth measures, with application to identify abnormal \(NO_x\) levels, Environmetrics, 19, 331-345 (2008) · doi:10.1002/env.878
[19] Febrero-Bande, M.; Oviedo de la Fuente, M., Statistical computing in functional data analysis: the R package fda.usc, J Stat Softw, 51, 4, 1-28 (2012) · doi:10.18637/jss.v051.i04
[20] Fraiman, R.; Svarc, M., Resistant estimates for high dimensional and functional data based on random projections, Comput Stat Data Anal, 58, 326-338 (2013) · Zbl 1365.62200 · doi:10.1016/j.csda.2012.09.006
[21] Hubert, M.; Rousseeuw, P.; Segaert, P., Multivariate functional outlier detection, Stat Methods Appl, 24, 2, 177-202 (2015) · Zbl 1441.62124 · doi:10.1007/s10260-015-0297-8
[22] Hubert, M.; Rousseeuw, P.; Segaert, P., Multivariate and functional classification using depth and distance, Adv Data Anal Classif, 11, 445-466 (2017) · Zbl 1414.62247 · doi:10.1007/s11634-016-0269-3
[23] Hyndman, R.; Shahid Ullah, M., Robust forecasting of mortality and fertility rates: a functional data approach, Comput Stat Data Anal, 51, 10, 4942-4956 (2007) · Zbl 1162.62434 · doi:10.1016/j.csda.2006.07.028
[24] Hubert, M.; Vandervieren, E., An adjusted boxplot for skewed distributions, Comput Stat Data Anal, 52, 5186-5201 (2008) · Zbl 1452.62074 · doi:10.1016/j.csda.2007.11.008
[25] Hyndman, R., Rainbow plots, bagplots, and boxplots for functional data, J Comput Graph Stat, 19, 1, 29-45 (2010) · doi:10.1198/jcgs.2009.08158
[26] Kaufman, L.; Rousseeuw, P., Finding groups in data, an introduction to cluster analysis (1990), New York: Wiley, New York · Zbl 1345.62009
[27] Mair S, Boubekki A, Brefeld U (2017) Frame-based data factorizations. In: Proceedings of the 34th international conference on machine learning, Sydney, Australia, pp 2305-2313. http://proceedings.mlr.press/v70/mair17a/mair17a.pdf
[28] Millán-Roures, L.; Epifanio, I.; Martínez, V., Detection of anomalies in water networks by functional data analysis, Math Probl Eng, 2018, 1-14 (2018) · doi:10.1155/2018/5129735
[29] Moliner, J.; Epifanio, I., Robust multivariate and functional archetypal analysis with application to financial time series analysis, Physica A Stat Mech Appl, 519, 195-208 (2019) · Zbl 1514.62080 · doi:10.1016/j.physa.2018.12.036
[30] Ooi H, Microsoft Corporation, Weston S, Tenenbaum D (2017) doParallel: foreach parallel adaptor for the ‘parallel’ package. https://CRAN.R-project.org/package=doParallel. R package version 1.0.11
[31] R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
[32] Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: SIGMOD ’00 proceedings of the 2000 ACM SIGMOD international conference on Management of data, pp 427-438. doi:10.1145/342009.335437
[33] Ramsay, JO; Silverman, B., Functional data analysis (2005), Berlin: Springer, Berlin · Zbl 1079.62006 · doi:10.1007/b98888
[34] Ramsay, JO; Hooker, G.; Graves, S., Functional data analysis with R and MATLAB (2009), Berlin: Springer, Berlin · Zbl 1179.62006 · doi:10.1007/978-0-387-98185-7
[35] Ramsay JO, Wickham H, Graves S, Hooker G (2017) fda: functional data analysis. R package version 2.4.7, https://CRAN.R-project.org/package=fda
[36] Rebbapragada, U.; Protopapas, P.; Brodley, C.; Alcock, C., Finding anomalous periodic time series. An application to catalogs of periodic variable stars, Mach Learn (2009) · Zbl 1470.68162 · doi:10.1007/s10994-008-5093-3
[37] Rodríguez-Luján, I.; Fonollosa, J.; Vergara, A.; Homer, M.; Huerta, R., On the calibration of sensor arrays for pattern recognition using the minimal number of experiments, Chemom Intell Lab Syst, 130, 123-134 (2014) · doi:10.1016/j.chemolab.2013.10.012
[38] Rousseeuw, P.; Leroy, A., Robust regression and outlier detection (1987), New York: Wiley, New York · Zbl 0711.62030 · doi:10.1002/0471725382
[39] Segaert P, Hubert M, Rousseeuw P, Raymaekers J (2017) mrfDepth: depth measures in multivariate, regression and functional settings. R package version 1.0.6. https://CRAN.R-project.org/package=mrfDepth
[40] Shang HL, Hyndman RJ (2016) rainbow: Rainbow Plots, Bagplots and Boxplots for functional data. R package version 3.4. https://CRAN.R-project.org/package=rainbow
[41] Sinova, B.; González Rodríguez, G.; Van Aelst, S., M-estimators of location for functional data, Bernoulli, 24, 3, 2328-2357 (2018) · Zbl 1440.62405 · doi:10.3150/17-BEJ929
[42] Sun, Y.; Genton, M., Functional boxplots, J Comput Graph Stat, 20, 2, 316-334 (2011) · doi:10.1198/jcgs.2011.09224
[43] Sun, W.; Yang, G.; Wu, K.; Li, W.; Zhang, D., Pure endmember extraction using robust kernel archetypoid analysis for hyperspectral imagery, ISPRS J Photogr Remote Sens, 131, 147-159 (2017) · doi:10.1016/j.isprsjprs.2017.08.001
[44] Tarabelloni N, Arribas-Gil A, Ieva F, Paganoni AM, Romo J (2018) roahd: robust analysis of high dimensional data. R package version 1.4, https://CRAN.R-project.org/package=roahd
[45] Vergara, A.; Vembu, S.; Ayhan, T.; Ryan, M.; Homer, M.; Huerta, R., Chemical gas sensor drift compensation using classifier ensembles, Sens Actuators B Chem, 166, 320-329 (2012) · doi:10.1016/j.snb.2012.01.074
[46] Vinué, G.; Epifanio, I.; Alemany, S., Archetypoids: a new approach to define representative archetypal data, Comput Stat Data Anal, 87, 102-115 (2015) · Zbl 1468.62203 · doi:10.1016/j.csda.2015.01.018
[47] Vinué, G.; Epifanio, I., Archetypoid analysis for sports analytics, Data Min Knowl Discov, 31, 6, 1643-1677 (2017) · doi:10.1007/s10618-017-0514-1
[48] Vinué, G., Anthropometry: an R package for analysis of anthropometric data, J Stat Softw, 77, 6, 1-39 (2017) · doi:10.18637/jss.v077.i06
[49] Vinué, G.; Epifanio, I., Forecasting basketball players’ performance using sparse functional data, Stat Anal Data Min ASA Data Sci J, 12, 6, 534-547 (2019) · Zbl 07260658 · doi:10.1002/sam.11436
[50] Young, D., tolerance: An R package for estimating tolerance intervals, J Stat Softw, 36, 5, 1-39 (2010) · doi:10.18637/jss.v036.i05
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.