
Robust probabilistic PCA with missing data and contribution analysis for outlier detection. (English) Zbl 1453.62067

Summary: Principal component analysis (PCA) is a widely adopted multivariate data analysis technique that can be interpreted either as a classical linear projection or through a probability model, i.e. probabilistic PCA (PPCA). Robust PPCA models based on the multivariate \(t\)-distribution have recently been proposed to handle data sets that may contain outliers. This paper presents an overview of the robust PPCA technique and further discusses the issue of missing data. An expectation-maximization (EM) algorithm is presented for maximum likelihood estimation of the model parameters in the presence of missing data. For outlier detection with robust PPCA, a contribution analysis method is proposed to identify which variables contribute most to the occurrence of outliers, providing valuable information about the source of the outlying data. The proposed technique is demonstrated on numerical examples and applied to outlier detection and diagnosis in an industrial fermentation process.
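
The robustness mechanism mentioned in the summary can be illustrated with a much simpler model than the paper's. The sketch below is not the robust PPCA algorithm of the paper: it has no latent factors and no missing-data handling, and it fits only the location and scatter of a multivariate \(t\)-distribution by EM with the degrees of freedom held fixed. The function name `fit_multivariate_t`, the default `nu=4.0`, and the synthetic data are assumptions made for illustration. The point it shows is that the E-step assigns each observation a weight that shrinks as its Mahalanobis distance grows, so outliers have less influence on the estimates.

```python
import numpy as np


def fit_multivariate_t(X, nu=4.0, n_iter=200, tol=1e-8):
    """EM estimate of location and scatter of a multivariate t (nu fixed).

    Illustrative helper, not taken from the paper. Returns (mu, sigma, u),
    where u holds the per-observation weights from the last E-step.
    """
    n, d = X.shape
    mu = X.mean(axis=0)
    sigma = np.atleast_2d(np.cov(X, rowvar=False))
    u = np.ones(n)
    for _ in range(n_iter):
        # E-step: squared Mahalanobis distances and expected weights.
        diff = X - mu
        delta = np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)
        u = (nu + d) / (nu + delta)
        # M-step: weighted mean and weighted scatter matrix.
        mu_new = (u[:, None] * X).sum(axis=0) / u.sum()
        diff = X - mu_new
        sigma = (u[:, None] * diff).T @ diff / n
        shift = np.max(np.abs(mu_new - mu))
        mu = mu_new
        if shift < tol:
            break
    return mu, sigma, u


# Synthetic example (assumed data): 200 standard-normal points plus
# 10 points shifted far from the bulk.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 3))
outliers = rng.normal(loc=8.0, size=(10, 3))
X = np.vstack([inliers, outliers])

mu, sigma, u = fit_multivariate_t(X)
print("sample mean:       ", X.mean(axis=0))
print("t-based location:  ", mu)
print("mean inlier weight: ", u[:200].mean())
print("mean outlier weight:", u[200:].mean())
```

Observations with small weights are natural outlier candidates. The paper goes further by asking which variables are responsible for a flagged observation, which this sketch does not attempt.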

MSC:

62-08 Computational methods for problems pertaining to statistics
62F35 Robustness and adaptive procedures (parametric inference)
62H25 Factor analysis and principal components; correspondence analysis

Software:

BayesDA; ROBPCA

References:

[1] Archambeau, C.; Delannay, N.; Verleysen, M., Robust probabilistic projection, in: Proc. 23rd International Conference on Machine Learning, Pittsburgh, USA (2006)
[2] Atkinson, A. C.; Riani, M.; Cerioli, A., Exploring Multivariate Data with the Forward Search (2004), Springer-Verlag: New York · Zbl 1049.62057
[3] Barnett, V.; Lewis, T., Outliers in Statistical Data (1994), John Wiley: New York · Zbl 0801.62001
[4] Basabe, X. L., Towards improved fermentation consistency using multivariate analysis of process data, Master’s Thesis, University of Newcastle upon Tyne, UK (2004)
[5] Campbell, N. A., Robust procedures in multivariate analysis, Applied Statistics, 29, 231-237 (1980) · Zbl 0471.62047
[6] Chen, T.; Morris, J.; Martin, E., Probability density estimation via an infinite Gaussian mixture model: Application to statistical process monitoring, Journal of the Royal Statistical Society C (Applied Statistics), 55, 699-715 (2006) · Zbl 1109.62123
[7] Chen, T.; Morris, J.; Martin, E., Dynamic data rectification using particle filters, Computers and Chemical Engineering, 32, 451-462 (2008)
[8] Chen, T.; Sun, Y., Probabilistic contribution analysis for statistical process monitoring: A missing variable approach, Control Engineering Practice, 17, 469-477 (2009)
[9] Croux, C.; Haesbroeck, G., Principal components analysis based on robust estimators of the covariance or correlation matrix: Influence functions and efficiencies, Biometrika, 87, 603-618 (2000) · Zbl 0956.62047
[10] Daszykowski, M.; Kaczmarek, K.; Heyden, Y. V.; Walczak, B., Robust statistics in data analysis - A review: Basic concepts, Chemometrics and Intelligent Laboratory Systems, 85, 203-219 (2007)
[11] Dempster, A. P.; Laird, N. M.; Rubin, D. B., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society B, 39, 1-38 (1977) · Zbl 0364.62022
[12] Devlin, S. J.; Gnanadesikan, R.; Kettenring, J. R., Robust estimation of dispersion matrices and principal components, Journal of the American Statistical Association, 12, 136-154 (1981) · Zbl 0463.62031
[13] Dunia, R.; Qin, S.; Edgar, T.; McAvoy, T., Identification of faulty sensors using PCA, AIChE Journal, 42, 2797-2812 (1996)
[14] Fang, Y.; Jeong, M. K., Robust probabilistic multivariate calibration model, Technometrics, 50, 305-316 (2008)
[15] Gelman, A. B.; Carlin, J. S.; Stern, H. S.; Rubin, D. B., Bayesian Data Analysis (1995), Chapman & Hall/CRC
[16] Hardin, J.; Rocke, D. M., The distribution of robust distances, Journal of Computational and Graphical Statistics, 14, 910-927 (2005)
[17] Huber, P. J., Robust Statistics (1981), Wiley: New York · Zbl 0536.62025
[18] Hubert, M.; Rousseeuw, P. J.; Branden, K. V., ROBPCA: A new approach to robust principal component analysis, Technometrics, 47, 64-79 (2005)
[19] Hubert, M.; Rousseeuw, P. J.; Verboven, S., A fast method for robust principal components with applications to chemometrics, Chemometrics and Intelligent Laboratory Systems, 60, 101-111 (2002)
[20] Ibazizen, M.; Dauxois, J., A robust principal component analysis, Statistics, 37, 73-83 (2003) · Zbl 1013.62068
[21] Jolliffe, I. T., Principal Component Analysis (2002), Springer · Zbl 1011.62064
[22] Kim, D.; Lee, I.-B., Process monitoring based on probabilistic PCA, Chemometrics and Intelligent Laboratory Systems, 67, 109-123 (2003)
[23] Kotz, S.; Nadarajah, S., Multivariate \(t\) Distributions and Their Applications (2004), Cambridge University Press · Zbl 1100.62059
[24] Lange, K. L.; Little, R. J.A.; Taylor, J. M.G., Robust statistical modeling using the \(t\) distribution, Journal of the American Statistical Association, 84, 881-896 (1989)
[25] Li, G.; Chen, Z., Projection-pursuit approach to robust dispersion matrices and principal components: Primary theory and Monte Carlo, Journal of the American Statistical Association, 80, 759-766 (1985) · Zbl 0595.62060
[26] Little, R. J.A.; Rubin, D. B., Statistical Analysis with Missing Data (1987), Wiley: Chichester · Zbl 0665.62004
[27] Liu, C., ML estimation of the multivariate \(t\) distribution and the EM algorithm, Journal of Multivariate Analysis, 63, 296-312 (1997) · Zbl 0884.62059
[28] Miller, P.; Swanson, R. E.; Heckler, C. F., Contribution plots: A missing link in multivariate quality control, International Journal of Applied Mathematics and Computer Science, 8, 775-792 (1998) · Zbl 0925.93034
[29] Peel, D.; McLachlan, G. J., Robust mixture modelling using the \(t\) distribution, Statistics and Computing, 10, 339-348 (2000)
[30] Qin, S. J., Statistical process monitoring: Basics and beyond, Journal of Chemometrics, 17, 480-502 (2003)
[31] Rocke, D. M.; Woodruff, D. L., Multivariate outlier detection and robust covariance matrix estimation — Discussion, Technometrics, 43, 300-303 (2001)
[32] Rousseeuw, P. J.; van Driessen, K., A fast algorithm for the minimum covariance determinant estimator, Technometrics, 41, 212-223 (1999)
[33] Ruymgaart, F. H., A robust principal component analysis, Journal of Multivariate Analysis, 11, 485-497 (1981) · Zbl 0539.62063
[34] Schick, I. C.; Mitter, S. K., Robust recursive estimation in the presence of heavy-tailed observation noise, Annals of Statistics, 22, 1045-1080 (1994) · Zbl 0815.62014
[35] Tipping, M. E.; Bishop, C. M., Mixtures of probabilistic principal component analysers, Neural Computation, 11, 443-482 (1999)
[36] Tipping, M. E.; Bishop, C. M., Probabilistic principal component analysis, Journal of the Royal Statistical Society B, 61, 611-622 (1999) · Zbl 0924.62068
[37] Wilks, S., Mathematical Statistics (1962), Wiley: New York · Zbl 0173.45805
[38] Yue, H.; Qin, S., Reconstruction based fault identification using a combined index, Industrial and Engineering Chemistry Research, 40, 4403-4414 (2001)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.