Model-based clustering of high-dimensional data streams with online mixture of probabilistic PCA. (English) Zbl 1273.62137

Summary: Model-based clustering is a popular tool which is renowned for its probabilistic foundations and its flexibility. However, model-based clustering techniques usually perform poorly when dealing with high-dimensional data streams, which are nowadays a frequent data type. To overcome this limitation of model-based clustering, we propose an online inference algorithm for the mixture of probabilistic PCA models. The proposed algorithm relies on an EM-based procedure and on a probabilistic and incremental version of PCA. Model selection is also considered in the online setting through parallel computing. Numerical experiments on simulated and real data demonstrate the effectiveness of our approach and compare it to state-of-the-art online EM-based algorithms.


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H25 Factor analysis and principal components; correspondence analysis
65Y05 Parallel numerical computation
65C60 Computational problems in statistics (MSC2010)


Full Text: DOI HAL


[1] Aggarwal C, Han J, Wang J, Yu P (2004) A framework for projected clustering of high dimensional data streams. In: Proceedings of the 30th International Conference on very large data bases, vol. 30. VLDB Endowment, pp 852–863
[2] Akaike H (1981) Likelihood of a model and information criteria. J Econom 16(1):3–14 · Zbl 0457.62032
[3] Arandjelović O, Cipolla R (2005) Incremental learning of temporally-coherent Gaussian mixture models. In: Proceedings of the British Machine Vision Conference. Oxford, UK, pp 759–768
[4] Babcock B, Datar M, Motwani R, O’Callaghan L (2003) Maintaining variance and k-medians over data stream windows. In: Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on principles of database systems. ACM, pp 234–243
[5] Baek J, McLachlan G, Flack L (2010) Mixtures of factor analyzers with common factor loadings: Applications to the clustering and visualization of high-dimensional data. Pattern Anal Mach Intell IEEE Trans 32(7):1298–1309
[6] Bartholomew D, Knott M, Moustaki I (2011) Latent variable models and factor analysis: a unified approach, vol 899. Wiley, New York · Zbl 1266.62040
[7] Basilevsky A (2009) Statistical factor analysis and related methods: theory and applications, vol 418. Wiley-Interscience, New York · Zbl 1130.62341
[8] Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. Pattern Anal Mach Intell IEEE Trans 22(7):719–725
[9] Bouveyron C, Brunet C (2012) Simultaneous model-based clustering and visualization in the fisher discriminative subspace. Stat Comput 22(1):301–324 · Zbl 1322.62162
[10] Bouveyron C, Girard S, Schmid C (2007a) High-dimensional data clustering. Comput Stat Data Anal 52(1):502–519 · Zbl 1452.62433
[11] Bouveyron C, Girard S, Schmid C (2007b) High-dimensional discriminant analysis. Commun Stat Theory Methods 36(14):2607–2623 · Zbl 1128.62072
[12] Cappé O, Moulines E (2009) Online EM algorithm for latent data models. R Stat Soc: Ser B (Stat Methodol) 71:1–21. http://arxiv.org/pdf/0712.4273 · Zbl 1250.62015
[13] Celeux G, Govaert G (1992) A classification em algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332 · Zbl 0937.62605
[14] Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–38. doi: 10.2307/2984875 · Zbl 0364.62022
[15] Domingos P, Hulten G (2001) A general method for scaling up machine learning algorithms and its application to clustering. In: Proceedings of the 18th International Conference on Machine Learning, pp 106–113
[16] Duda R, Har, P, Stork D (1995) Pattern classification and scene analysis, 2nd edn
[17] Figueiredo M, Jain A (2002) Unsupervised learning of finite mixture models. Pattern Anal Mach Intell IEEE Trans 24(3):381–396
[18] Fraley C, Raftery A (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631 · Zbl 1073.62545
[19] Gaber M, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26 · Zbl 1087.68557
[20] Ghahramani Z, Hinton G et al (1996) The em algorithm for mixtures of factor analyzers. Tech. rep., Technical Report CRG-TR-96-1, University of Toronto
[21] Guha S, Mishra N, Motwani R, O’Callaghan L (2000) Clustering data streams. In: Foundations of Computer Science, 2000. In: Proceedings of 41st Annual Symposium on IEEE, pp 359–366
[22] Hall P, Hicks Y, Robinson T (2005) A method to add gaussian mixture models. Technical report, University of Bath
[23] Hall P, Marshall D, Martin R (1998) Incremental eigenanalysis for classification. In: British Machine Vision Conference, vol 1. Citeseer, pp 286–295
[24] Jacques J, Bouveyron C, Girard S, Devos O, Duponchel L, Ruckebusch C (2010) Gaussian mixture models for the classification of high-dimensional vibrational spectroscopy data. J Chemom 24(11–12):719–727
[25] Lindsay B (1995) Mixture models: theory, geometry and applications. In: JSTOR NSF-CBMS Regional Conference Series in probability and statistics. · Zbl 1163.62326
[26] MacQueen J et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on mathematical statistics and probability, vol. 1. California, USA, p 14 · Zbl 0214.46201
[27] McLachlan G, Krishnan T (1997) The em algorithm and extensions. Wiley-Interscience, New York · Zbl 0882.62012
[28] McLachlan G, Peel D (2000) Finite mixture models, vol 299. Wiley-Interscience, New York · Zbl 0963.62061
[29] McLachlan G, Peel D, Bean R (2003) Modelling high-dimensional data by mixtures of factor analyzers. Comput Stat Data Anal 41(3):379–388 · Zbl 1256.62036
[30] McNicholas P, Murphy B (2008) Parsimonious Gaussian mixture models. Stat Comput 18(3):285–296
[31] McNicholas P, Murphy T, McDaid A, Frost D (2010) Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput Stat Data Anal 54(3):711–723 · Zbl 1464.62131
[32] Neal R, Hinton G (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. Learn Graph Models 89:355–368 · Zbl 0916.62019
[33] O’callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of 18th International Conference on Data Engineering, pp 685–694
[34] Samé A, Ambroise C, Govaert G (2007) An online classification EM algorithm based on the mixture model. Stat Comput 17(3):209–218. doi: 10.1007/s11222-007-9017-z
[35] Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464 · Zbl 0379.62005
[36] Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15(1):72–101
[37] Tipping M, Bishop C (1999) Mixtures of probabilistic principal component analyzers. Neural Comput 11(2):443–482
[38] Titterington D (1984) Recursive parameter estimation using incomplete data. J R Stat Soc Ser B (Methodol) 46(2):257–267 · Zbl 0556.62061
[39] Ueda N, Nakano R, Ghahramani Z, Hinton G (2000) Smem algorithm for mixture models. Neural Comput 12(9):2109–2128
[40] Wang WL, Lin TI (2013) An efficient ecm algorithm for maximum likelihood estimation in mixtures of t-factor analyzers. Comput Stat 28(2):751–759 · Zbl 1305.65082
[41] Wu C (1983) On the convergence properties of the em algorithm. Ann Stat 11(1):95–103 · Zbl 0517.62035
[42] Zhao JH, Yu PL (2008) Fast ml estimation for the mixture of factor analyzers via an ecm algorithm. Neural Netw IEEE Trans 19(11):1956–1961
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.