Unifying data units and models in (co-)clustering. (English) Zbl 1459.62105

Summary: Statisticians are already aware that any task (exploration, prediction) involving a modeling process depends largely on the measurement units of the data, to the extent that it should be impossible to report a statistical outcome without specifying the couple (unit, model). In this work, this general principle is formalized with a particular focus on model-based clustering and co-clustering for possibly mixed data types (continuous and/or categorical and/or count features), and the opportunity is taken to revisit what the related data units are. This formalization raises three important points: (i) the couple (unit, model) is not identifiable, so that different unit/model interpretations of the same overall modeling process are always possible; (ii) combining different “classical” units with different “classical” models offers a cheap, wide and meaningful expansion of the family of modeling processes defined by the couple (unit, model); (iii) if necessary, this couple, up to the non-identifiability property, can be selected by any traditional model selection criterion. Experiments on real data sets illustrate in detail the practical benefits arising from these three points.
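
Points (i) and (iii) can be illustrated with a minimal sketch. It is not taken from the paper: the toy sample, the function names, and the choice of a single Gaussian component (rather than a mixture) are all assumptions made for illustration. The sketch checks that the couple (log unit, Gaussian model) and the couple (raw unit, log-normal model) give the same likelihood once the change of unit is accounted for through its Jacobian, and then compares two couples with BIC on the common raw scale.

```python
import math


def gaussian_loglik(values):
    """Maximised Gaussian log-likelihood (MLE mean and variance)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def loglik_log_unit_gaussian(x):
    """Couple (unit = log x, model = Gaussian), expressed on the raw scale.

    The Jacobian of x -> log x is 1/x, so sum(log(1/x)) is added
    to make the likelihood comparable with models fitted on raw x.
    """
    logs = [math.log(v) for v in x]
    return gaussian_loglik(logs) - sum(logs)


def loglik_raw_unit_lognormal(x):
    """Couple (unit = x, model = log-normal), computed directly on raw x."""
    n = len(x)
    logs = [math.log(v) for v in x]
    mu = sum(logs) / n
    s2 = sum((l - mu) ** 2 for l in logs) / n
    return sum(
        -math.log(v) - 0.5 * math.log(2.0 * math.pi * s2)
        - (math.log(v) - mu) ** 2 / (2.0 * s2)
        for v in x
    )


def bic(loglik, n_params, n_obs):
    """Schwarz criterion in the 'smaller is better' convention."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Hypothetical right-skewed positive sample (illustration only).
x = [0.5, 0.8, 1.0, 1.3, 2.1, 3.5, 6.0, 11.0]

# (i) Two unit/model readings of the same modeling process.
ll_a = loglik_log_unit_gaussian(x)
ll_b = loglik_raw_unit_lognormal(x)

# (iii) Selecting between two couples; both have 2 free parameters.
bic_raw_gaussian = bic(gaussian_loglik(x), 2, len(x))
bic_log_gaussian = bic(ll_a, 2, len(x))
best_couple = "log/Gaussian" if bic_log_gaussian < bic_raw_gaussian else "raw/Gaussian"
```

Putting every couple on the same raw scale through the Jacobian term is what makes the criteria comparable; without it, a mere change of unit would alter the likelihood and could change the selected couple.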


62H30 Classification and discrimination; cluster analysis (statistical aspects)

