Model based clustering for mixed data: clustMD. (English) Zbl 1414.62254

Summary: A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.
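The generative idea in the summary can be illustrated with a minimal simulation sketch. This is not the clustMD implementation, and none of its names come from the clustMD R package: the function `simulate_mixed`, the parameter layout, the diagonal latent covariance, and the rule mapping a latent vector to a nominal level are all assumptions made for illustration. A cluster label is drawn first, then a latent Gaussian vector; the continuous variable is observed directly, the ordinal variable is obtained by thresholding (a binary variable is the one-threshold case), and the nominal variable is taken from the largest coordinate of a latent sub-vector.

```python
import random
from bisect import bisect_right


def simulate_mixed(n, weights, means, sds, thresholds, seed=0):
    """Simulate n rows of (continuous, ordinal, nominal) data.

    Hypothetical layout of the 4-dimensional latent vector z:
      z[0]   -> continuous variable, observed as is
      z[1]   -> ordinal variable, cut at the sorted `thresholds`
      z[2:4] -> nominal variable with 3 levels
    The latent covariance is assumed diagonal within each cluster.
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        # Draw the cluster membership from the mixing weights.
        g = rng.choices(range(len(weights)), weights=weights)[0]
        # Draw the latent Gaussian vector for that cluster.
        z = [rng.gauss(m, s) for m, s in zip(means[g], sds[g])]

        continuous = z[0]
        # Ordinal level = 1 + number of thresholds at or below z[1].
        ordinal = bisect_right(thresholds, z[1]) + 1
        # Nominal level (assumed rule): level 1 if every coordinate is
        # negative, otherwise 2 + index of the largest coordinate.
        nom = z[2:]
        best = max(range(len(nom)), key=nom.__getitem__)
        nominal = 1 if nom[best] < 0 else best + 2

        rows.append((continuous, ordinal, nominal))
        labels.append(g)
    return rows, labels


# Two clusters with illustrative (made-up) parameters.
weights = [0.6, 0.4]
means = [[0.0, -1.0, -1.0, -1.0], [3.0, 1.5, 1.0, -1.0]]
sds = [[1.0, 1.0, 1.0, 1.0], [0.5, 1.0, 1.0, 1.0]]
thresholds = [0.0, 1.0]  # gives ordinal levels 1, 2, 3

rows, labels = simulate_mixed(200, weights, means, sds, thresholds, seed=1)
```

Fitting such a model reverses this process: the latent vectors and cluster labels are unobserved, which is why the summary refers to an EM algorithm, with a Monte Carlo variant when nominal variables are present.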

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T10 Pattern recognition, speech recognition
91C20 Clustering in the social and behavioral sciences
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

clustMD; bfa; MULTIMIX; mclust; R

References:

[1] Andrews DF, Herzberg AM (1985) Data: a collection of problems from many fields for the student and research worker. Springer, New York · Zbl 0567.62002
[2] Banfield, JD; Raftery, AE, Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821, (1993) · Zbl 0794.62034
[3] Browne, RP; McNicholas, PD, Model-based clustering and classification of data with mixed type, J Stat Plan Inference, 142, 2976-2984, (2012) · Zbl 1335.62093
[4] Byar, DP; Green, SB, The choice of treatment for cancer patients based on covariate information: application to prostate cancer, Bull du Cancer, 67, 477-490, (1980)
[5] Cagnone, S.; Viroli, C., A factor mixture analysis model for multivariate binary data, Stat Model, 12, 257-277, (2012)
[6] Cai, JH; Song, XY; Lam, KH; Ip, EHS, A mixture of generalized latent variable models for mixed mode and heterogeneous data, Comput Stat Data Anal, 55, 2889-2907, (2011) · Zbl 1218.62012
[7] Celeux, G.; Govaert, G., Gaussian parsimonious clustering models, Pattern Recognit, 28, 781-793, (1995)
[8] Dempster, AP; Laird, NM; Rubin, DB, Maximum likelihood from incomplete data via the EM algorithm, J R Stat Soc Ser B (Methodological), 39, 1-38, (1977) · Zbl 0364.62022
[9] Everitt, BS, A finite mixture model for the clustering of mixed-mode data, Stat Probab Lett, 6, 305-309, (1988)
[10] Fox JP (2010) Bayesian Item Response Modeling. Springer, New York · Zbl 1271.62012
[11] Fraley, C.; Raftery, AE, Model-based clustering, discriminant analysis, and density estimation, J Am Stat Assoc, 97, 611-631, (2002) · Zbl 1073.62545
[12] Fraley C, Raftery AE, Murphy TB, Scrucca L (2012) mclust version 4 for R: normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report No. 597, Department of Statistics, University of Washington
[13] Frühwirth-Schnatter S (2006) Finite mixture and Markov switching models. Springer, New York · Zbl 1108.62002
[14] Geweke, J.; Keane, M.; Runkle, D., Alternative computational approaches to inference in the multinomial probit model, Rev Econ Stat, 76, 609-632, (1994)
[15] Gollini I, Murphy TB (2014) Mixture of latent trait analyzers for model-based clustering of categorical data. Stat Comput 24(4):569-588 · Zbl 1325.62122
[16] Gruhl, J.; Erosheva, EA; Crane, P., A semiparametric approach to mixed outcome latent variable models: Estimating the association between cognition and regional brain volumes, Ann Appl Stat, 7, 2361-2383, (2013) · Zbl 1283.62218
[17] Hunt, L.; Jorgensen, M., Mixture model clustering using the MULTIMIX program, Aust N Z J Stat, 41, 153-171, (1999) · Zbl 0962.62061
[18] Johnson VE, Albert JH (1999) Ordinal data modeling. Springer, New York · Zbl 0921.62141
[19] Karlis, D.; Santourian, A., Model-based clustering with non-elliptically contoured distributions, Stat Comput, 19, 73-83, (2009)
[20] Kass, RE; Raftery, AE, Bayes factors, J Am Stat Assoc, 90, 773-795, (1995) · Zbl 0846.62028
[21] Kosmidis I, Karlis D (2015) Model-based clustering using copulas with applications. Stat Comput 1-21. doi:10.1007/s11222-015-9590-5 · Zbl 06652996
[22] Lawrence, CJ; Krzanowski, WJ, Mixture separation for mixed-mode data, Stat Comput, 6, 85-92, (1996)
[23] Marbac M, Biernacki C, Vandewalle V (2015) Model-based clustering of Gaussian copulas for mixed data. arXiv:1405.1299 (preprint) · Zbl 1384.62198
[24] McLachlan, G.; Peel, D., Robust cluster analysis via mixtures of multivariate t-distributions, in: Amin, A.; Dori, D.; Pudil, P.; Freeman, H. (eds.), No. 1451, 658-666, (1998), Berlin
[25] McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions. Wiley, New Jersey · Zbl 1165.62019
[26] McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New Jersey · Zbl 0963.62061
[27] McParland, D.; Gormley, IC, Clustering ordinal data via latent variable models, in: Poel, D.; Ultsch, A.; Lausen, B. (eds.), 127-135, (2013), Berlin
[28] McParland, D.; Gormley, IC; McCormick, TH; Clark, SJ; Kabudula, CW; Collinson, MA, Clustering South African households based on their asset status using latent variable models, Ann Appl Stat, 8, 747-776, (2014) · Zbl 1454.62503
[29] McParland D, Gormley IC, Phillips CM, Brennan L, Roche HM (2014b) Clustering mixed continuous and categorical data from the LIPGENE metabolic syndrome study: joint analysis of phenotypic and genetic data. Technical Report, University College Dublin
[30] Morlini, I., A latent variable approach for clustering mixed binary and continuous variables within a Gaussian mixture model, Adv Data Anal Classif, 6, 5-28, (2011) · Zbl 1284.62384
[31] Murray, JS; Dunson, DB; Carin, L.; Lucas, JE, Bayesian Gaussian copula factor models for mixed data, J Am Stat Assoc, 108, 656-665, (2013) · Zbl 06195968
[32] Muthén, B.; Shedden, K., Finite mixture modeling with mixture outcomes using the EM algorithm, Biometrics, 55, 463-469, (1999) · Zbl 1059.62599
[33] O’Hagan A (2012) Topics in model based clustering and classification. PhD thesis, University College Dublin
[34] O’Hagan, A.; Murphy, TB; Gormley, IC, Computational aspects of fitting mixture models via the expectation-maximisation algorithm, Comput Stat Data Anal, 56, 3843-3864, (2012) · Zbl 1255.62180
[35] Quinn, KM, Bayesian factor analysis for mixed ordinal and continuous responses, Political Anal, 12, 338-353, (2004)
[36] R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
[37] Schwarz, G., Estimating the dimension of a model, Ann Stat, 6, 461-464, (1978) · Zbl 0379.62005
[38] Titterington DM, Smith AFM, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley, New Jersey · Zbl 0646.62013
[39] Wei, GCG; Tanner, MA, A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms, J Am Stat Assoc, 85, 699-704, (1990)
[40] Willse, A.; Boik, RJ, Identifiable finite mixtures of location models for clustering mixed-mode data, Stat Comput, 9, 111-121, (1999)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.