A latent variables approach for clustering mixed binary and continuous variables within a Gaussian mixture model. (English) Zbl 1284.62384

Summary: When clustering objects, we often collect binary attributes as well as continuous variables. This paper proposes a model-based clustering approach for mixed binary and continuous variables in which each binary attribute is generated by a latent continuous variable that is dichotomized at a suitable threshold value, and the scores of the latent variables are estimated from the binary data. In economics, such latent variables are called utility functions, and the assumption is that the binary attributes (the presence or absence of a public service or utility) are determined by low and high values of these functions. In genetics, the latent response is interpreted as the ‘liability’ to develop a qualitative trait or phenotype. The estimated scores of the latent variables, together with the observed continuous variables, make it possible to use a multivariate Gaussian mixture model for clustering instead of a mixture of discrete and continuous distributions. After describing the method, the paper presents results on both simulated and real data and compares the performance of the multivariate Gaussian mixture model with that of a mixture of joint multivariate and multinomial distributions. The results show that the former model outperforms the latter for variables with different scales, both in terms of classification error rate and of reproduction of the cluster means.
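
The latent-variable idea in the summary can be illustrated with a minimal sketch. It is not the estimator used in the paper: it treats a single binary attribute in isolation, assumes a standard normal latent variable, recovers the threshold from the observed share of ones, and scores each observation by the mean of the correspondingly truncated normal. The function names (`dichotomize`, `estimate_threshold`, `latent_scores`) and the simulated data are hypothetical and chosen only for illustration.

```python
import random
from statistics import NormalDist

# Assumption: the latent variable behind the binary attribute is standard normal.
STD_NORMAL = NormalDist()


def dichotomize(latent, threshold):
    """Return 1 where the latent value exceeds the threshold, else 0."""
    return [1 if z > threshold else 0 for z in latent]


def estimate_threshold(binary):
    """Estimate the threshold tau from the share of ones, using P(x=1) = 1 - Phi(tau)."""
    n = len(binary)
    p = sum(binary) / n
    # Keep p strictly inside (0, 1) so the inverse CDF is defined.
    eps = 0.5 / n
    p = min(max(p, eps), 1 - eps)
    return STD_NORMAL.inv_cdf(1 - p)


def latent_scores(binary, threshold):
    """Score each 0/1 value by the mean of a standard normal truncated at the threshold."""
    pdf = STD_NORMAL.pdf(threshold)
    cdf = STD_NORMAL.cdf(threshold)
    score_one = pdf / (1 - cdf)   # E[z | z > tau]
    score_zero = -pdf / cdf       # E[z | z <= tau]
    return [score_one if x == 1 else score_zero for x in binary]


# Simulated illustration: a hidden continuous variable cut at tau = 0.5.
rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(5000)]
binary = dichotomize(latent, threshold=0.5)

tau_hat = estimate_threshold(binary)
scores = latent_scores(binary, tau_hat)
```

In this simplified setting each binary attribute yields only two distinct scores. The approach described in the summary estimates the scores from the binary data as a whole, and the resulting continuous columns are then clustered together with the observed continuous variables in a multivariate Gaussian mixture model.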


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P20 Applications of statistics to economics
91C20 Clustering in the social and behavioral sciences

