Model-based clustering of censored data via mixtures of factor analyzers. (English) Zbl 1496.62109

Summary: Mixtures of factor analyzers (MFA) provide a promising tool for modeling and clustering high-dimensional data, in which a very large number of attributes are measured on individuals arising from a heterogeneous population. Owing to the limitations of experimental apparatus, measurements can be restricted to lower and/or upper detection bounds, so the data may be censored. In this paper, we extend the MFA to accommodate censored data; the new model is called the MFA with censoring (MFAC). A computationally feasible alternating expectation conditional maximization (AECM) algorithm is developed to carry out maximum likelihood estimation of the MFAC model. Practical issues related to model-based clustering and recovery of censored data are also discussed. Simulation studies are conducted to examine the effect of censoring on classification, estimation and cluster validation. We also present an application of the proposed approach to two real data examples that contain left-censored observations.
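
As a rough illustration of the setting described in the summary, the following Python sketch simulates data from a two-component mixture of factor analyzers and then left-censors them at a detection limit. It is not the authors' MFAC implementation and does not perform AECM estimation; the function name `simulate_left_censored_mfa`, the parameter values and the single common detection limit are assumptions made purely for illustration.

```python
import numpy as np


def simulate_left_censored_mfa(n, weights, means, loadings, noise_vars, limit, seed=0):
    """Draw n points from a mixture of factor analyzers, then left-censor them.

    Component i generates y = means[i] + loadings[i] @ u + e, where
    u ~ N(0, I_q) and e ~ N(0, diag(noise_vars[i])).
    Values below `limit` are replaced by `limit` and flagged as censored.
    (Hypothetical helper for illustration only.)
    """
    rng = np.random.default_rng(seed)
    g = len(weights)
    p, q = loadings[0].shape
    labels = rng.choice(g, size=n, p=weights)  # latent component memberships
    y = np.empty((n, p))
    for i in range(g):
        idx = np.flatnonzero(labels == i)
        u = rng.standard_normal((idx.size, q))  # latent factors
        e = rng.standard_normal((idx.size, p)) * np.sqrt(noise_vars[i])  # specific noise
        y[idx] = means[i] + u @ loadings[i].T + e
    censored = y < limit  # True where the true value falls below the detection limit
    observed = np.where(censored, limit, y)
    return observed, censored, labels


# Assumed toy configuration: p = 5 attributes, q = 2 factors, g = 2 components.
means = [np.zeros(5), np.full(5, 3.0)]
loadings = [np.full((5, 2), 0.5), np.full((5, 2), 0.3)]
noise_vars = [np.full(5, 0.2), np.full(5, 0.2)]
obs, cens, lab = simulate_left_censored_mfa(
    500, [0.4, 0.6], means, loadings, noise_vars, limit=0.0
)
print("fraction of censored entries:", cens.mean())
```

The returned boolean mask records which entries were censored, which is the information a censoring-aware method would need in addition to the observed values.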

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62N01 Censored data models

Software:

CensMixReg; AS 136

References:

[1] Anderson, T. W., An Introduction to Multivariate Statistical Analysis (2003), Wiley and Sons: New York · Zbl 1039.62044
[2] Azzalini, A.; Capitanio, A., Statistical applications of the multivariate skew-normal distribution, J. R. Stat. Soc. Ser. B Stat. Methodol., 61, 579-602 (1999) · Zbl 0924.62050
[3] Azzalini, A.; Capitanio, A., Distributions generated by perturbation of symmetry with emphasis on a multivariate skew \(t\)-distribution, J. R. Stat. Soc. Ser. B Stat. Methodol., 65, 367-389 (2003) · Zbl 1065.62094
[4] Azzalini, A.; Dalla Valle, A., The multivariate skew-normal distribution, Biometrika, 83, 715-726 (1996) · Zbl 0885.62062
[5] Baek, J.; McLachlan, G. J.; Flack, L. K., Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualization of high-dimensional data, IEEE Trans. Pattern Anal. Mach. Intell., 32, 1-13 (2010)
[6] Bhattacharjee, A.; Richards, W. G.; Staunton, J.; Li, C.; Monti, S.; Vasa, P.; Ladd, C.; Beheshti, J.; Bueno, R.; Gillette, M.; Loda, M.; Weber, G.; Mark, E. J.; Lander, E. S.; Wong, W.; Johnson, B. E.; Golub, T. R.; Sugarbaker, D. J.; Meyerson, M., Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses, Proc. Natl. Acad. Sci., 98, 24, 13790-13795 (2001)
[7] Biernacki, C.; Celeux, G.; Govaert, G., Assessing a mixture model for clustering with the integrated complete likelihood, IEEE Trans. Pattern Anal. Mach. Intell., 22, 7, 719-725 (2000)
[8] Castro, L. M.; Costa, D. R.; Prates, O. M.; Lachos, V. H., Likelihood-based inference for Tobit confirmatory factor analysis using the multivariate Student-\(t\) distribution, Stat. Comput., 25, 1163-1183 (2015) · Zbl 1331.62294
[9] Caudill, S. B., A partially adaptive estimator for the censored regression model based on a mixture of normal distributions, Statist. Meth. Appl., 21, 121-137 (2012)
[10] Li, C.; Wong, W. H., Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application, Genome Biol., 2, 8, 1-11 (2001)
[11] Cohen, A. C., On the solution of estimating equations for truncated and censored samples from normal populations, Biometrika, 44, 225-236 (1957) · Zbl 0080.35603
[12] Cohen, A. C., Simplified estimators for the normal distribution when samples are singly censored or truncated, Technometrics, 1, 3, 217-237 (1959)
[13] Costa, D. R.; Lachos, V. H.; Bazan, J. L.; Azevedo, C. L.N., Estimation methods for multivariate tobit confirmatory factor analysis, Comput. Statist. Data Anal., 79, 248-260 (2014) · Zbl 1506.62048
[14] Dempster, A. P.; Laird, N. M.; Rubin, D. B., Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Stat. Soc. Ser. B Stat. Methodol., 39, 1-38 (1977) · Zbl 0364.62022
[15] Fokoué, E.; Titterington, D. M., Mixtures of factor analyzers. Bayesian estimation and inference by stochastic simulation, Mach. Learn., 50, 73-94 (2003) · Zbl 1033.68085
[16] Ghahramani, Z.; Hinton, G. E., The EM algorithm for mixtures of factor analyzers, Technical Report No. CRG-TR-96-1 (1997), The University of Toronto: Toronto
[17] Hartigan, J. A.; Wong, M. A., Algorithm AS 136: A K-means clustering algorithm, Appl. Statist., 28, 100-108 (1979) · Zbl 0447.62062
[18] He, J., Mixture model based multivariate statistical analysis of multiply censored environmental data, Adv. Water Resour., 59, 15-24 (2013)
[19] Hewett, P.; Ganser, G. H., A comparison of several methods for analyzing censored data, Ann. Occup. Hyg., 51, 7, 611-632 (2007)
[20] Hinton, G.; Dayan, P.; Revow, M., Modeling the manifolds of images of handwritten digits, IEEE Trans. Neural Netw., 8, 65-73 (1997)
[21] Hoffman, H. J.; Johnson, R. E., Estimation of multiple trace metal water contaminants in the presence of left-censored and missing data, J. Environ. Statist., 2, 2, 1-16 (2011)
[22] Hoffman, H.; Johnson, R., Pseudo-likelihood estimation of multivariate normal parameters in the presence of left-censored data, J. Agric. Biol. Environ. Stat., 20, 156-171 (2015) · Zbl 1325.62211
[23] Horrace, W. C., Some results on the multivariate truncated normal distribution, J. Multivariate Anal., 94, 1, 209-221 (2005) · Zbl 1065.62098
[24] Hubert, L.; Arabie, P., Comparing partitions, J. Classif., 2, 193-218 (1985)
[25] Hughes, J. P., Mixed-effects models with censored data with application to HIV RNA levels, Biometrics, 55, 625-629 (1999) · Zbl 1059.62661
[26] Karlsson, M.; Laitila, T., Finite mixture modeling of censored regression models, Statist. Pap., 55, 627-642 (2014) · Zbl 1416.62215
[27] Kotz, S.; Nadarajah, S., Multivariate \(t\) Distributions and their Applications (2004), Cambridge University Press: Cambridge · Zbl 1100.62059
[28] Lachos, V. H.; Ghosh, P.; Arellano-Valle, R. B., Likelihood based inference for skew-normal independent linear mixed models, Statist. Sinica, 20, 303-322 (2010) · Zbl 1186.62071
[29] Lachos, V. H.; López Moreno, E. J.; Chen, K.; Cabral, C. R.B., Finite mixture modeling of censored data using the multivariate Student-\(t\) distribution, J. Multivariate Anal., 159, 151-167 (2017) · Zbl 1397.62221
[30] Ledermann, W., On the rank of the reduced correlational matrix in multiple factor analysis, Psychometrika, 2, 2, 85-93 (1937) · JFM 63.1109.03
[31] Lee, W. L.; Chen, Y. C.; Hsieh, K. S., Ultrasonic liver tissues classification by fractal feature vector based on M-band wavelet transform, IEEE Trans. Med. Imaging, 22, 382-392 (2003)
[32] Lin, T. I.; McLachlan, G. J.; Lee, S. X., Extending mixtures of factor models using the restricted multivariate skew-normal distribution, J. Multivariate Anal., 143, 398-413 (2016) · Zbl 1328.62378
[33] Liu, M.; Lin, T. I., A skew-normal mixture regression model, Educ. Psychol. Meas., 74, 139-162 (2014)
[34] McLachlan, G. J.; Bean, R. W.; Peel, D., A mixture model-based approach to the clustering of microarray expression data, Bioinformatics, 18, 413-422 (2002)
[35] McLachlan, G. J.; Peel, D., Finite Mixture Models (2000), Wiley: New York · Zbl 0963.62061
[36] McLachlan, G. J.; Peel, D.; Bean, R. W., Modelling high-dimensional data by mixtures of factor analyzers, Comput. Stat. Data Anal., 41, 379-388 (2003) · Zbl 1256.62036
[37] McNicholas, P. D.; ElSherbiny, A.; Jampani, R. K.; McDaid, A. F.; Murphy, B.; Banks, L., pgmm: Parsimonious Gaussian Mixture Models (2015), http://CRAN.R-project.org/package=pgmm
[38] McNicholas, P. D.; Murphy, T. B., Parsimonious Gaussian mixture models, Stat. Comput., 18, 3, 285-296 (2008)
[39] McNicholas, P. D.; Murphy, T. B., Model based clustering of microarray expression data via latent Gaussian mixture models, Bioinformatics, 26, 21, 2705-2712 (2010)
[40] McNicholas, P. D.; Murphy, T. B.; McDaid, A. F.; Frost, D., Serial and parallel implementations of model based clustering via parsimonious Gaussian mixture models, Comput. Stat. Data Anal., 54, 3, 711-723 (2010) · Zbl 1464.62131
[41] Meng, X. L.; van Dyk, D., The EM algorithm - an old folk-song sung to a fast new tune, J. R. Stat. Soc. Ser. B Stat. Methodol., 59, 511-567 (1997) · Zbl 1090.62518
[42] Meng, X. L.; Rubin, D. B., Maximum likelihood estimation via the ECM algorithm: a general framework, Biometrika, 80, 267-278 (1993) · Zbl 0778.62022
[43] Monti, S.; Tamayo, P.; Mesirov, J.; Golub, T., Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data, Mach. Learn., 52, 91-118 (2003) · Zbl 1039.68103
[44] Papastamoulis, P., Overfitting Bayesian mixtures of factor analyzers with an unknown number of components, Comput. Stat. Data Anal., 124, 220-234 (2018) · Zbl 1469.62125
[45] Papastamoulis, P., fabMix: R code for Overfitting Bayesian mixtures of factor analyzers with an unknown number of components (2018b), https://github.com/mqbssppe/overfittingFABMix/ · Zbl 1469.62125
[46] Powell, J. L., Least absolute deviations estimation for the censored regression model, J. Econometrics, 25, 303-325 (1984) · Zbl 0571.62100
[47] Sahu, S. K.; Dey, D. K.; Branco, M. D., A new class of multivariate skew distributions with applications to Bayesian regression models, Can. J. Stat., 31, 129-150 (2003) · Zbl 1039.62047
[48] Schwarz, G., Estimating the dimension of a model, Ann. Statist., 6, 461-464 (1978) · Zbl 0379.62005
[49] Shumway, R. H.; Azari, R. S.; Johnson, P., Estimating mean concentrations under transformation for environmental data with detection limits, Technometrics, 31, 3, 347-356 (1989)
[50] Singh, A.; Nocerino, J., Robust estimation of mean and variance using environmental data sets with below detection limit observations, Chemom. Intell. Lab. Syst., 60, 69-86 (2002)
[51] Spearman, C., General intelligence, objectively determined and measured, Am. J. Psychol., 15, 201-292 (1904)
[52] Stephens, M., Bayesian analysis of mixture models with an unknown number of components - an alternative to reversible jump methods, Ann. Statist., 28, 40-74 (2000) · Zbl 1106.62316
[53] Stephens, M., Dealing with label switching in mixture models, J. R. Stat. Soc. Ser. B Stat. Methodol., 62, 795-809 (2000) · Zbl 0957.62020
[54] Ullman, J. B., Structural equation modeling: reviewing the basics and moving forward, J. Pers. Assess., 87, 1, 35-50 (2006)
[55] VDEQ, 2003. The Quality of Virginia Non-Tidal Streams: First Year Report. VDEQ Technical Bulletin WQA/2002-2001, Office of Water Quality and Assessments, Virginia Department of Environmental Quality.
[56] VDEQ, 2008. Virginia Water Quality Assessment. Integrated Report 305(b)/303(d) Virginia Department of Environmental Quality.
[57] VDEQ, 2009. Virginia Water Quality Standards. Technical Report Regulation 9 VAC 25-260, State Water Control Board, Virginia Department of Environmental Quality.
[58] Wang, W. L., Mixture of multivariate \(t\) linear mixed models for multi-outcome longitudinal data with heterogeneity, Statist. Sinica, 27, 733-760 (2017) · Zbl 1391.62124
[59] Wang, W. L.; Lin, T. I.; Lachos, V. H., Extending multivariate-\(t\) linear mixed models for multiple longitudinal data with censored responses and heavy tails, Stat. Methods Med. Res., 27, 1, 48-64 (2018)
[60] Yao, W., Label switching and its simple solutions for frequentist mixture models, J. Stat. Comput. Simul., 85, 1000-1012 (2015) · Zbl 1457.62030
[61] Yao, W.; Lindsay, B. G., Bayesian mixture labeling by highest posterior density, J. Amer. Statist. Assoc., 104, 758-767 (2009) · Zbl 1388.62007
[62] Zeller, C. B.; Cabral, C. R.; Lachos, V. H.; Benites, L., Finite mixture of regression models for censored data based on scale mixtures of normal distributions, Adv. Data Anal. Classif. (2018)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.