
Penalized model-based clustering with unconstrained covariance matrices. (English) Zbl 1326.62143

Summary: Clustering is one of the most useful tools for high-dimensional analysis, e.g., for microarray data. It becomes challenging in the presence of a large number of noise variables, which may mask underlying clustering structures. Therefore, noise removal through variable selection is necessary. One effective approach is regularization for simultaneous parameter estimation and variable selection in model-based clustering. However, existing methods focus on regularizing the mean parameters representing centers of clusters and ignore dependencies among variables within clusters, which can lead to incorrect orientations or shapes of the resulting clusters. In this article, we propose a regularized Gaussian mixture model with general covariance matrices, taking various dependencies into account. At the same time, this approach shrinks the means and covariance matrices, achieving better clustering and variable selection. To overcome one technical challenge in estimating possibly large covariance matrices, we derive an EM algorithm that uses the graphical lasso [J. Friedman et al., Biostatistics 9, No. 3, 432–441 (2008; Zbl 1143.62076)] for parameter estimation. Numerical examples, including applications to microarray gene expression data, demonstrate the utility of the proposed method.
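
As a rough illustration of the approach the summary describes, one plausible form of the penalized log-likelihood and of the resulting covariance update is sketched below. All notation is assumed here for illustration and is not taken from this record: $n$ observations $x_j$ in $K$ dimensions, $g$ clusters with mixing proportions $\pi_i$, means $\mu_i$, covariance matrices $\Sigma_i$ with inverses $W_i = \Sigma_i^{-1}$, Gaussian density $\phi$, tuning parameters $\lambda_1, \lambda_2$, and E-step posterior probabilities $\tau_{ij}$. The exact penalty used by the authors should be checked against the paper.

```latex
% Assumed penalized log-likelihood: L1 penalties on the cluster means
% and on the entries of the cluster-specific inverse covariance matrices.
\[
\log L_P(\Theta)
  = \sum_{j=1}^{n} \log\Bigl[\sum_{i=1}^{g} \pi_i\,\phi(x_j;\mu_i,\Sigma_i)\Bigr]
  - \lambda_1 \sum_{i=1}^{g}\sum_{k=1}^{K} |\mu_{ik}|
  - \lambda_2 \sum_{i=1}^{g}\sum_{k,l=1}^{K} |W_{i;kl}|
\]

% M-step for W_i, with the current means held fixed:
%   n_i = \sum_j \tau_{ij}  (effective cluster size),
%   S_i = weighted empirical covariance of cluster i.
\[
n_i = \sum_{j=1}^{n} \tau_{ij}, \qquad
S_i = \frac{1}{n_i}\sum_{j=1}^{n} \tau_{ij}\,(x_j-\mu_i)(x_j-\mu_i)^{\top}
\]
\[
\widehat{W}_i
  = \arg\max_{W \succ 0}\;
    \log\det W - \operatorname{tr}(S_i W)
    - \frac{2\lambda_2}{n_i}\sum_{k,l=1}^{K} |W_{kl}|
\]
```

Under these assumptions, each covariance update in the M-step has the form of the graphical lasso objective with penalty parameter $2\lambda_2/n_i$, which is consistent with the summary's remark that an existing graphical lasso solver (such as the glasso software listed below) can be used for parameter estimation.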

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62J07 Ridge regression; shrinkage estimators (Lasso)

Citations:

Zbl 1143.62076

Software:

HdBCS; glasso

References:

[1] Alaiya, A.A. et al. (2002). Molecular classification of borderline ovarian tumors using hierarchical cluster analysis of protein expression profiles., Int. J. Cancer , 98 , 895-899.
[2] Antonov, A.V., Tetko, I.V., Mader, M.T., Budczies, J. and Mewes, H.W. (2004). Optimization models for cancer classification: extracting gene interaction information from microarray expression data., Bioinformatics , 20 , 644-652.
[3] Baker, S.G. and Kramer, B.S. (2006). Identifying genes that contribute most to good classification in microarray., BMC Bioinformatics , 7 , 407.
[4] Banfield, J.D. and Raftery, A.E. (1993). Model-Based Gaussian and Non-Gaussian Clustering., Biometrics , 49 , 803-821. · Zbl 0794.62034
[5] Bardi, E., Bobok, I., Olah, A.V., Olah, E., Kappelmayer, J. and Kiss, C. (2004). Cystatin C is a suitable marker of glomerular function in children with cancer., Pediatric Nephrology , 19 , 1145-1147.
[6] Carvalho, C.M. and Scott, J.G. (2009). Objective Bayesian model selection in Gaussian graphical models., Biometrika , 96 , 497-512. · Zbl 1170.62020
[7] Chi, J-T. et al. (2003). Endothelial cell diversity revealed by global expression profiling., PNAS , 100 , 10623-10628.
[8] Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion)., JRSS-B , 39 , 1-38. · Zbl 0364.62022
[9] Dudoit, S., Fridlyand, J. and Speed, T. (2002). Comparison of discrimination methods for the classification of tumors using expression data., J. Am. Stat. Assoc. , 97 , 77-87. · Zbl 1073.62576
[10] Eisen, M., Spellman, P., Brown, P. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns., PNAS , 95 , 14863-14868.
[11] Fan, J., Feng, Y. and Wu, Y. (2009). Network exploration via the adaptive LASSO and SCAD penalties., Ann. Appl. Stat. , 3 , 521-541. · Zbl 1166.62040
[12] Friedman, J., Hastie, T. and Tibshirani, R. (2007). Sparse inverse covariance estimation with the graphical lasso., Biostatistics , 0 , 1-10. · Zbl 1143.62076
[13] Fraley, C. and Raftery, A.E. (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis., Computer J. , 41 , 578-588. · Zbl 0920.68038
[14] Golub, T. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring., Science , 286 , 531-537.
[15] Guo, F.J., Levina, E., Michailidis, G. and Zhu, J. (2009). Pairwise Variable Selection for High-dimensional Model-based Clustering. To appear in, Biometrics . · Zbl 1203.62190
[16] Hoff, P.D. (2006). Model-based subspace clustering., Bayesian Analysis , 1 , 321-344. · Zbl 1331.62309
[17] Huang, J.Z., Liu, N., Pourahmadi, M. and Liu, L. (2006). Covariance selection and estimation via penalised normal likelihood., Biometrika , 93 , 85-98. · Zbl 1152.62346
[18] Jiang, A., Pan, W., Yu, S. and Robert, P.H. (2007). A practical question based on cross-platform microarray data normalization: are BOEC more like large vessel or microvascular endothelial cells or neither of them?, Journal of Bioinformatics and Computational Biology 5 875-893.
[19] Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C. and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models., Statist. Sci. , 20 , 388-400. · Zbl 1130.62408
[20] Kim, S., Tadesse, M.G. and Vannucci, M. (2006). Variable selection in clustering via Dirichlet process mixture models., Biometrika , 93 , 877-893. · Zbl 1436.62266
[21] Lau, J.W. and Green, P.J. (2007). Bayesian model based clustering procedure., Journal of Computational and Graphical Statistics , 16 , 526-558.
[22] Levina, E., Rothman, A. and Zhu, J. (2008). Sparse estimation of large covariance matrices via a nested lasso penalty., Annals of Applied Statistics , 2 , 245-263. · Zbl 1137.62338
[23] Liang, F., Mukherjee, S. and West, M. (2007). The use of unlabeled data in predictive modeling., Statistical Science , 22 , 189-205. · Zbl 1246.62157
[24] Liao, J.G. and Chin, K.V. (2007). Logistic regression for disease classification using microarray data: model selection in a large p and small n case., Bioinformatics , 23 , 1945-1951.
[25] Liu, J.S., Zhang, J.L., Palumbo, M.J. and Lawrence, C.E. (2003). Bayesian clustering with variable and transformation selection (with discussion)., Bayesian Statistics 7 , 249-275.
[26] McLachlan, G. (1987). On bootstrapping likelihood ratio test statistics for the number of components in a normal mixture., Applied Statistics 36 , 318-324.
[27] McLachlan, G.J., Bean, R.W. and Peel, D. (2002). A mixture model-based approach to the clustering of microarray expression data., Bioinformatics , 18 , 413-422.
[28] McLachlan, G.J. and Peel, D. (2002)., Finite Mixture Models. New York, John Wiley & Sons, Inc. · Zbl 0963.62061
[29] Muller, P., Erkanli, A. and West, M. (1996). Bayesian curve fitting using multivariate normal mixtures., Biometrika , 83 , 67-79. · Zbl 0865.62029
[30] Pan, W. (2006). Incorporating gene functions as priors in model-based clustering of microarray gene expression data., Bioinformatics , 22 , 795-801.
[31] Pan, W. and Shen, X. (2007). Penalized model-based clustering with application to variable selection., Journal of Machine Learning Research , 8 , 1145-1164. · Zbl 1222.68279
[32] Pan, W., Shen, X., Jiang, A. and Hebbel, R.P. (2006). Semi-supervised learning via penalized mixture model with application to microarray sample classification., Bioinformatics 22 , 2388-2395.
[33] Raftery, A.E. and Dean, N. (2006). Variable selection for model-based clustering., Journal of the American Statistical Association , 101 , 168-178. · Zbl 1118.62339
[34] Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods., JASA , 66 , 846-850.
[35] Rothman, A., Levina, E. and Zhu, J. (2009). Generalized thresholding of large covariance matrices., JASA , 104 , 177-186. · Zbl 1388.62170
[36] Richardson, S. and Green, P.J. (1997). On Bayesian analysis of mixture models (with Discussion)., J R Statist Soc B , 59 , 731-792. · Zbl 0891.62020
[37] Schwarz, G. (1978). Estimating the dimension of a model., Annals of Statistics , 6 , 461-464. · Zbl 0379.62005
[38] Scott, J.G. and Carvalho, C.M. (2009). Feature-inclusion stochastic search for Gaussian graphical models., J. Comp. Graph. Stat. , 17 , 790-808.
[39] Tadesse, M.G., Sha, N. and Vannucci, M. (2005). Bayesian variable selection in clustering high-dimensional data., Journal of the American Statistical Association , 100 , 602-617. · Zbl 1117.62433
[40] Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J. and Church, G.M. (1999). Systematic determination of genetic network architecture., Nat. Genet , 22 , 281-285.
[41] Thalamuthu, A., Mukhopadhyay, I., Zheng, X. and Tseng, G.C. (2006). Evaluation and comparison of gene clustering methods in microarray analysis., Bioinformatics , 22 , 2405-2412.
[42] Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M. (2004). Sharing clusters among related groups: Hierarchical Dirichlet processes., NIPS .
[43] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso., JRSS-B , 58 , 267-288. · Zbl 0850.62538
[44] Tseng, G.C. (2007). Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data, Bioinformatics , 23 , 2247-2255.
[45] Tseng, P. (1988). Coordinate ascent for maximizing nondifferentiable concave functions. Technical report LIDS-P; 1840, Massachusetts Institute of Technology, Laboratory for Information and Decision Systems.
[46] Tseng, P. (2001). Convergence of block coordinate descent method for nondifferentiable maximization., J. Opt. Theory and Applications , 109 , 474-494. · Zbl 1006.65062
[47] Wang, Y., Tetko, I.V., Hall, M.A., Frank, E., Facius, A., Mayer, K.F.X. and Mewes, H.W. (2005). Gene selection from microarray data for cancer classification - a machine learning approach., Comput Biol Chem , 29 , 37-46. · Zbl 1095.92040
[48] Wang, S. and Zhu, J. (2008). Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data., Biometrics , 64 , 440-448. · Zbl 1137.62041
[49] Wasserman, L. (2000). Asymptotic inference for mixture models using data-dependent priors., J R Statist Soc B , 62 , 159-180. · Zbl 0976.62028
[50] Xie, B., Pan, W. and Shen, X. (2008a). Variable selection in penalized model-based clustering via regularization on grouped parameters., Biometrics , 64 , 921-930. · Zbl 1146.62101
[51] Xie, B., Pan, W. and Shen, X. (2008b). Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables., Electron. J. Statist. , 2 , 168-212. · Zbl 1135.62055
[52] Xie, B., Pan, W. and Shen, X. (2009). Penalized mixtures of factor analyzers with application to clustering high dimensional microarray data. To appear, Bioinformatics . Available at http://www.biostat.umn.edu./rrs.php as Research Report 2009-019, Division of Biostatistics, University of Minnesota.
[53] Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model., Biometrika , 94 , 19-35. · Zbl 1142.62408
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.