Regularized \(k\)-means clustering of high-dimensional data and its asymptotic consistency. (English) Zbl 1335.62109

Summary: \(k\)-means clustering is a widely used tool for cluster analysis due to its conceptual simplicity and computational efficiency. However, its performance can be distorted when clustering high-dimensional data where the number of variables becomes relatively large and many of them may contain no information about the clustering structure. This article proposes a high-dimensional cluster analysis method via regularized \(k\)-means clustering, which can simultaneously cluster similar observations and eliminate redundant variables. The key idea is to formulate the \(k\)-means clustering in a form of regularization, with an adaptive group lasso penalty term on cluster centers. In order to optimally balance the trade-off between the clustering model fitting and sparsity, a selection criterion based on clustering stability is developed. The asymptotic estimation and selection consistency of the regularized \(k\)-means clustering with diverging dimension is established. The effectiveness of the regularized \(k\)-means clustering is also demonstrated through a variety of numerical experiments as well as applications to two gene microarray examples. The regularized clustering framework can also be extended to the general model-based clustering.


62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T10 Pattern recognition, speech recognition
62J07 Ridge regression; shrinkage estimators (Lasso)


boost; sparcl
Full Text: DOI Euclid


[1] Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J., Sabet, H., Tran, T., Yu, X., Powell, J., Yang, L., Marti, G., Moore, T., Hudson, J., Lu, L., Lewis, D., Tibshirani, R., Sherlock, R., Chan, W., Greiner, T., Weisenburger, D., Armitage, Warnke, R., Levy, R., Wilson, W., Grever, M., Byrd, J., Botstein, D., Brown, P., and Staudt, L. (2000), “Different Types of Diffuse Large B-cell Lymphoma Identified by Gene Expression Profiling,”, Nature , 403 , 503-511.
[2] Bansal, N., Blum, A., and Chawla, S. (2004), “Correlation Clustering,”, Machine Learning , 56 , 86-113. · Zbl 1089.68085
[3] Ben-Hur, A., Elisseeff, A., and Guyon, I. (2002), “A Stability Based Method for Discovering Structure in Clustered Data,”, Pacific Symposium on Biocomputing , 6-17.
[4] Breiman, L. (1995), “Better Subset Regression Using the Nonnegative Garrote,”, Technometrics , 37 , 373-384. · Zbl 0862.62059
[5] Changdra, B., Shanker, S., and Mishra, S. (2006), “A New Approach: Interrelated Two-way Clustering of Gene Expression Data,”, Statistical Methodology , 3 , 93-102. · Zbl 1248.62208
[6] Dettling, M. (2004), “BagBoosting for Tumor Classification with Gene Expression Data,”, Bioinformatics , 20 , 3583-3593.
[7] Donoho, D., and Jin, J. (2008), “Higher Criticism Thresholding: Optimal Feature Selection When Useful Features are Rare and Weak,”, The Proceedings of the National Academy of Sciences of the United States of America , 105 , 14790-14795. · Zbl 1357.62212
[8] Dudoit, S., Fridlyand, J., and Speed, T. (2002), “Comparision of Discrimination Methods for the Classification of Tumor Using Gene Expression Data,”, Journal of the American Statistical Association , 97 , 77-87. · Zbl 1073.62576
[9] Fan, J. and Peng, H. (2004), “Nonconcave penalized likelihood with a diverging number of parameters,”, The Annals of Statistics , 32 , 928-961. · Zbl 1092.62031
[10] Fang, Y. and Wang, J. (2011), “Penalized Cluster Analysis with Applications to Family Data,”, Computational Statistics and Data Analysis , 55 , 2128-2136. · Zbl 1328.62384
[11] Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, A., Bloomfield, C., and Lander, E. (1999), “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,”, Science , 286 , 531-537.
[12] Guo, J., Levina, E., Michailidis, G., and Zhu, J. (2010), “Pairwise Variable Selection for High-dimensional Model-based Clustering,”, Biometrics , 66 , 793-804. · Zbl 1203.62190
[13] Hall, P., Marron, J.S., and Neeman, A. (2005), “Geometric Representation of High Dimension, Low Sample Size Data,”, Journal of the Royal Statistical Society, Series B , 67 , 427-444. · Zbl 1069.62097
[14] Lloyd, S.P. (1982), “Least Squares Quantization in PCM,”, IEEE Transactions on Information Theory , 28 , 129-137. · Zbl 0504.94015
[15] MacQueen, J. (1967), “Some Methods for Clasification and Analysis of Multivariate Observations,”, In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability , 281-297. · Zbl 0214.46201
[16] Pan, W., and Shen, X. (2007), “Penalized Model-Based Clustering with Application to Variable Selection,”, Journal of Machine Learning Research , 8 , 1145-1164. · Zbl 1222.68279
[17] Pollard, D. (1981), “Strong Consistency of K-means Clustering,”, The Annals of Statistics , 9 , 135-140. · Zbl 0451.62048
[18] Pollard, D. (1982), “A Central Limit Theorem for K-means Clustering,”, The Annals of Probability , 10 , 919-926. · Zbl 0841.22013
[19] Raftery, A., and Dean, N. (2006), “Variable Selection for Model-based Clustering,”, Journal of the American Statistical Association , 101 , 168-178. · Zbl 1118.62339
[20] Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,”, Journal of the Royal Statistical Society, Series B , 58 , 267-288. · Zbl 0850.62538
[21] Tibshirani, R., Walther, G., and Hastie, T. (2001), “Estimating the Number of Clusters in a Data Set via the Gap Statistic,”, Journal of the Royal Statistical Society, Series B , 63 , 411-423. · Zbl 0979.62046
[22] Wang, H., and Leng, C. (2008), “A Note on Adaptive Group Lasso,”, Computational Statistics and Data Analysis , 52 , 5277-5286. · Zbl 1452.62524
[23] Wang, J. (2010), “Consistent Selection of the Number of Clusters via Cross Validation,”, Biometrika , 97 , 893-904. · Zbl 1204.62104
[24] Wang, S., and Zhu, J. (2008), “Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data,”, Biometrics , 64 , 440-448. · Zbl 1137.62041
[25] Witten, D., and Tibshirani, R. (2010), “A Framework for Feature Selection in Clustering,”, Journal of the American Statistical Association , 105 , 713-726. · Zbl 0819.65141
[26] Xie, B., Pan, W., and Shen, X. (2008), “Variable Selection in Penalized Model-Based Clustering Via Regularization on Grouped Parameters,”, Biometrics , 64 , 921-930. · Zbl 1146.62101
[27] Yuan, M. and Lin, Y. (2006), “Model Selection and Estimation in Regression with Grouped Variables,”, Journal of the Royal Statistical Society, Series B , 68 , 49-67. · Zbl 1141.62030
[28] Zou, H. (2006), “The Adaptive Lasso and Its Oracle Properties,”, Journal of the American Statistical Association , 101 , 1418-1429. · Zbl 1171.62326
[29] Zou, H., and Zhang, H. (2009), “On the Adaptive Elastic-net with A Diverging Number of Parameters,”, The Annals of Statistics , 37 , 1733-1751. · Zbl 1168.62064
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.