Convex clustering for binary data. (English) Zbl 1459.62109

Summary: We present a new clustering algorithm for multivariate binary data. The new algorithm is based on the convex relaxation of hierarchical clustering, which is achieved by considering the binomial likelihood as a natural distribution for binary data and by formulating convex clustering using a pairwise penalty on prototypes of clusters. Under convex clustering, we show that the typical \(\ell_1\) pairwise fused penalty results in ineffective cluster formation. In an attempt to promote the clustering performance and select the relevant clustering variables, we propose the penalized maximum likelihood estimation with an \(\ell_2\) fused penalty on the fusion parameters and an \(\ell_1\) penalty on the loading matrix. We provide an efficient algorithm to solve the optimization by using majorization-minimization algorithm and alternative direction method of multipliers. Numerical studies confirmed its good performance and real data analysis demonstrates the practical usefulness of the proposed method.


62H30 Classification and discrimination; cluster analysis (statistical aspects)


UCI-ml; sparcl
Full Text: DOI


[1] Beck, A.; Teboulle, M., A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J Imaging Sci, 2, 183-202 (2009) · Zbl 1175.94009
[2] Böhning, D., Multinomial logistic regression algorithm, Ann Inst Stat Math, 44, 197-200 (1992) · Zbl 0763.62038
[3] Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J., Distributed optimization and statistical learning via the alternating direction method of multipliers, Found Trend \({}^{\textregistered }\) Mach Learn, 3, 1-122 (2011) · Zbl 1229.90122
[4] Chi, EC; Lange, K., Splitting methods for convex clustering, J Comput Graph Stat, 24, 994-1013 (2015)
[5] Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R., Least angle regression, Ann Stat, 32, 407-499 (2004) · Zbl 1091.62054
[6] Finch, H., Comparison of distance measures in cluster analysis with dichotomous data, J Data Sci, 3, 85-100 (2005)
[7] Goldstein, T.; O’Donoghue, B.; Setzer, S.; Baraniuk, R., Fast alternating direction optimization methods, SIAM J Imaging Sci, 7, 1588-1623 (2014) · Zbl 1314.49019
[8] Golub GH, Van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore · Zbl 0865.65009
[9] Hallac D, Leskovec J, Boyd S (2015) Network lasso: clustering and optimization in large graphs. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp 387-396
[10] Hocking TD, Joullin A, Bach F, Vert J-P (2011) Cluterpath: an algorithm for clustering using convex fusion penalties. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 754-762
[11] Jolliffe IT (2012) Principal component analysis, 2nd edn. Springer, New York · Zbl 1011.62064
[12] Lange K (2004) Optimization. Springer, New York
[13] Lee, S.; Huang, JZ, A biclustering algorithm for binary matrices based on penalized Bernoulli likelihood, Stat Comput, 24, 429-441 (2014) · Zbl 1325.62013
[14] Lee, S.; Huang, JZ; Hu, J., Sparse logistic principal component analysis for binary data, Ann Appl Stat, 4, 1579-1601 (2010) · Zbl 1202.62084
[15] Li, T., A unified view on clustering binary data, Mach Learn, 62, 199-215 (2006)
[16] Lichman M (2013) UCI machine learning repository [http://archive.ics.uci.edu/ml]. University of California, School of Information and Computer Science, Irvine
[17] Pan, W.; Shen, X.; Liu, B., Cluster analysis: unsupervised learning via supervised learning with a non-convex penalty, J Mach Learn Res, 14, 1865-1889 (2013) · Zbl 1317.68179
[18] Polson, NG; Scott, JG; Willard, BT, Proximal algorithms in statistics and machine learning, Stat Sci, 30, 559-581 (2015) · Zbl 1426.62213
[19] Rand, WM, Objective criteria for the evaluation of clustering methods, J Am Stat Assoc, 66, 846-850 (1971)
[20] Shen, X.; Huang, HC, Grouping pursuit through a regularization solution surface, J Am Stat Assoc, 105, 727-739 (2010) · Zbl 1392.62192
[21] Shen, X.; Pan, W., Simultaneous supervised clustering and feature selection over a graph, Biometrika, 99, 899-914 (2012) · Zbl 1452.62467
[22] Tibshirani, R.; Walther, G.; Hastie, T., Estimating the number of clusters in a data set via the gap statistic, J R Stat Soc Ser B, 63, 411-423 (2001) · Zbl 0979.62046
[23] Turlach, BA; Venables, W., Simultaneous variable selection, Technometrics, 47, 349-363 (2005)
[24] Witten, DM; Tibshirani, R., A framework for feature selection in clustering, J Am Stat Assoc, 105, 713-726 (2010) · Zbl 1392.62194
[25] Wu, C.; Kwon, S.; Shen, X.; Pan, W., A new algorithm and theory for penalized regression-based clustering, J Mach Learn Res, 17, 1-25 (2016) · Zbl 1392.68371
[26] Yang, H.; Liu, X., Studies on the clustering algorithm for analyzing gene expression data with a bidirectional penalty, J Comput Biol, 24, 689-698 (2017)
[27] Yang Y, Guan X, You J (2002) CLOPE: a fast and effective clustering algorithm for transactional data. In: SIGKDD ’02 Edmonton, Alberta, Canada, pp 682-687
[28] Zhang Z, Li T, Ding C, Zhang X (2007) Binary matrix factorization with applications. In: IEEE international conference on data mining, pp 391-400
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.