Generalized co-clustering analysis via regularized alternating least squares. (English) Zbl 1510.62056

Summary: Biclustering is an important exploratory analysis tool that simultaneously clusters rows (e.g., samples) and columns (e.g., variables) of a data matrix. Checkerboard-like biclusters reveal intrinsic associations between rows and columns. However, most existing methods rely on Gaussian assumptions and only apply to matrix data. In practice, non-Gaussian and/or multi-way tensor data are frequently encountered. A new CO-clustering method via Regularized Alternating Least Squares (CORALS) is proposed, which generalizes biclustering to non-Gaussian data and multi-way tensor arrays. Non-Gaussian data are modeled with single-parameter exponential family distributions and co-clusters are identified in the natural parameter space via sparse CANDECOMP/PARAFAC tensor decomposition. A regularized alternating (iteratively reweighted) least squares algorithm is devised for model fitting and a deflation procedure is exploited to automatically determine the number of co-clusters. Comprehensive simulation studies and three real data examples demonstrate the efficacy of the proposed method. The data and code are publicly available at https://github.com/reagan0323/CORALS.
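The alternating, sparsity-inducing updates and the deflation step mentioned in the summary can be illustrated in the simplest setting: a Gaussian two-way matrix, where the natural parameter is the mean and no reweighting is needed. The sketch below is not the CORALS code from the repository above. It is a minimal stand-in in the spirit of sparse-SVD biclustering, with the assumptions stated plainly: the function names (soft_threshold, sparse_rank_one, deflate) are hypothetical, the penalties lam_u and lam_v are fixed by hand rather than tuned from the data, and the exponential-family extension (iteratively reweighted least squares on a working response) and the multi-way tensor case are omitted.

```python
import numpy as np

def soft_threshold(x, lam):
    """Elementwise soft-thresholding (proximal map of the lasso penalty)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_rank_one(X, lam_u, lam_v, n_iter=100, tol=1e-8):
    """Alternate lasso-penalized least squares updates for X ~ d * u v^T.

    Returns (d, u, v); d is 0.0 when thresholding removes every entry.
    """
    m, n = X.shape
    # Start from the leading singular pair of X.
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    u, v = U[:, 0], Vt[0]
    for _ in range(n_iter):
        u_old = u
        # Update v with u fixed, then rescale to unit length.
        v = soft_threshold(X.T @ u, lam_v)
        if not v.any():
            return 0.0, np.zeros(m), np.zeros(n)
        v = v / np.linalg.norm(v)
        # Update u with v fixed, then rescale to unit length.
        u = soft_threshold(X @ v, lam_u)
        if not u.any():
            return 0.0, np.zeros(m), np.zeros(n)
        u = u / np.linalg.norm(u)
        if np.linalg.norm(u - u_old) < tol:
            break
    d = float(u @ X @ v)
    return d, u, v

def deflate(X, lam_u, lam_v, max_layers=10):
    """Extract sparse rank-one layers one at a time from the residual.

    Stops when a layer is thresholded to zero, so the number of layers
    found is determined by the data and the penalties, not fixed up front.
    """
    layers = []
    R = X.copy()
    for _ in range(max_layers):
        d, u, v = sparse_rank_one(R, lam_u, lam_v)
        if d == 0.0:
            break
        layers.append((d, u, v))
        R = R - d * np.outer(u, v)  # remove the fitted layer
    return layers

# Toy data (an assumption for illustration): one 5 x 4 block of mean 3
# planted in a 20 x 15 matrix of low-variance Gaussian noise.
rng = np.random.default_rng(0)
X = 0.05 * rng.standard_normal((20, 15))
X[:5, :4] += 3.0

layers = deflate(X, lam_u=1.0, lam_v=1.0)
d, u, v = layers[0]
print(len(layers), np.flatnonzero(u), np.flatnonzero(v))
```

The nonzero entries of u and v mark the rows and columns assigned to a co-cluster. In this toy setting the deflation loop stops by itself once the residual contains only noise, which mirrors, in a much cruder form, the idea of letting the procedure decide how many co-clusters to report.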

MSC:

62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H25 Factor analysis and principal components; correspondence analysis
62J12 Generalized linear models (logistic models)
62P10 Applications of statistics to biology and medical sciences; meta analysis
