×

Spectral clustering via sparse graph structure learning with application to proteomic signaling networks in cancer. (English) Zbl 1507.62007

Summary: Clustering methods for multivariate data exploiting the underlying geometry of the graphical structure between variables are presented. As opposed to standard approaches for graph clustering that assume known graph structures, the edge structure of the unknown graph is first estimated using sparse regression based approaches for sparse graph structure learning. Subsequently, graph clustering on the lower dimensional projections of the graph is performed based on Laplacian embeddings using a penalized k-means approach, motivated by Dirichlet process mixture models in Bayesian nonparametrics. In contrast to standard algorithmic approaches for known graphs, the proposed method allows estimation and inference for both graph structure learning and clustering. More importantly, the arguments for Laplacian embeddings as suitable projections for graph clustering are formalized by providing theoretical support for the consistency of the eigenspace of the estimated graph Laplacians. Fast computational algorithms are proposed to scale the method to large number of nodes. Extensive simulations are presented to compare the clustering performance with standard methods. The methods are applied to a novel pan-cancer proteomic data set, and protein networks and clusters are evaluated across multiple different cancer types.

MSC:

62-08 Computational methods for problems pertaining to statistics
62H12 Estimation in multivariate analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

glasso
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Akbani, R.; Ng, P. K.S.; Werner, H. M.; Shahmoradgoli, M.; Zhang, F.; Ju, Z.; Liu, W.; Yang, J. Y.; Yoshihara, K.; Li, J., A pan-cancer proteomic perspective on The Cancer Genome Atlas, Nature Commun., 5, 3887, 1-14 (2014)
[2] Arenas, A.; Fernandez, A.; Gomez, S., Analysis of the structure of complex networks at different resolution levels, New J. Phys., 10, 5, 053039 (2008)
[3] Baladandayuthapani, V.; Talluri, R.; Ji, Y.; Coombes, K. R.; Lu, Y.; Hennessy, B. T.; Davies, M. A.; Mallick, B. K., Bayesian sparse graphical models for classification with application to protein expression data, Ann. Appl. Stat., 8, 3, 1443-1468 (2014) · Zbl 1303.62057
[4] Banerjee, O.; El Ghaoui, L.; d’Aspremont, A., Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data, J. Mach. Learn. Res., 9, 485-516 (2008) · Zbl 1225.68149
[5] Belkin, M.; Niyogi, P., Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput., 15, 6, 1373-1396 (2003) · Zbl 1085.68119
[6] Betzel, R. F.; Griffa, A.; Avena-Koenigsberger, A.; Goñi, J.; Thiran, J. P.; Hagmann, P.; Sporns, O., Multi-scale community organization of the human structural connectome and its relationship with resting-state functional connectivity, Netw. Sci., 1, 3, 353-373 (2013)
[7] Bickel, P. J.; Levina, E., Covariance regularization by thresholding, Ann. Statist., 2577-2604 (2008) · Zbl 1196.62062
[8] Bickel, P. J.; Levina, E., Regularized estimation of large covariance matrices, Ann. Statist., 199-227 (2008) · Zbl 1132.62040
[9] Brennan, C. W.; Verhaak, R. G.; McKenna, A.; Campos, B.; Noushmehr, H.; Salama, S. R.; Zheng, S.; Chakravarty, D.; Sanborn, J. Z.; Berman, S. H., The somatic genomic landscape of glioblastoma, Cell, 155, 2, 462-477 (2013)
[10] Cai, T.; Liu, W.; Luo, X., A constrained \(\ell_1\)-minimization approach to sparse precision matrix estimation, J. Amer. Statist. Assoc., 106, 494, 594-607 (2011) · Zbl 1232.62087
[11] Coifman, R. R.; Lafon, S., Diffusion maps, Appl. Comput. Harmon. Anal., 21, 1, 5-30 (2006) · Zbl 1095.68094
[12] Creasman, W. T., Prognostic significance of hormone receptors in endometrial cancer, Cancer, 71, S4, 1467-1470 (1993)
[13] Davies, M.; Hennessy, B.; Mills, G. B., Point mutations of protein kinases and individualised cancer therapy, Expert Opin. Pharmacother., 7, 16, 2243-2261 (2006)
[14] Davis, C.; Kahan, W. M., The rotation of eigenvectors by a perturbation. III, SIAM J. Numer. Anal., 7, 1, 1-46 (1970) · Zbl 0198.47201
[15] Donath, W. E.; Hoffman, A. J., Lower bounds for the partitioning of graphs, IBM J. Res. Dev., 17, 5, 420-425 (1973) · Zbl 0259.05112
[16] Fan, J.; Feng, Y.; Wu, Y., Network exploration via the adaptive LASSO and SCAD penalties, Ann. Appl. Stat., 3, 2, 521 (2009) · Zbl 1166.62040
[17] Fiedler, M., Algebraic connectivity of graphs, Czechoslovak Math. J., 23, 2, 298-305 (1973) · Zbl 0265.05119
[18] Friedman, J.; Hastie, T.; Tibshirani, R., Sparse inverse covariance estimation with the graphical lasso, Biostatistics, 9, 3, 432-441 (2008) · Zbl 1143.62076
[19] Guindani, M.; Johnson, W. O., More nonparametric Bayesian inference in applications, Stat. Methods Appl., 27, 2, 239-251 (2018) · Zbl 1396.62060
[20] Guo, J.; Levina, E.; Michailidis, G.; Zhu, J., Joint estimation of multiple graphical models, Biometrika, 98, 1, 1-15 (2011) · Zbl 1214.62058
[21] Halaban, R.; Zhang, W.; Bacchiocchi, A.; Cheng, E.; Parisi, F.; Ariyan, S.; Krauthammer, M.; McCusker, J.; Kluger, Y.; Sznol, M., PLX4032, a selective BRAF V600E kinase inhibitor, activates the ERK pathway and enhances cell migration and proliferation of BRAF WT melanoma cells, Pigment Cell Melanoma Res., 23, 2, 190-200 (2010)
[22] Hennessy, B. T.; Lu, Y.; Gonzalez-Angulo, A. M.; Carey, M. S.; Myhre, S.; Ju, Z.; Davies, M. A.; Liu, W.; Coombes, K.; Meric-Bernstam, F.; Bedrosian, I.; McGahren, M.; Agarwal, R.; Zhang, F.; Overgaard, J.; Alsner, J.; Neve, R. M.; Kuo, W. L.; Gray, J. W.; Borresen-Dale, A. L.; Mills, G. B., A technical assessment of the utility of reverse phase protein arrays for the study of the functional proteome in non-microdissected human breast cancers, Clin. Proteom., 6, 4, 129-151 (2010)
[23] Huang, J. Z.; Liu, N.; Pourahmadi, M.; Liu, L., Covariance matrix selection and estimation via penalised normal likelihood, Biometrika, 93, 1, 85-98 (2006) · Zbl 1152.62346
[24] Karoui, N. E., Operator norm consistent estimation of large-dimensional sparse covariance matrices, Ann. Statist., 2717-2756 (2008) · Zbl 1196.62064
[25] Kulis, B., Jordan, M.I., Revisiting k-means: New algorithms via Bayesian nonparametrics, in: Proceedings of the 29th International Conference on Machine Learning, ICML-12, 2012, pp. 513-520.; Kulis, B., Jordan, M.I., Revisiting k-means: New algorithms via Bayesian nonparametrics, in: Proceedings of the 29th International Conference on Machine Learning, ICML-12, 2012, pp. 513-520.
[26] Lambiotte, R., Multi-scale modularity in complex networks, (2010 Proceedings of the 8th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks, WiOpt (2010), IEEE), 546-553
[27] Lauritzen, S., Graphical Models (1996), Clarendon Press, Oxford · Zbl 0907.62001
[28] Ledoit, O.; Wolf, M., A well-conditioned estimator for large-dimensional covariance matrices, J. Multivar. Anal., 88, 2, 365-411 (2004) · Zbl 1032.62050
[29] Liang, H.; Cheung, L. W.; Li, J.; Ju, Z.; Yu, S.; Stemke-Hale, K.; Dogruluk, T.; Lu, Y.; Liu, X.; Gu, C.; Guo, W.; Scherer, S. E.; Carter, H.; Westin, S. N.; Dyer, M. D.; Verhaak, R. G.; Zhang, F.; Karchin, R.; Liu, C. G.; Lu, K. H.; Broaddus, R. R.; Scott, K. L.; Hennessy, B. T.; Mills, G. B., Whole-exome sequencing combined with functional genomics reveals novel candidate driver cancer genes in endometrial cancer, Genome Res., 22, 11, 2120-2129 (2012)
[30] Majid, S.; Dar, A. A.; Ahmad, A. E.; Hirata, H.; Kawakami, K.; Shahryari, V.; Saini, S.; Tanaka, Y.; Dahiya, A. V.; Khatri, G., BTG3 tumor suppressor gene promoter demethylation, histone modification and cell cycle arrest by genistein in renal cancer, Carcinogenesis, 30, 4, 662-670 (2009)
[31] Meinshausen, N.; Bühlmann, P., High-dimensional graphs and variable selection with the lasso, Ann. Statist., 34, 3, 1436-1462 (2006) · Zbl 1113.62082
[32] Miyamoto, H.; Izumi, K.; Yao, J. L.; Li, Y.; Yang, Q.; McMahon, L. A.; Gonzalez-Roibon, N.; Hicks, D. G.; Tacha, D.; Netto, G. J., GATA binding protein 3 is down-regulated in bladder cancer yet strong expression is an independent predictor of poor prognosis in invasive tumor, Hum. Pathol., 43, 11, 2033-2040 (2012)
[33] Müeller, P.; Quintana, F. A.; Page, G., Nonparametric Bayesian inference in applications, Stat. Methods Appl., 27, 2, 175-206 (2018) · Zbl 1396.62067
[34] Network, C. G.A., Comprehensive molecular portraits of human breast tumours, Nature, 490, 7418, 61-70 (2012)
[35] Network, C. G.A. R., Comprehensive molecular characterization of urothelial bladder carcinoma, Nature, 507, 7492, 315-322 (2014)
[36] Newton, M. A.; Noueiry, A.; Sarkar, D.; Ahlquist, P., Detecting differential gene expression with a semiparametric hierarchical mixture method, Biostatistics, 5, 2, 155-176 (2004) · Zbl 1096.62124
[37] Ng, A. Y.; Jordan, M. I.; Weiss, Y., On spectral clustering: Analysis and an algorithm, Adv. Neural Inf. Process. Syst., 2, 849-856 (2002)
[38] Nishizuka, S.; Charboneau, L.; Young, L.; Major, S.; Reinhold, W. C.; Waltham, M.; Kouros-Mehr, H.; Bussey, K. J.; Lee, J. K.; Espina, V.; Munson, P. J.; Petricoin, E.; Liotta, L. A.; Weinstein, J. N., Proteomic profiling of the NCI-60 cancer cell lines using new high-density reverse-phase lysate microarrays, Proc. Natl. Acad. Sci. USA, 100, 24, 14229-14234 (2003)
[39] Paweletz, C.; Charboneau, L.; Bichsel, V.; Simone, N.; Chen, T.; Gillespie, J.; Emmert-Buck, M.; Roth, M.; Petricoin, E.; Liotta, L., Reverse phase protein microarrays which capture disease progression show activation of pro-survival pathways at the cancer invasion front, Oncogene, 20, 16, 1981-1989 (2001)
[40] Peng, J.; Wang, P.; Zhou, N.; Zhu, J., Partial correlation estimation by joint sparse regression models, J. Amer. Statist. Assoc., 104, 486, 735-746 (2009) · Zbl 1388.62046
[41] Perou, C. M.; Sørlie, T.; Eisen, M. B.; van de Rijn, M.; Jeffrey, S. S.; Rees, C. A.; Pollack, J. R.; Ross, D. T.; Johnsen, H.; Akslen, L. A., Molecular portraits of human breast tumours, Nature, 406, 6797, 747-752 (2000)
[42] Reichardt, J.; Bornholdt, S., Statistical mechanics of community detection, Phys. Rev. E, 74, 1, 016110 (2006)
[43] Richiardi, J.; Achard, S.; Bunke, H.; Van De Ville, D., Machine learning with brain graphs: Predictive modeling approaches for functional imaging in systems neuroscience, IEEE Signal Process. Mag., 30, 3, 58-70 (2013)
[44] Rohe, K.; Chatterjee, S.; Yu, B., Spectral clustering and the high-dimensional stochastic blockmodel, Ann. Statist., 39, 4, 1878-1915 (2011) · Zbl 1227.62042
[45] Rothman, A.; Bickel, P.; Levina, E.; Zhu, J., Sparse permutation invariant covariance estimation, Electron. J. Stat., 2, 494-515 (2008) · Zbl 1320.62135
[46] Rothman, A. J.; Levina, E.; Zhu, J., Generalized thresholding of large covariance matrices, J. Amer. Statist. Assoc., 104, 485, 177-186 (2009) · Zbl 1388.62170
[47] Schaub, M. T.; Delvenne, J. C.; Yaliraki, S. N.; Barahona, M., Markov dynamics as a zooming lens for multiscale community detection: non clique-like communities and the field-of-view limit, PLoS One, 7, 2, Article e32210 pp. (2012)
[48] Schiebinger, G.; Wainwright, M. J.; Yu, B., The geometry of kernelized spectral clustering, Ann. Statist., 43, 2, 819-846 (2015) · Zbl 1312.62082
[49] Sethuraman, J., A constructive definition of Dirichlet priors, Statist. Sinica, 4, 639-650 (1994) · Zbl 0823.62007
[50] Sheehan, K. M.; Calvert, V. S.; Kay, E. W.; Lu, Y.; Fishman, D.; Espina, V.; Aquino, J.; Speer, R.; Araujo, R.; Mills, G. B.; Liotta, L. A.; Petricoin, E. F.; Wulfkuhle, J. D., Use of reverse phase protein microarrays and reference standard development for molecular network analysis of metastatic ovarian carcinoma, Mol. Cell. Proteom., 4, 4, 346-355 (2005)
[51] Shi, J.; Malik, J., Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 22, 8, 888-905 (2000)
[52] Sorlie, T.; Perou, C. M.; Tibshirani, R.; Aas, T.; Geisler, S.; Johnsen, H.; Hastie, T.; Eisen, M.; Van de Rijn, M.; Jeffrey, S., Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications, Proc. Natl. Acad. Sci. USA, 98, 19, 10869-10874 (2001)
[53] Spurrier, B.; Ramalingam, S.; Nishizuka, S., Reverse-phase protein lysate microarrays for cell signaling analysis, Nat. Protoc., 3, 11, 1796-1808 (2008)
[54] Tibes, R.; Qiu, Y.; Lu, Y.; Hennessy, B.; Andreeff, M.; Mills, G. B.; Kornblau, S. M., Reverse phase protein array: validation of a novel proteomic technology and utility for analysis of primary leukemia specimens and hematopoietic stem cells, Mol. Cancer Therapeutics, 5, 10, 2512-2521 (2006)
[55] Tremblay, N.; Borgnat, P., Graph wavelets for multiscale community mining, IEEE Trans. Signal Process., 62, 20, 5227-5239 (2014) · Zbl 1394.94602
[56] Verhaak, R. G.; Hoadley, K. A.; Purdom, E.; Wang, V.; Qi, Y.; Wilkerson, M. D.; Miller, C. R.; Ding, L.; Golub, T.; Mesirov, J. P., Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1, Cancer Cell, 17, 1, 98-110 (2010)
[57] Von Luxburg, U., A tutorial on spectral clustering, Stat. Comput., 17, 4, 395-416 (2007)
[58] Wang, H., Bayesian graphical lasso models and efficient posterior computation, Bayesian Anal., 7, 4, 867-886 (2012) · Zbl 1330.62041
[59] Witten, D. M.; Friedman, J. H.; Simon, N., New insights and faster computations for the graphical lasso, J. Comput. Graph. Statist., 20, 4, 892-900 (2011)
[60] Yang, J. Y.; Yoshihara, K.; Tanaka, K.; Hatae, M.; Masuzaki, H.; Itamochi, H.; Takano, M.; Ushijima, K.; Tanyi, J. L.; Coukos, G.; Lu, Y.; Mills, G. B.; Verhaak, R. G., Predicting time to ovarian carcinoma recurrence using protein markers, J. Clin. Investig., 123, 9, 3740-3750 (2013)
[61] Yu, Y.; Wang, T.; Samworth, R. J., A useful variant of the Davis-Kahan theorem for statisticians, Biometrika, 102, 2, 315-323 (2015) · Zbl 1452.15010
[62] Yuan, M.; Lin, Y., Model selection and estimation in the Gaussian graphical model, Biometrika, 94, 1, 19-35 (2007) · Zbl 1142.62408
[63] Zhang, L.; Wei, Q.; Mao, L.; Liu, W.; Mills, G. B.; Coombes, K., Serial dilution curve: a new method for analysis of reverse phase protein array data, Bioinformatics, 25, 5, 650-654 (2009)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.