×

The reduced PC-algorithm: improved causal structure learning in large random networks. (English) Zbl 1446.62256

Summary: We consider the task of estimating a high-dimensional directed acyclic graph, given observations from a linear structural equation model with arbitrary noise distribution. By exploiting properties of common random graphs, we develop a new algorithm that requires conditioning only on small sets of variables. The proposed algorithm, which is essentially a modified version of the PC-Algorithm, offers significant gains in both computational complexity and estimation accuracy. In particular, it results in more efficient and accurate estimation in large networks containing hub nodes, which are common in biological systems. We prove the consistency of the proposed algorithm, and show that it also requires a less stringent faithfulness assumption than the PC-Algorithm. Simulations in low and high-dimensional settings are used to illustrate these findings. An application to gene expression data suggests that the proposed algorithm can identify a greater number of clinically relevant genes than current methods.

MSC:

62M40 Random fields; image analysis
62H22 Probabilistic graphical models
60G60 Random fields
05C90 Applications of graph theory
62-08 Computational methods for problems pertaining to statistics
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

BioGRID; PenPC; bnlearn
PDFBibTeX XMLCite
Full Text: arXiv Link

References:

[1] Silvia Acid and Luis M de Campos. Approximations of causal networks by polytrees: an empirical study. InAdvances in Intelligent Computing, pages 149-158. Springer, 1994.
[2] Animashree Anandkumar, Vincent YF Tan, Furong Huang, and Alan S Willsky. Highdimensional Gaussian graphical model selection: walk summability and local separation criterion.Journal of Machine Learning Research, 13(1):2293-2337, 2012. · Zbl 1433.68322
[3] Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers.Nature, 489(7417):519-525, 2012.
[4] Marc RJ Carlson, Bin Zhang, Zixing Fang, Paul S Mischel, Steve Horvath, and Stanley F Nelson. Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks.BMC Genomics, 7(1):40, 2006.
[5] Hao Chen and Burt M Sharp.Content-rich biological network constructed by mining pubmed abstracts.BMC Bioinformatics, 5(1):147, 2004.
[6] David Maxwell Chickering. Learning Bayesian networks is NP-complete. InLearning From Data, pages 121-130. Springer, 1996.
[7] Richard Durrett.Random Graph Dynamics, volume 200. Cambridge University Press, 2007. · Zbl 1116.05001
[8] Daniel Eaton and Kevin Murphy. Bayesian structure learning using dynamic programming and MCMC.arXiv preprint arXiv:1206.5247, 2012.
[9] Rina Foygel and Mathias Drton.Extended Bayesian information criteria for Gaussian graphical models. InNeural Information Processing Systems (NeurIPS), pages 604-612, 2010.
[10] Nir Friedman, Michal Linial, and Iftach Nachman. Using Bayesian networks to analyze expression data.Journal of Computational Biology, 7:601-620, 2000.
[11] Min Jin Ha, Wei Sun, and Jichun Xie. PenPC: a two-step approach to estimate the skeletons of high-dimensional directed acyclic graphs.Biometrics, 72(1):146-155, 2016. · Zbl 1393.62065
[12] S Horvath, B Zhang, M Carlson, KV Lu, S Zhu, RM Felciano, MF Laurance, W Zhao, S Qi, Z Chen, et al. Analysis of oncogenic signaling networks in glioblastoma identifies aspm as a molecular target.Proceedings of the National Academy of Sciences, 103(46): 17402-17407, 2006.
[13] Juan F. Huete and Luis M. Campos. Learning causal polytrees. InSymbolic and Quantitative Approaches to Reasoning and Uncertainty, pages 180-185. Springer, 1993. doi: 10.1007/BFb0028199.
[14] Trey Ideker and Nevan J Krogan. Differential network biology.Molecular Systems Biology, 8(1):565, 2012.
[15] Ronald Jansen, Haiyuan Yu, Dov Greenbaum, Yuval Kluger, Nevan J. Krogan, Sambath Chung, Andrew Emili, Michael Snyder, Jack F. Greenblatt, and Mark Gerstein. A Bayesian networks approach for predicting protein-protein interactions from genomic data.Science, 302(5644):449-453, 2003. ISSN 0036-8075. doi: 10.1126/science.1087361.
[16] Hawoong Jeong, B´alint Tombor, R´eka Albert, Zoltan N Oltvai, and A-L Barab´asi. The large-scale organization of metabolic networks.Nature, 407(6804):651-654, 2000.
[17] Hawoong Jeong, Sean P Mason, A-L Barab´asi, and Zoltan N Oltvai. Lethality and centrality in protein networks.Nature, 411(6833):41-42, 2001.
[18] Markus Kalisch and Peter B¨uhlmann. Estimating high-dimensional directed acyclic graphs with the PC-Algorithm.Journal of Machine Learning Research, 8:613-636, May 2007. ISSN 1532-4435. · Zbl 1222.68229
[19] Jon Kleinberg. The small-world phenomenon: an algorithmic perspective. InACM Symposium on Theory of Computing, pages 163-170. ACM, 2000. · Zbl 1296.05181
[20] Bide Liu, Xiao Gu, Tianbao Huang, Yang Luan, and Xuefei Ding. Identification of tmprss2erg mechanisms in prostate cancer invasiveness: involvement of mmp-9 and plexin b1. Oncology Reports, 37(1):201-208, 2017.
[21] Po-Ling Loh and Peter B¨uhlmann. High-dimensional learning of linear causal networks via inverse covariance estimation.Journal of Machine Learning Research, 15(1):3065-3105, 2014. · Zbl 1318.68148
[22] Dmitry M Malioutov, Jason K Johnson, and Alan S Willsky. Walk-sums and belief propagation in Gaussian graphical models.Journal of Machine Learning Research, 7(Oct): 2031-2064, 2006. · Zbl 1222.68254
[23] Michael Molloy and Bruce Reed. A critical point for random graphs with a given degree sequence.Random Structures & Algorithms, 6(2-3):161-180, 1995. · Zbl 0823.05050
[24] Preetam Nandy, Alain Hauser, and Marloes H Maathuis. High-dimensional consistency in score-based and hybrid structure learning.arXiv preprint arXiv:1507.02608, 2015. · Zbl 1411.62144
[25] Jonas Peters, Joris M Mooij, Dominik Janzing, and Bernhard Sch¨olkopf. Causal discovery with continuous additive noise models.Journal of Machine Learning Research, 15:2009- 2053, 2014. · Zbl 1318.68151
[26] Leonie Ratz, Mark Laible, Lukasz A Kacprzyk, Stephanie M Wittig-Blaich, Yanis Tolstov, Stefan Duensing, Peter Altevogt, Sabine M Klauck, and Holger S¨ultmann. Tmprss2: Erg gene fusion variants induce tgf-βsignaling and epithelial to mesenchymal transition in human prostate cancer cells.Oncotarget, 8(15):25115, 2017.
[27] Pradeep Ravikumar, Martin J Wainwright, Garvesh Raskutti, and Bin Yu.Highdimensional covariance estimation by minimizing l1-penalized log-determinant divergence.Electronic Journal of Statistics, 5:935-980, 2011. · Zbl 1274.62190
[28] Christiane M Robbins, Waibov A Tembe, Angela Baker, Shripad Sinari, Tracy Y Moses, Stephen Beckstrom-Sternberg, James Beckstrom-Sternberg, Michael Barrett, James Long, Arul Chinnaiyan, et al. Copy number and targeted mutational analysis reveals novel somatic events in metastatic prostate tumors.Genome Research, 21(1):47-55, 2011.
[29] Roberta R Ruela-de Sousa, Elmer Hoekstra, A Marije Hoogland, Karla C Souza Queiroz, Maikel P Peppelenbosch, Andrew P Stubbs, Karin Pelizzaro-Rocha, Geert JLH van Leenders, Guido Jenster, Hiroshi Aoyama, et al. Low-molecular-weight protein tyrosine phosphatase predicts prostate cancer outcome by increasing the metastatic potential.European Urology, 69(4):710-719, 2016.
[30] Eric E Schadt, John Lamb, Xia Yang, Jun Zhu, Steve Edwards, Debraj GuhaThakurta, Solveig K Sieberts, Stephanie Monks, Marc Reitman, Chunsheng Zhang, et al. An integrative genomics approach to infer causal associations between gene expression and disease.Nature Genetics, 37(7):710-717, 2005.
[31] Mark Schmidt, Alexandru Niculescu-Mizil, and Kevin Murphy. Learning graphical model structure using l1-regularization paths. InAssociation for the Advancement of Artificial Intelligence (AAAI), volume 7, pages 1278-1283, 2007.
[32] Marco Scutari. Learning Bayesian networks with the bnlearn r package.Journal of Statistical Software, 35(1):1-22, 2010. ISSN 1548-7660. doi: 10.18637/jss.v035.i03.
[33] A. Shojaie, A. Jauhiainen, M. Kallitsis, and G. Michailidis. Inferring Regulatory Networks by Combining Perturbation Screens and Steady State Gene Expression Profiles.PLoS One, 9(2):e82393, 2014.
[34] Ali Shojaie and George Michailidis. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs.Biometrika, 97(3):519-538, 2010. · Zbl 1195.62090
[35] P. Spirtes, C. Glymour, and R. Scheines.Causation, Prediction, and Search. MIT press, 2nd edition, 2000. · Zbl 0806.62001
[36] Chris Stark, Bobby-Joe Breitkreutz, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, and Mike Tyers. Biogrid: a general repository for interaction datasets.Nucleic Acids Research, 34(suppl 1):D535-D539, 2006.
[37] Seth Sullivant, Kelli Talaska, and Jan Draisma. Trek separation for Gaussian graphical models.The Annals of Statistics, pages 1665-1685, 2010. · Zbl 1189.62091
[38] Liang Sun, Jiaju L¨u, Sentai Ding, Dongbin Bi, Kejia Ding, Zhihong Niu, and Ping Liu. Hcrp1 regulates proliferation, invasion, and drug resistance via egfr signaling in prostate cancer.Biomedicine & Pharmacotherapy, 91:202-207, 2017.
[39] Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm.Machine Learning, 65(1):31-78, 2006. · Zbl 1470.68192
[40] Caroline Uhler, Garvesh Raskutti, Peter B¨uhlmann, and Bin Yu. Geometry of the faithfulness assumption in causal inference.The Annals of Statistics, 41(2):436-463, 2013. · Zbl 1267.62068
[41] Arend Voorman, Ali Shojaie, and Daniela Witten. Graph estimation with joint additive models.Biometrika, 101(1):85, 2014. · Zbl 1285.62061
[42] Changwon Yoo, Vesteinn Thorsson, and Gregory F Cooper. Discovery of causal relationships in a gene-regulation pathway from a mixture of experimental and observational DNA microarray data. InPacific Symposium on Biocomputing, volume 7, pages 498-509, 2002.
[43] Jiji Zhang and Peter Spirtes.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.