×

Sublinear-time algorithms for counting star subgraphs via edge sampling. (English) Zbl 1391.68120

Summary: We study the problem of estimating the value of sums of the form \(S_p\triangleq\sum\binom {x_i}{p}\) when one has the ability to sample \(x_i\geq 0\) with probability proportional to its magnitude. When \(p=2\), this problem is equivalent to estimating the selectivity of a self-join query in database systems when one can sample rows randomly. We also study the special case when \(\{x_i\}\) is the degree sequence of a graph, which corresponds to counting the number of \(p\)-stars in a graph when one has the ability to sample edges randomly. Our algorithm for a \((1 \pm \epsilon )\)-multiplicative approximation of \(S_p\) has query and time complexities \(O\left(\frac{m\log\log n}{\epsilon^2S_p^{1/p}}\right)\). Here, \(m=\sum x_i/2\) is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, \(n\) is the number of vertices in the graph and the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when \(\{x_i\}\) is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation. For the graph problem, prior work which assumed the ability to sample only vertices uniformly gave algorithms with matching lower bounds [M. Gonen et al., SIAM J. Discrete Math. 25, No. 3, 1365–1411 (2011; Zbl 1234.68138)]. With the ability to sample edges randomly, we show that one can achieve faster algorithms for approximating the number of star subgraphs, bypassing the lower bounds in this prior work. For example, in the regime where \(S_p\leq n\), and \(p=2\), our upper bound is \(\tilde{O}(n/S_p^{1/2})\), in contrast to their \(\varOmega(n/S_p^{1/3})\) lower bound when no random edge queries are available. In addition, we consider the problem of counting the number of directed paths of length two when the graph is directed. This problem is equivalent to estimating the selectivity of a join query between two distinct tables. We prove that the general version of this problem cannot be solved in sublinear time. However, when the ratio between in-degree and out-degree is bounded – or equivalently, when the ratio between the number of occurrences of values in the two columns being joined is bounded – we give a sublinear time algorithm via a reduction to the undirected case.

MSC:

68W20 Randomized algorithms
05C30 Enumeration in graph theory
05C85 Graph algorithms (graph-theoretic aspects)
68P15 Database theory
68Q25 Analysis of algorithms and problem complexity
68R10 Graph theory (including graph drawing) in computer science

Citations:

Zbl 1234.68138

Software:

QPath
PDFBibTeX XMLCite
Full Text: DOI arXiv Link

References:

[1] Ahn, K.J., Guha, S., McGregor, A.: Graph sketches: sparsification, spanners, and subgraphs. In: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, pp. 5-14 (2012). doi:10.1145/2213556.2213560
[2] Alon, N., Dao, P., Hajirasouliha, I., Hormozdiari, F., Sahinalp, S.C.: Biomolecular network motif counting and discovery by color coding. In: Proceedings 16th International Conference on Intelligent Systems for Molecular Biology (ISMB), Toronto, Canada, July 19-23, 2008, pp. 241-249 (2008). doi:10.1093/bioinformatics/btn163
[3] Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes in limited storage. J. Comput. Syst. Sci. 64(3), 719-747 (2002). doi:10.1006/jcss.2001.1813 · Zbl 1051.68136 · doi:10.1006/jcss.2001.1813
[4] Alon, N., Gutner, S.: Balanced hashing, color coding and approximate counting. In: Parameterized and Exact Computation, 4th International Workshop, IWPEC 2009, Copenhagen, Denmark, September 10-11, 2009, Revised Selected Papers, pp. 1-16 (2009). doi:10.1007/978-3-642-11269-0_1 · Zbl 1273.68270
[5] Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137-147 (1999). doi:10.1006/jcss.1997.1545 · Zbl 0938.68153 · doi:10.1006/jcss.1997.1545
[6] Alon, N., Yuster, R., Zwick, U.: Finding and counting given length cycles. Algorithmica 17(3), 209-223 (1997). doi:10.1007/BF02523189 · Zbl 0865.68093 · doi:10.1007/BF02523189
[7] Amini, O., Fomin, F.V., Saurabh, S.: Counting subgraphs via homomorphisms. SIAM J. Discrete Math. 26(2), 695-717 (2012). doi:10.1137/100789403 · Zbl 1248.05122 · doi:10.1137/100789403
[8] Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Reductions in streaming algorithms, with an application to counting triangles in graphs. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA., pp. 623-632 (2002). http://dl.acm.org/citation.cfm?id=545381.545464 · Zbl 1094.68601
[9] Batu, T., Berenbrink, P., Sohler, C.: A sublinear-time approximation scheme for bin packing. Theor. Comput. Sci. 410(47-49), 5082-5092 (2009). doi:10.1016/j.tcs.2009.08.006 · Zbl 1194.68254 · doi:10.1016/j.tcs.2009.08.006
[10] Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient semi-streaming algorithms for local triangle counting in massive graphs. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, pp. 16-24 (2008). doi:10.1145/1401890.1401898 · Zbl 1194.68254
[11] Bhuvanagiri, L., Ganguly, S., Kesh, D., Saha, C.: Simpler algorithm for estimating frequency moments of data streams. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, pp. 708-713. ACM (2006) · Zbl 1192.94080
[12] Blais, E., Brody, J., Matulef, K.: Property testing lower bounds via communication complexity. Comput. Complex. 21(2), 311-358 (2012). doi:10.1007/s00037-012-0040-x · Zbl 1253.68142 · doi:10.1007/s00037-012-0040-x
[13] Buriol, L.S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., Sohler, C.: Counting triangles in data streams. In: Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 26-28, 2006, Chicago, Illinois, USA, pp. 253-262 (2006). doi:10.1145/1142351.1142388
[14] Canonne, C.L., Rubinfeld, R.: Testing probability distributions underlying aggregated data. In: Automata, Languages, and Programming—41st International Colloquium, ICALP 2014, Copenhagen, Denmark, July 8-11, 2014, Proceedings, Part I, pp. 283-295 (2014). doi:10.1007/978-3-662-43948-7_24 · Zbl 1409.68326
[15] Coppersmith, D., Kumar, R.: An improved data stream algorithm for frequency moments. In: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11-14, 2004, pp. 151-156 (2004). http://dl.acm.org/citation.cfm?id=982792.982815 · Zbl 1317.68266
[16] Duke, R.A., Lefmann, H., Rödl, V.: A fast approximation algorithm for computing the frequencies of subgraphs in a given graph. SIAM J. Comput. 24(3), 598-620 (1995). doi:10.1137/S0097539793247634 · Zbl 0831.68049 · doi:10.1137/S0097539793247634
[17] Eden, T., Levi, A., Ron, D., Seshadhri, C.: Approximately counting triangles in sublinear time. In: IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pp. 614-633 (2015). doi:10.1109/FOCS.2015.44 · Zbl 1380.68445
[18] Feige, U.: On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM J. Comput. 35(4), 964-984 (2006). doi:10.1137/S0097539704447304 · Zbl 1098.60026 · doi:10.1137/S0097539704447304
[19] Flum, J., Grohe, M.: The parameterized complexity of counting problems. SIAM J. Comput. 33(4), 892-922 (2004). doi:10.1137/S0097539703427203 · Zbl 1105.68042 · doi:10.1137/S0097539703427203
[20] Fomin, F.V., Lokshtanov, D., Raman, V., Saurabh, S., Rao, B.V.R.: Faster algorithms for finding and counting subgraphs. J. Comput. Syst. Sci. 78(3), 698-706 (2012). doi:10.1016/j.jcss.2011.10.001 · Zbl 1246.05149 · doi:10.1016/j.jcss.2011.10.001
[21] Getoor, L., Taskar, B., Koller, D.: Selectivity estimation using probabilistic models. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, USA, May 21-24, 2001, pp. 461-472 (2001). doi:10.1145/375663.375727 · Zbl 1253.68380
[22] Goldreich, O.: On the communication complexity methodology for proving lower bounds on the query complexity of property testing. Electron. Colloq. Comput. Complex. (ECCC) 20, 73 (2013). http://eccc.hpi-web.de/report/2013/073 · Zbl 07578384
[23] Goldreich, O., Ron, D.: Approximating average parameters of graphs. Random Struct. Algorithms 32(4), 473-493 (2008). doi:10.1002/rsa.20203 · Zbl 1155.05057 · doi:10.1002/rsa.20203
[24] Gonen, M., Ron, D., Shavitt, Y.: Counting stars and other small subgraphs in sublinear-time. SIAM J. Discrete Math. 25(3), 1365-1411 (2011). doi:10.1137/100783066 · Zbl 1234.68138 · doi:10.1137/100783066
[25] Gonen, M., Shavitt, Y.: Approximating the number of network motifs. Internet Math. 6(3), 349-372 (2009). doi:10.1080/15427951.2009.10390645 · Zbl 1239.68056 · doi:10.1080/15427951.2009.10390645
[26] Grochow, J.A., Kellis, M.: Network motif discovery using subgraph enumeration and symmetry-breaking. In: Research in Computational Molecular Biology, 11th Annual International Conference, RECOMB 2007, Oakland, CA, USA, April 21-25, 2007, Proceedings, pp. 92-106 (2007). doi:10.1007/978-3-540-71681-5_7
[27] Haas, P.J., Ilyas, I.F., Lohman, G.M., Markl, V.: Discovering and exploiting statistical properties for query optimization in relational databases: a survey. Stat. Anal. Data Min. 1(4), 223-250 (2009). doi:10.1002/sam.10016 · Zbl 07260197 · doi:10.1002/sam.10016
[28] Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci. 52(3), 550-569 (1996). doi:10.1006/jcss.1996.0041 · Zbl 0858.68034 · doi:10.1006/jcss.1996.0041
[29] Hales, D., Arteconi, S.: Motifs in evolving cooperative networks look like protein structure networks. NHM 3(2), 239-249 (2008). doi:10.3934/nhm.2008.3.239 · Zbl 1142.68312 · doi:10.3934/nhm.2008.3.239
[30] Hassidim, A., Kelner, J.A., Nguyen, H.N., Onak, K.: Local graph partitions for approximation and testing. In: 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25-27, 2009, Atlanta, Georgia, USA, pp. 22-31 (2009). doi:10.1109/FOCS.2009.77 · Zbl 1292.68126
[31] Hormozdiari, F., Berenbrink, P., Przulj, N., Sahinalp, S.C.: Not all scale-free networks are born equal: The role of the seed graph in PPI network evolution. PLoS Comput Biol (2007). doi:10.1371/journal.pcbi.0030118 · doi:10.1371/journal.pcbi.0030118
[32] Indyk, P., Woodruff, D.P.: Optimal approximations of the frequency moments of data streams. In: Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22-24, 2005, pp. 202-208 (2005). doi:10.1145/1060590.1060621 · Zbl 1192.68364
[33] Kane, D.M., Mehlhorn, K., Sauerwald, T., Sun, H.: Counting arbitrary subgraphs in data streams. In: Automata, Languages, and Programming—39th International Colloquium, ICALP 2012, Warwick, UK, July 9-13, 2012, Proceedings, Part II, pp. 598-609 (2012). doi:10.1007/978-3-642-31585-5_53 · Zbl 1367.68213
[34] Kashtan, N., Itzkovitz, S., Milo, R., Alon, U.: Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11), 1746-1758 (2004). doi:10.1093/bioinformatics/bth163 · doi:10.1093/bioinformatics/bth163
[35] Kolountzakis, M.N., Miller, G.L., Peng, R., Tsourakakis, C.E.: Efficient triangle counting in large graphs via degree-based vertex partitioning. Internet Math. 8(1-2), 161-185 (2012). doi:10.1080/15427951.2012.625260 · Zbl 1245.05120 · doi:10.1080/15427951.2012.625260
[36] Lee, J., Kim, D., Chung, C.: Multi-dimensional selectivity estimation using compressed histogram information. In: SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA., pp. 205-214 (1999). doi:10.1145/304182.304200 · Zbl 1248.05122
[37] Manjunath, M., Mehlhorn, K., Panagiotou, K., Sun, H.: Approximate counting of cycles in streams. In: Algorithms—ESA 2011—19th Annual European Symposium, Saarbrücken, Germany, September 5-9, 2011. Proceedings, pp. 677-688 (2011). doi:10.1007/978-3-642-23719-5_57 · Zbl 1346.68257
[38] Markl, V., Haas, P.J., Kutsch, M., Megiddo, N., Srivastava, U., Tran, T.M.: Consistent selectivity estimation via maximum entropy. VLDB J. 16(1), 55-76 (2007). doi:10.1007/s00778-006-0030-1 · doi:10.1007/s00778-006-0030-1
[39] McGregor, A., Vorotnikova, S., Vu, H.T.: Better algorithms for counting triangles in data streams. In: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26-July 01, 2016, pp. 401-411 (2016). doi:10.1145/2902251.2902283 · Zbl 1155.05057
[40] Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824-827 (2002) · doi:10.1126/science.298.5594.824
[41] Motwani, R., Panigrahy, R., Xu, Y.: Estimating sum by weighted sampling. In: Automata, Languages and Programming, 34th International Colloquium, ICALP 2007, Wroclaw, Poland, July 9-13, 2007, Proceedings, pp. 53-64 (2007). doi:10.1007/978-3-540-73420-8_7 · Zbl 1171.68866
[42] Motwani, R., Raghavan, P.: Randomized Algorithms. Chapman & Hall/CRC, London (2010) · Zbl 0849.68039
[43] Nguyen, H.N., Onak, K.: Constant-time approximation algorithms via local improvements. In: 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, October 25-28, 2008, Philadelphia, PA, USA, pp. 327-336 (2008). doi:10.1109/FOCS.2008.81
[44] Onak, K., Ron, D., Rosen, M., Rubinfeld, R.: A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17-19, 2012, pp. 1123-1131 (2012).http://portal.acm.org/citation.cfm?id=2095204&CFID=63838676&CFTOKEN=79617016 · Zbl 1422.68310
[45] Parnas, M., Ron, D.: Approximating the minimum vertex cover in sublinear time and a connection to distributed algorithms. Theor. Comput. Sci. 381(1-3), 183-196 (2007). doi:10.1016/j.tcs.2007.04.040 · Zbl 1188.68358 · doi:10.1016/j.tcs.2007.04.040
[46] Poosala, V., Ioannidis, Y.E.: Selectivity estimation without the attribute value independence assumption. In: VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pp. 486-495 (1997). http://www.vldb.org/conf/1997/P486.PDF
[47] Przulj, N., Corneil, D.G., Jurisica, I.: Modeling interactome: scale-free or geometric? Bioinformatics 20(18), 3508-3515 (2004). doi:10.1093/bioinformatics/bth436 · doi:10.1093/bioinformatics/bth436
[48] Scott, J., Ideker, T., Karp, R.M., Sharan, R.: Efficient algorithms for detecting signaling pathways in protein interaction networks. J. Comput. Biol. 13(2), 133-144 (2006). doi:10.1089/cmb.2006.13.133 · Zbl 1119.92331 · doi:10.1089/cmb.2006.13.133
[49] Shlomi, T., Segal, D., Ruppin, E., Sharan, R.: Qpath: a method for querying pathways in a protein-protein interaction network. BMC Bioinform. 7, 199 (2006). doi:10.1186/1471-2105-7-199 · doi:10.1186/1471-2105-7-199
[50] Swami, A.N., Schiefer, K.B.: On the estimation of join result sizes. In: Advances in Database Technology—EDBT’94. 4th International Conference on Extending Database Technology, Cambridge, United Kingdom, March 28-31, 1994, Proceedings, pp. 287-300 (1994). doi:10.1007/3-540-57818-8_58
[51] Wernicke, S.: Efficient detection of network motifs. IEEE/ACM Trans. Comput. Biology Bioinform. 3(4), 347-359 (2006). doi:10.1109/TCBB.2006.51 · doi:10.1109/TCBB.2006.51
[52] Williams, R.: Finding paths of length k in \[\text{ o }{}^* (2^{\text{ k }})\] o∗(2k) time. Inf. Process. Lett. 109(6), 315-318 (2009). doi:10.1016/j.ipl.2008.11.004 · Zbl 1191.68857 · doi:10.1016/j.ipl.2008.11.004
[53] Williams, V.V., Williams, R.: Finding, minimizing, and counting weighted subgraphs. SIAM J. Comput. 42(3), 831-854 (2013). doi:10.1137/09076619X · Zbl 1275.68080 · doi:10.1137/09076619X
[54] Yoshida, Y., Yamamoto, M., Ito, H.: Improved constant-time approximation algorithms for maximum matchings and other optimization problems. SIAM J. Comput. 41(4), 1074-1093 (2012). doi:10.1137/110828691 · Zbl 1253.68380 · doi:10.1137/110828691
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.