Tracking join and self-join sizes in limited storage.

*(English)*Zbl 1051.68136Summary: This paper presents algorithms for tracking (approximate) join and self-join sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and self-join sizes without an expensive recomputation from the base data, and without the large space overhead required to maintain such sizes exactly. Query optimizers rely on fast, high-quality estimates of join sizes in order to select between various join plans, and estimates of self-join sizes are used to indicate the degree of skew in the data. For self-joins, we consider two approaches proposed by N. Alon, Y. Matias and M. Szegedy [J. Comput. Syst. Sci. 58, 137–147 (1999; Zbl 0938.68153)], which we denote tug-of-war and sample-count. We present fast algorithms for implementing these approaches, and extensions to handle deletions as well as insertions. We also report on the first experimental study of the two approaches, on a range of synthetic and real-world data sets. Our study shows that tug-of-war provides more accurate estimates for a given storage limit than sample-count, which in turn is far more accurate than a standard sampling-based approach. For example, tug-of-war needed only 4-256 memory words, depending on the data set, in order to estimate the self-join size to within a 15% relative error; on average, this is over 4 times (50 times) fewer memory words than needed by sample-count (standard sampling, resp.) to obtain a similar accuracy. For joins, we propose schemes based on maintaining a small signature of each relation independently, such that join sizes can be quickly and accurately estimated between any pair of relations using only these signatures. We show that taking random samples for join signatures can lead to an inaccurate estimation unless the sample size is quite large; moreover, we show that no other signature scheme can significantly improve upon sampling without further assumptions. These negative results are shown to hold even in the presence of sanity bounds. On the other hand, we present a fast join signature scheme based on tug-of-war signatures that provides guarantees on join size estimation as a function of the self-join sizes of the joining relations; this scheme can significantly improve upon the sampling scheme.

##### MSC:

68W05 | Nonnumerical algorithms |

PDF
BibTeX
XML
Cite

\textit{N. Alon} et al., J. Comput. Syst. Sci. 64, No. 3, 719--747 (2002; Zbl 1051.68136)

Full Text:
DOI

##### References:

[1] | Acharya, S.; Gibbons, P.B.; Poosala, V., Congressional samples for approximate answering of group-by queries, Proc. ACM SIGMOD international conf. on management of data, (2000), p. 487-498 |

[2] | Acharya, S.; Gibbons, P.B.; Poosala, V.; Ramaswamy, S., Join synopses for approximate query answering, Proc. ACM SIGMOD international conf. on management of data, (1999), p. 275-286 |

[3] | Alon, N.; Matias, Y.; Szegedy, M., The space complexity of approximating the frequency moments, J. comput. system sci., 58, 137-147, (1999) · Zbl 0938.68153 |

[4] | Barbará, D.; DuMouchel, W.; Faloutsos, C.; Haas, P.J.; Hellerstein, J.M.; Ioannidis, Y.; Jagadish, H.V.; Johnson, T.; Ng, R.; Poosala, V.; Ross, K.A.; Sevcik, K.C., The new jersey data reduction report, Bull. tech. comm. data engrg., 20, 3-45, (1997) |

[5] | Charikar, M.; Chaudhuri, S.; Motwani, R.; Narasayya, V., Towards estimation error guarantees for distinct values, Proc. 19th ACM symp. on principles of database systems, (2000), p. 268-279 |

[6] | Chaudhuri, S.; Das, G.; Narasayya, V., A robust, optimization-based approach for approximate answering of aggregate queries, Proc. ACM SIGMOD international conf. on management of data, (2001), p. 295-306 |

[7] | Chakrabarti, K.; Garofalakis, M.; Rastogi, R.; Shim, K., Approximate query processing using wavelets, Proc. 26th international conf. on very large data bases, (2000), p. 111-122 |

[8] | Ganguly, S.; Gibbons, P.B.; Matias, Y.; Silberschatz, A., Bifocal sampling for skew-resistant join size estimation, Proc. ACM SIGMOD international conf. on management of data, (1996), p. 271-281 |

[9] | Gibbons, P.B., Distinct sampling for highly-accurate answers to distinct values queries and event reports, Proc. 27th international conf. on very large data bases, (2001), p. 541-550 |

[10] | Gilbert, A.C.; Kotidis, Y.; Muthukrishnan, S.; Strauss, M.J., Surfing wavelets on streams: one-pass summaries for approximate aggregate queries, Proc. 27th international conf. on very large data bases, (2001), p. 79-88 |

[11] | Ganti, V.; Lee, M.-L.; Ramakrishnan, R., ICICLES: self-tuning samples for approximate query answering, Proc. 26th international conf. on very large data bases, (2000), p. 176-187 |

[12] | Gibbons, P.B.; Matias, Y., New sampling-based summary statistics for improving approximate query answers, Proc. ACM SIGMOD international conf. on management of data, (1998), p. 331-342 |

[13] | Gibbons, P.B.; Matias, Y., Synopsis data structures for massive data sets, (), 39-70 · Zbl 0952.68040 |

[14] | Gibbons, P.B.; Matias, Y.; Poosala, V., Fast incremental maintenance of approximate histograms, Proc. 23rd international conf. on very large data bases, (1997), p. 466-475 |

[15] | Good, I.J., Surprise indexes and p-values, J. statist. comput. simul., 32, 90-92, (1989) |

[16] | Haas, P.; Hellerstein, J., Ripple joins for online aggregation, Proc. ACM SIGMOD international conf. on management of data, (1999), p. 287-298 |

[17] | Hellerstein, J.M.; Haas, P.J.; Wang, H.J., Online aggregation, Proc. ACM SIGMOD international conf. on management of data, (1997), p. 171-182 |

[18] | Haas, P.J.; Naughton, J.F.; Seshadri, S.; Swami, A.N., Fixed-precision estimation of join selectivity, Proc. 12th ACM symp. on principles of database systems, (1993), p. 190-201 |

[19] | Haas, P.J.; Naughton, J.F.; Seshadri, S.; Stokes, L., Sampling-based estimation of the number of distinct values of an attribute, Proc. 21st international conf. on very large data bases, (1995), p. 311-322 |

[20] | Hou, W.-C.; Özsoyoglu, G.; Taneja, B.K., Statistical estimators for relational algebra expressions, Proc. 7th ACM symp. on principles of database systems, (1988), p. 276-287 |

[21] | Ioannidis, Y.E.; Poosala, V., Balancing histogram optimality and practicality for query result size estimation, Proc. ACM SIGMOD international conf. on management of data, (1995), p. 233-244 |

[22] | Ioannidis, Y.; Poosala, V., Histogram-based techniques for approximating set-valued query-answers, Proc. 25th international conf. on very large databases, (1999), p. 174-185 |

[23] | Lazaridis, I.; Mehrotra, S., Progressive approximate aggregate queries with a multi-resolution tree structure, Proc. ACM SIGMOD international conf. on management of data, (2001), p. 401-412 |

[24] | Lipton, R.J.; Naughton, J.F., Query size estimation by adaptive sampling, J. comput. system sci., 51, 18-25, (1995) · Zbl 0831.68035 |

[25] | Lipton, R.J.; Naughton, J.F.; Schneider, D.A., Practical selectivity estimation through adaptive sampling, Proc. ACM SIGMOD international conf. on management of data, (1990), p. 1-12 |

[26] | Matias, Y.; Vitter, J.S.; Wang, M., Dynamic maintenance of wavelet-based histograms, Proc. 26th international conf. on very large data bases, (2000), p. 101-110 |

[27] | Olken, F., Random sampling from databases, (1993), University of California-BerkeleyComputer Science |

[28] | Poosala, V., Histogram-based estimation techniques in databases, (1997), University of Wisconsin-Madison |

[29] | Vitter, J.S., Random sampling with a reservoir, ACM trans. math. software, 11, 37-57, (1985) · Zbl 0562.68028 |

[30] | Vrbsky, S.V.; Liu, J.W.S., Approximate—A query processor that produces monotonically improving approximate answers, IEEE trans. knowledge data engrg., 5, 1056-1068, (1993) |

[31] | Vitter, J.S.; Wang, M., Approximate computation of multidimensional aggregates of sparse data using wavelets, Proc. ACM SIGMOD international conf. on management of data, (1999), p. 193-204 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.