×

On the distortion of locality sensitive hashing. (English) Zbl 1421.68022

Summary: Given a notion of pairwise similarity between objects, locality sensitive hashing (LSH) aims to construct a hash function family over the universe of objects such that the probability two objects hash to the same value is their similarity. LSH is a powerful algorithmic tool for large scale applications and much work has been done to understand LSHable similarities, i.e., similarities that admit an LSH. In this paper we focus on similarities that are provably non-LSHable and propose a notion of distortion to capture the approximation of such a similarity by an LSHable similarity. We consider several well-known non-LSHable similarities and show tight upper and lower bounds on their distortion.

MSC:

68P05 Data structures
68P10 Searching and sorting
68Q25 Analysis of algorithms and problem complexity
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] M. R. Anderberg, {Cluster Analysis for Applications}, Academic Press, New York, 1973. · Zbl 0299.62029
[2] A. Andoni and P. Indyk, {Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions}, in Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, IEEE, Washington, DC, 2006, pp. 459-468.
[3] A. Andoni and P. Indyk, {Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions}, Comm. ACM, 51 (2008), pp. 117-122.
[4] A. Andoni, T. Laarhoven, I. Razenshteyn, and E. Waingarten, {Optimal Hashing-Based Time-Space Trade-Offs for Approximate Near Neighbors}, in Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’17, SIAM, Philadelphia, 2017, pp. 47-66, . · Zbl 1410.68091
[5] I. Avcibaş, M. Kharrazi, N. Memon, and B. Sankur, {Image steganalysis with binary similarity measures}, EURASIP J. Adv. Signal Process., 2005 (2005), pp. 2749-2757. · Zbl 1102.94013
[6] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, {Optimal decoding of linear codes for minimizing symbol error rate}, IEEE Trans. Inform. Theory, 20 (1974), pp. 284-287. · Zbl 0322.94005
[7] C. H. Bennett and P. W. Shor, {Quantum information theory}, IEEE Trans. Inform. Theory, 44 (1998), pp. 2724-2742. · Zbl 1099.81501
[8] J. Braun, {Die Vegetationsverhältnisse der Schneestufe in den Rätisch-Lepontischen Alpen: Ein Bild des Pflanzenlebens an seinen äussersten Grenzen}, Schweizerische Naturforschende Gesellschaft, 1913.
[9] A. Z. Broder, {On the resemblance and containment of documents}, in Proceedings of the Compression and Complexity of Sequences, IEEE, Washington, DC, 1997, pp. 21-29.
[10] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, {Min-wise independent permutations}, J. Comput. System Sci., 60 (2000), pp. 630-659. · Zbl 0958.68047
[11] R. Burke, {Hybrid recommender systems: Survey and experiments}, User Model. User-Adapt. Interact., 12 (2002), pp. 331-370. · Zbl 1030.68607
[12] J. Chabalier, J. Mosser, and A. Burgun, {A transversal approach to predict gene product networks from ontology-based similarity}, BMC Bioinformatics, 8 (2007), 235.
[13] M. Charikar, {Similarity estimation techniques from rounding algorithms}, in Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, ACM, New York, 2002, pp. 380-388. · Zbl 1192.68226
[14] F. Chierichetti and R. Kumar, {LSH-preserving functions and their applications}, J. ACM, 62 (2015), 33. · Zbl 1421.68020
[15] F. Chierichetti, R. Kumar, and M. Mahdian, {The complexity of LSH feasibility}, Theoret. Comput. Sci., 530 (2014), pp. 89-101. · Zbl 1359.68122
[16] M. M. Deza and E. Deza, {Encyclopedia of Distances}, Springer-Verlag, Berlin, 2009. · Zbl 1167.51001
[17] M. M. Deza and M. Laurent, {Geometry of Cuts and Metrics}, 1st ed., Springer, Heidelberg, 2009. · Zbl 0885.52001
[18] L. R. Dice, {Measures of the amount of ecologic association between species}, Ecology, 26 (1945), pp. 297-302.
[19] S. J. Dixon, N. Heinrich, M. Holmboe, M. L. Schaefer, R. R. Reed, J. Trevejo, and R. G. Brereton, {Use of cluster separation indices and the influence of outliers: Application of two new separation indices, the modified silhouette index and the overlap coefficient to simulated data and mouse urine metabolomic profiles}, J. Chemom., 23 (2009), pp. 19-31.
[20] W. H. Equitz and T. M. Cover, {Successive refinement of information}, IEEE Trans. Inform. Theory, 37 (1991), pp. 269-275. · Zbl 0721.94005
[21] P. Erdös, C. Ko, and R. Rado, {Intersection theorems for systems of finite sets}, Quart. J. Math. Oxford (2), 12 (1961), pp. 313-320. · Zbl 0100.01902
[22] P. Frankl and Z. Füredi, {Non-trivial intersecting families}, J. Combin. Theorey Ser. A, 41 (1986), pp. 150-153. · Zbl 0583.05002
[23] G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum, {Information retrieval using a singular value decomposition model of latent semantic structure}, in Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, 1988, pp. 465-480.
[24] A. Gionis, P. Indyk, R. Motwani, et al., {Similarity search in high dimensions via hashing}, in Proceedings of the 25th International Conference on Very Large Data Bases, Morgan Kaufmann, San Francisco, 1999, pp. 518-529.
[25] A. Hilton and E. Milner, {Some intersection theorems for systems of finite sets}, Quart. J. Math. Oxford Ser. (2), 18 (1967), pp. 369-384. · Zbl 0168.26205
[26] P. Indyk and J. Matousek, {Low-Distortion Embeddings of Finite Metric Spaces}, Handbook of Discrete and Computational Geometry, CRC Press, Boca Raton, FL, 2004, pp. 177-196.
[27] P. Indyk and R. Motwani, {Approximate nearest neighbors: Towards removing the curse of dimensionality}, in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ACM, New York, 1998, pp. 604-613. · Zbl 1029.68541
[28] P. Indyk, R. Motwani, P. Raghavan, and S. Vempala, {Locality-preserving hashing in multidimensional spaces}, in Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, ACM, New York, 1997, pp. 618-625. · Zbl 0963.68045
[29] D. H. Knight, {A Phytosociological Analysis of Species-Rich Tropical Forest on Barro Colorado Island, Panama}, Ecol. Monogr., 45 (1975), pp. 259-284.
[30] P. Koleff, K. J. Gaston, and J. J. Lennon, {Measuring beta diversity for presence-absence data}, Jo. Animal Ecol., 72 (2003), pp. 367-382.
[31] J. Looman and J. Campbell, {Adaptation of Sorensen’s \(k (1948)\) for estimating unit affinities in prairie vegetation}, Ecology, 41 (1960), pp. 409-416.
[32] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, {Multi-probe LSH: Efficient indexing for high-dimensional similarity search}, in Proceedings of the 33rd International Conference on Very Large Data Bases, 2007, pp. 950-961.
[33] E. Manders, F. Verbeek, and J. Aten, {Measurement of co-localization of objects in dual-colour confocal images}, J. Microsc., 169 (1993), pp. 375-382.
[34] R. Mihalcea, C. Corley, and C. Strapparava, {Corpus-based and knowledge-based measures of text semantic similarity}, in Proceedings of the 21st National Conference on Artificial Intelligence, 2006, pp. 775-780.
[35] R. Motwani, A. Naor, and R. Panigrahy, {Lower bounds on locality sensitive hashing}, SIAM J. Discret. Math., 21 (2007), pp. 930-935, . · Zbl 1158.68012
[36] R. O’Donnell, Y. Wu, and Y. Zhou, {Optimal lower bounds for locality-sensitive hashing (except when q is tiny)}, ACM Trans. Comput. Theory, 6 (2014), 5, .
[37] M. Poore, {The use of phytosociological methods in ecological investigations: I. The Braun-Blanquet system}, J. Ecol., 43 (1955), pp. 226-244.
[38] D. J. Rogers and T. T. Tanimoto, {A computer program for classifying plants}, Science, 132 (1960), pp. 1115-1118.
[39] P. Rychlỳ, {A lexicographer-friendly association score}, in Proceedings of the 2nd Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN, 2008, pp. 6-9.
[40] G. Salton, {Developments in automatic text retrieval}, Science, 253 (1991), pp. 974-980.
[41] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, {Item-based collaborative filtering recommendation algorithms}, in Proceedings of the 10th International Conference on World Wide Web, 2001, ACM, New York, pp. 285-295.
[42] V. Satuluri and S. Parthasarathy, {Bayesian locality sensitive hashing for fast similarity search}, Proceedings of the VLDB Endowment, 5 (2012), pp. 430-441.
[43] A. Shmida and M. V. Wilson, {Biological determinants of species diversity}, J. Biogeogr., 12 (1985), pp. 1-20.
[44] P. Sneath and R. Johnson, {The influence on numerical taxonomic similarities of errors in microbiological tests}, J. Gen. Microbiol., 72 (1972), pp. 377-392.
[45] P. H. A. Sneath and R. R. Sokal, {Numerical Taxonomy: The Principles and Practice of Numerical Classification}, W. H. Freeman, San Francisco, 1973. · Zbl 0285.92001
[46] T. Sørensen, {A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons}, Biol. Skr., 5 (1948), pp. 1-34.
[47] J. Wang, H. T. Shen, J. Song, and J. Ji, {Hashing for Similarity Search: A Survey}, preprint, , 2014.
[48] R. H. Whittaker, {Evolution and measurement of species diversity}, Taxon, (1972), pp. 213-251.
[49] S. Williams, M. Goodfellow, G. Alderson, E. Wellington, P. Sneath, and M. Sackin, {Numerical classification of Streptomyces and related genera}, Journal of General Microbiology, 129 (1983), pp. 1743-1813.
[50] S. M. Wong, W. Ziarko, and P. C. Wong, {Generalized vector spaces model in information retrieval}, in Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, 1985, pp. 18-25.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.