zbMATH — the first resource for mathematics

Improving MinHash via the containment index with applications to metagenomic analysis. (English) Zbl 1428.68276
Summary: MinHash is a probabilistic method for estimating the similarity of two sets in terms of their Jaccard index, defined as the ratio of the size of their intersection to their union. We demonstrate that this method performs best when the sets under consideration are of similar size and the performance degrades considerably when the sets are of very different size. We introduce a new and efficient approach, called the containment MinHash approach, that is more suitable for estimating the Jaccard index of sets of very different size. We accomplish this by leveraging another probabilistic method (in particular, Bloom filters) for fast membership queries. We derive bounds on the probability of estimate errors for the containment MinHash approach and show it significantly improves upon the classical MinHash approach. We also show significant improvements in terms of time and space complexity. As an application, we use this method to detect the presence/absence of organisms in a metagenomic data set, showing that it can detect the presence of very small, low abundance microorganisms.
68T10 Pattern recognition, speech recognition
68P05 Data structures
92D10 Genetics and epigenetics
Full Text: DOI
[1] Bender, M. A.; Farach-Colton, M.; Johnson, R.; Kraner, R.; Kuszmaul, B. C.; Medjedovic, D.; Montes, P.; Shetty, P.; Spillane, R. P.; Zadok, E., Don’t thrash: how to cache your hash on flash, Proc. VLDB Endow., 5, 11, 1627-1637 (2012)
[2] Bloom, B. H., Space/time trade-offs in hash coding with allowable errors, Commun. ACM, 13, 7, 422-426 (1970) · Zbl 0195.47003
[3] Broder, A. Z., On the resemblance and containment of documents, Proceedings of the Compression and Complexity of Sequences 1997. Proceedings, 21-29 (1997), IEEE
[4] Broder, A. Z.; Charikar, M.; Frieze, A. M.; Mitzenmacher, M., Min-wise independent permutations, J. Comput. Syst. Sci., 60, 3, 630-659 (2000) · Zbl 0958.68047
[5] Brown, T. C.; Irber, L., Sourmash: a library for minhash sketching of dna, J. Open Source Softw., 1, 5 (2016)
[6] Chikhi, R.; Rizk, G., Space-efficient and exact De Bruijn graph representation based on a bloom filter, Proceedings of the International Workshop on Algorithms in Bioinformatics, 236-248 (2012), Springer
[7] Chu, J.; Sadeghi, S.; Raymond, A.; Jackman, S. D.; Nip, K. M.; Mar, R.; Mohamadi, H.; Butterfield, Y. S.; Robertson, A. G.; Birol, I., Biobloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters, Bioinformatics, 30, 23, 3402-3404 (2014)
[8] Cohen, E.; Datar, M.; Fujiwara, S.; Gionis, A.; Indyk, P.; Motwani, R.; Ullman, J. D.; Yang, C., Finding interesting associations without support pruning, IEEE Trans. Knowl. Data Eng., 13, 1, 64-78 (2001)
[9] Das, A. S.; Datar, M.; Garg, A.; Rajaram, S., Google news personalization: scalable online collaborative filtering, Proceedings of the 16th International Conference on World Wide Web, 271-280 (2007), ACM
[10] Fan, B.; Andersen, D. G.; Kaminsky, M.; Mitzenmacher, M. D., Cuckoo filter: practically better than bloom, Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies, 75-88 (2014), ACM
[11] Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F., Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. aofa: analysis of algorithms, Discrete Mathematics and Theoretical Computer Science, 137-156 (2007) · Zbl 1192.68959
[12] Heo, Y.; Wu, X. L.; Chen, D.; Ma, J.; Hwu, W. M., Bless: bloom filter-based error correction solution for high-throughput sequencing reads, Bioinformatics, 30, 10, 1354-1362 (2014)
[13] Howe, A. C.; Jansson, J. K.; Malfatti, S. A.; Tringe, S. G.; Tiedje, J. M.; Brown, C. T., Tackling soil diversity with the assembly of large, complex metagenomes, Proc. Natl. Acad. Sci., 111, 13, 4904-4909 (2014)
[14] Jarccard, P., Nouvelles recherches sur la distribution floral, Bull. Soc. Vaud. Sc. Nat., 44, 163, 223-270 (1908)
[15] McElroy, K. E.; Luciani, F.; Thomas, T., Gemsim: general, error-model based simulator of next-generation sequencing data, BMC Genomics, 13, 1, 74 (2012)
[16] Melsted, P.; Pritchard, J. K., Efficient counting of k-mers in dna sequences using a bloom filter, BMC Bioinformatics, 12, 1, 333 (2011)
[17] Mitzenmacher, M.; Upfal, E., Probability and Computing: Randomized Algorithms and Probabilistic Analysis (2005), Cambridge University Press · Zbl 1092.60001
[18] Ondov, B. D.; Treangen, T. J.; Melsted, P.; Mallonee, A. B.; Bergman, N. H.; Koren, S.; Phillippy, A. M., Mash: fast genome and metagenome distance estimation using minhash, Genome Biology, 17, 1, 132 (2016)
[19] Pell, J.; Hintze, A.; Canino-Koning, R.; Howe, A.; Tiedje, J. M.; Brown, C. T., Scaling metagenome sequence assembly with probabilistic De Bruijn graphs, Proc. Natl. Acad. Sci., 109, 33, 13272-13277 (2012) · Zbl 1256.68052
[20] Pugh, W., Skip lists: a probabilistic alternative to balanced trees, Commun. ACM, 33, 6, 668-676 (1990)
[21] Solomon, B.; Kingsford, C., Fast search of thousands of short-read sequencing experiments, Nat. Biotechnol., 34, 3, 300 (2016)
[22] Stranneheim, H.; Käller, M.; Allander, T.; Andersson, B.; Arvestad, L.; Lundeberg, J., Classification of dna sequences using bloom filters, Bioinformatics, 26, 13, 1595-1600 (2010)
[23] Wheeler, D. L.; Barrett, T.; Benson, D. A.; Bryant, S. H.; Canese, K.; Chetvernin, V.; Church, D. M.; Cuccio, M. D.; Edgar, R.; Federhen, S., Database resources of the national center for biotechnology information, Nucleic Acids Res., 36, D13-D21 (2007)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.